At this year’s SIGdial conference, researchers from the Technical University of Munich’s Department of Informatics published the following paper –
Evaluating Natural Language Understanding Services for Conversational Question Answering Systems
Currently, there is no established way to evaluate different Natural Language Understanding (NLU) services. One of the research team’s primary goals was to define a way to compare these services, enabling users to make more informed decisions about which service best fits their use case.
The team compared the performance of four services – LUIS, Watson, API.ai, and RASA – by provisioning each service with large data sets of questions. The data was gathered from multiple sources: a production Telegram chatbot and two StackExchange platforms, Ask Ubuntu and Web Applications. All of the services were provisioned and trained with exactly the same data – which can be found here.
Note: Other NLU services, such as Amazon Lex and wit.ai, were excluded from the comparison because they do not currently offer sufficient batch import functionality for the study.
To evaluate the results, the research team counted true positives, false positives, and false negatives based on exact matches from the NLU services, and calculated an F-score for each service (an F-score is a measure of accuracy based on the harmonic mean of precision and recall).
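To make the scoring concrete, here is a minimal Python sketch of an exact-match evaluation: each mismatch is counted as both a false positive (for the predicted intent) and a false negative (for the gold intent), and the F-score is computed as the harmonic mean of precision and recall. The intent labels and counting scheme below are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: compare an NLU service's predicted intents
# against the gold labels using exact matching.
gold      = ["MakeUpdate", "SetupPrinter", "None", "ShutdownComputer"]
predicted = ["MakeUpdate", "None",         "None", "ShutdownComputer"]

counts = Counter()
for g, p in zip(gold, predicted):
    if p == g:
        counts["tp"] += 1
    else:
        counts["fp"] += 1  # the predicted intent was wrong
        counts["fn"] += 1  # the gold intent was missed

print(f"F1 = {f1_from_counts(counts['tp'], counts['fp'], counts['fn']):.2f}")
```

Running this toy example prints F1 = 0.75; in the study, the same kind of score is computed over the full test set for each service.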
Note: These results are only a snapshot of the current state of the compared services; cloud-based services in particular may change over time.
Based on the data, LUIS showed the best results! Be sure to check out the full paper for details.
Happy Making!
The Bot Framework Team.