Last year, we announced a new recognizer library by the LUIS team, which provides robust recognition and resolution for common units expressed in everyday human interaction. Since then, the code base has changed considerably, and the library has been expanded to include more pre-built entities including date-time, currency, dimensions, and age.
Today, we’ll take a look inside the code base of Recognizers-Text and walk through the process of extending the project to support new languages, using number units in English as an example. Hopefully, by the end you will be confident enough to fork or clone the project and begin a contribution of your own! Even if a particular language is already supported in the library, it can always be improved. This is the first post in a series: it introduces the contents of the project and details how to generate the definition files that serve as the primary starting point for extending Recognizers-Text to support new languages.
Prerequisites
- GitHub account
- Microsoft/Recognizers-Text repo – fork a copy to your GitHub account, and clone a local copy to your machine
To start, you may use any text editor of your choice. The two environments supported are .NET and Node.js. At the moment, only the NuGet package for .NET is available, but an npm module for Node.js will be published in the near future!
Inside the Repo
Every unit recognizer in the project applies two primary steps: extraction and parsing. When a user’s utterance is processed by LUIS, the same steps are applied for entity recognition: first the relevant entities are extracted from the utterance, then they are parsed to give them meaning, i.e. a resolution. In Recognizers-Text, the heavy lifting for these steps is already provided by language-agnostic base classes.
In the .NET number recognizer project, you can find the BaseNumberExtractor and BaseNumberParser. These two base classes are inherited by language-specific extractor and parser classes. For example, the English CardinalExtractor.cs class inherits from BaseNumberExtractor (note the namespace). Similarly, a language-specific parser inherits from the agnostic base parser.
But how does the program know which units to extract in the first place? This is where the language-specific regular expression definitions come in. Custom language patterns and definitions are needed to describe how the same units are expressed in different languages.
The workflow to extend a new language can be mapped out like this:
- Define language specific definitions
- Implement language specific extractors & parsers
- Test extraction and parsing to verify the new language patterns, refine and repeat
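To make the extract-then-parse pipeline concrete, here is a hypothetical, heavily simplified sketch in TypeScript. The interface and function names below are illustrative only; the real BaseNumberExtractor and BaseNumberParser classes in the library are far more complete.

```typescript
// Hypothetical sketch of the two-step pipeline: extraction finds candidate
// spans, parsing resolves them to values.
interface ExtractResult {
  start: number;   // index of the match in the utterance
  length: number;  // length of the matched text
  text: string;    // the extracted span, e.g. "three"
}

// Step 1: extraction - find spans that look like number entities.
function extract(utterance: string): ExtractResult[] {
  const regex = /\b(zero|one|two|three|four|five|six|seven|eight|nine)\b/gi;
  const results: ExtractResult[] = [];
  let m: RegExpExecArray | null;
  while ((m = regex.exec(utterance)) !== null) {
    results.push({ start: m.index, length: m[0].length, text: m[0] });
  }
  return results;
}

// Step 2: parsing - resolve each extracted span to a value.
const resolutionMap: Record<string, number> = {
  zero: 0, one: 1, two: 2, three: 3, four: 4,
  five: 5, six: 6, seven: 7, eight: 8, nine: 9,
};

function parse(er: ExtractResult): number {
  return resolutionMap[er.text.toLowerCase()];
}

const values = extract("I need three rooms for two nights").map(parse);
console.log(values); // [ 3, 2 ]
```

Note that extraction and parsing are kept separate on purpose: the extractor only needs to know *where* an entity occurs, while the parser carries the language-specific knowledge of *what* it resolves to.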
We’ll start with step 1, defining language specific definitions.
Language specific patterns
This is where you should start if you are adding a new language contribution. The Patterns folder contains sub-folders for the different spoken languages currently supported or in development.
For more information on YAML format refer to the Official YAML Website
The .yaml files, such as English-Numbers.yaml, are where we provision the individual regular expression patterns used to build the static regular expressions the program will match against a user query. Each pattern represents an abstraction of an entity to target in the language. For example, English-Numbers.yaml defines a ZeroToNineIntegerRegex. It is a simpleRegex type (explained below), and its definition describes the regular expression string; as the name implies, it matches the integers zero, one, two, three, etc. Below is a list of the different types you can currently use to build the regular expression patterns:
!char – Used for simple char or string constant definitions.
!simpleRegex – Used for regex patterns which don’t contain other regexes or parameters within, only constant strings.
!nestedRegex – A regex pattern composed of other regex definitions.
!paramsRegex – A parameterized regex. Similar to the nestedRegex notation, but uses a function-like implementation.
!dictionary – A key-value associative structure, used for the resolution of recognized expressions.
!list – Used to define multiple possible values for any basic data type.
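To make these types concrete, here is a hand-written sketch of what such definitions can look like in a .yaml file. The entry names and field layout below are illustrative only, not copied from the repo; take the exact structure from an existing file such as English-Numbers.yaml.

```yaml
# Illustrative sketch only; copy the exact structure from English-Numbers.yaml.
ZeroToNineIntegerRegex: !simpleRegex
  def: (zero|one|two|three|four|five|six|seven|eight|nine)
RoundNumberIntegerRegex: !simpleRegex
  def: (hundred|thousand|million)
# A nestedRegex interpolates previously defined patterns by name:
NumberWithRoundRegex: !nestedRegex
  def: ({ZeroToNineIntegerRegex}\s+{RoundNumberIntegerRegex})
  references: [ ZeroToNineIntegerRegex, RoundNumberIntegerRegex ]
# A dictionary maps recognized text to its resolved value:
ZeroToNineMap: !dictionary
  types: [ string, long ]
  entries:
    zero: 0
    one: 1
    two: 2
```

The important idea is re-use: small building-block patterns like ZeroToNineIntegerRegex are defined once, then composed into larger patterns via nestedRegex, and the dictionary entries supply the resolution values used at parse time.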
The process of defining these abstractions for a new language can be quite challenging! Not only does it require a reasonable understanding of the language you’re trying to implement, but you need to define the abstractions in a re-usable way, and also create valid regular expressions that handle the many ways a user could express that entity. To get started, don’t worry too much about getting every expression exactly right; you can always come back to the .yaml files and regenerate them later on. To see how other units are defined, refer to the existing .yaml files in the Patterns folder of the repo.
After the regular expression patterns are defined in the .yaml files, the definitions are ready to be generated. The next two sections will detail how to do this for .NET and Node.js respectively.
Generating platform-specific definitions – .NET
First, we need to set a target path for the new definitions file we will generate. In the Microsoft.Recognizers.Definitions project of the solution, open the folder for the language we’ll be generating definitions for (if it doesn’t exist, go ahead and create it). Next, create a .tt (T4) file for the type of recognizer you are creating. For example, since we are compiling the English-Numbers.yaml file for number entities, we’d name this file NumberDefinitions.tt; similarly, for date-time we’d name it DateTimeDefinitions.tt. In this file, we define the path to the source file, as well as some metadata for the language and the class name. You can use the following general format for these files:
<#@ template debug="true" hostspecific="true" language="C#" #>
<#
    this.DataFilename = @"Patterns\Your-Language\Name-Of-Yaml-Source-File";
    this.Language = "Your-Language";
    this.ClassName = "Name-Of-This-TT-File";
#>
<#@ include file="..\CommonDefinitions.ttinclude" #>
See the English NumberDefinitions.tt file in the repo as a reference. The CommonDefinitions.ttinclude file sets the template configuration that generates the definition files from the .yaml files for each language.
When the .tt (T4) file is completed, we are ready to generate the definitions. Right click on the Definitions.tt file, and select Run Custom Tool.
Running the tool for NumberDefinitions.tt generates NumberDefinitions.cs. This file is what the rest of the .NET project uses to extract relevant units out of a user’s query, and then parse them for a proper resolution.
Some things to note:
- Start with Numbers. The other recognizers (datetime, units) are dependent on numbers to work.
- Tabs are NOT valid in .yaml files; use spaces only. If a tab is detected, your definitions will fail to compile.
- Luckily, if you run into this problem, Visual Studio has a handy built-in feature to remove tabs. Select the relevant text, then choose Edit -> Advanced -> Untabify Selected Lines.
- To support accented characters in your regex pattern, wrap the alternatives in square brackets [ ]. A character class in square brackets matches any single one of the characters inside it.
- For example, the French-DateTime.yaml file defines the month of December (décembre) so that the user can express the month with or without the accented ‘é’. Lastly, the .yaml file needs to be saved with encoding Unicode (UTF-8 without signature).
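The character-class technique is easy to verify on its own. The small TypeScript check below uses a made-up pattern (not the actual French-DateTime.yaml definition) to show that [eé] matches both the plain and the accented spelling:

```typescript
// [eé] matches either the plain or the accented letter, so a single pattern
// covers both "decembre" and "décembre".
const decemberRegex = /d[eé]cembre/i;

console.log(decemberRegex.test("décembre")); // true
console.log(decemberRegex.test("decembre")); // true
console.log(decemberRegex.test("janvier"));  // false
```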
Generating platform-specific definitions – Node.js
Install the ts-node npm module to your machine if you don’t already have it, by running the following command in your terminal:
npm install -g ts-node
The resource-generator folder contains a Node.js program which generates the namespaces in src/resources for each recognizer, reading the source definitions from the .yaml files in the Patterns folder.
Inside the recognizers-number folder is a file called resource-definitions.json. This file contains the configuration settings for the definitions output in different languages for that specific recognizer. Each recognizer within the packages folder contains a similar file to generate its own definitions from a target .yaml file.
To generate the definitions in the Node.js version of Recognizers-Text, add the new language configuration to resource-definitions.json, following the same format as the existing entries. Once the target language configuration is added to the file, the regex definitions from the .yaml file will be ready to compile. Note that the English-Numbers configuration we’ve been examining in this guide is already included.
From your terminal, cd into the root directory for the recognizer, packages/recognizers-number, and run the following command:
npm run build-resources
This will generate a TypeScript file for every language configured in resource-definitions.json. These files will be located in src/resources.
A new folder called compiled will also be generated. From each TypeScript file, a corresponding JavaScript file is generated in packages/recognizers-number/compiled/resources.
This concludes the first part of our series on the Recognizers-Text repo we released as open source last year. In this article we walked through the project structure in addition to showing you where to get started writing regex patterns in YAML format, which is used to generate definition classes for both the .NET and Node.js projects of the Recognizers-Text library.
In the next article, we’ll go over how to actually use these custom definitions by defining language-specific extractors and parsers, and how to configure them to work with the project. Lastly, we need to provision the language model with unit tests to verify proper recognition, extraction, and resolution.
We hope this article has given you a deeper understanding of how this recognizer library works, and the confidence and inspiration to get started on a new open source contribution of your own. In the future, we hope to extend this project to support as many languages as we can manage! When you’re ready, just create a pull request from a new branch, and the LUIS team will review it.
Matthew Shim from the Bot Framework Team.