In part 1, we gave an updated overview of the Recognizers-Text library, which powers many of the prebuilt entities in LUIS. We walked through writing your own language-specific definitions in YAML for both the .NET and JavaScript versions of the project, and generating the platform-specific definitions with the tools the project already provides. By the end of that post, we were left with new generated definition files.
The process for fully adding a new language is rather long, so we’ve split it into three general steps:
- Define language specific definitions – Part 1
- Implement language specific extractors & parsers – this post
- Test extraction and parsing to verify the new language patterns, refine and repeat
Overview
From part 1, we managed to generate some language specific regex patterns. Great, but the recognizers still need to consume these definitions somehow. The next step is to create a language model which can actually use those definitions in a meaningful way. In Recognizers-Text, this is performed by two key components in the program – Extractors and Parsers.
Extractors recognize relevant entities and pull them from a user’s query; Parsers then provide each extracted entity with a resolution. For example, using the number recognizer, a query containing a number entity might be “I see two cars.” In the extraction phase, “two” would be extracted. The parser is then responsible for recognizing “two” and resolving it to its numerical representation, “2”.
Both the .NET and JavaScript libraries for the project include powerful base classes for extracting and parsing entities. This makes it far easier to extend the recognizers to support new languages, as language-specific configurations can build on these abstract base classes.
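To make the two phases concrete, here is a minimal, self-contained sketch in TypeScript. It does not use the Recognizers-Text API; the names `SketchExtractResult`, `SketchParseResult`, `extractNumbers`, `parseNumber` and the tiny word map are all assumptions made up for illustration. The real library's results carry more information.

```typescript
// Hypothetical, simplified result shapes (not the library's own types).
interface SketchExtractResult {
  text: string;  // the matched span, e.g. "two"
  start: number; // offset of the span within the query
}

interface SketchParseResult extends SketchExtractResult {
  value: number; // the resolved value, e.g. 2
}

// A tiny stand-in for a language definition: number words and their values.
const sketchNumberWords: { [word: string]: number } = { one: 1, two: 2, three: 3 };

// Extraction phase: find the spans of the query that look like number entities.
function extractNumbers(query: string): SketchExtractResult[] {
  const words = Object.keys(sketchNumberWords).join("|");
  const pattern = new RegExp("\\b(?:" + words + ")\\b", "gi");
  const results: SketchExtractResult[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(query)) !== null) {
    results.push({ text: match[0], start: match.index });
  }
  return results;
}

// Parsing phase: resolve an extracted span to its numerical representation.
function parseNumber(extracted: SketchExtractResult): SketchParseResult {
  const value = sketchNumberWords[extracted.text.toLowerCase()];
  if (value === undefined) {
    throw new Error(`Cannot resolve "${extracted.text}"`);
  }
  return { text: extracted.text, start: extracted.start, value };
}

const resolved = extractNumbers("I see two cars.").map(parseNumber);
console.log(resolved); // one result: text "two", start 6, value 2
```

Keeping the phases separate means the extraction patterns and the resolution logic can each be swapped per language without touching the other.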
.NET
Extractors
Recall that we’re using the Numbers recognizer (in English) to explain the concepts of the library. In Visual Studio’s Solution Explorer you’ll find the Extractors folder. Inside are the base classes, which do most of the heavy lifting of extracting a relevant entity from a query. All of the language-specific extractors inherit from these base classes.
As an example, let’s examine DoubleExtractor.cs from the English folder.
DoubleExtractor inherits from BaseNumberExtractor, which performs most of the work of extracting an entity from a query. The class creates a dictionary of the regex patterns needed to extract that specific entity, taken from the language-specific definitions created earlier. Different languages may require different sets of regex patterns for different entities, and it is up to the developer to determine how many definition patterns the entity needs and what they are.
Parsers
The parser is responsible for providing resolution for extracted entities from a query. Like the extractors, every recognizer project (Number, Units, DateTime) includes re-usable base parser classes, which the language specific implementations will use.
The English-specific number parser configuration, EnglishNumberParserConfiguration.cs, implements the INumberParserConfiguration interface.
public class EnglishNumberParserConfiguration : INumberParserConfiguration
{
    public EnglishNumberParserConfiguration() : this(new CultureInfo(Culture.English)) { }

    public EnglishNumberParserConfiguration(CultureInfo ci)
    {
        this.LangMarker = NumbersDefinitions.LangMarker;
        this.CultureInfo = ci;
        this.DecimalSeparatorChar = NumbersDefinitions.DecimalSeparatorChar;
        this.FractionMarkerToken = NumbersDefinitions.FractionMarkerToken;
        this.NonDecimalSeparatorChar = NumbersDefinitions.NonDecimalSeparatorChar;
        ...
        this.CardinalNumberMap = NumbersDefinitions.CardinalNumberMap.ToImmutableDictionary();
    }

    // Properties
    public CultureInfo CultureInfo { get; private set; }
    public char DecimalSeparatorChar { get; private set; }
    ...

    // Methods for language specific formatting as required
}
Similar to the extractor, the regex patterns needed to parse are taken from the language definition file, and declared as class properties. Additionally, here is where we could add some custom methods to resolve any language/culture, or entity specific formatting.
Registering the language model
Finally, the last thing we need to do is register our language model with the Recognizer project. For numbers, we can do this in NumberRecognizer.cs. Each recognizer will have a similar class responsible for registering different language-specific entity models to the project.
Simply include the appropriate namespace at the top, and you can register a new language model in the constructor following the format of the existing languages. After this, we can rebuild the solution and the new language model should be incorporated into the recognizer project, and ready to test!
Node.js
The JavaScript version of Recognizers-Text follows the same pattern of extracting and parsing.
- Extractors – used to extract relevant entities from a query
- Parsers – used to provide resolution for an extracted entity
Each recognizer project includes base classes for extracting and parsing, which language specific targets must implement.
The extractors and parsers are written in TypeScript, and we need an npm dependency called ts-node in order to run some scripts later on. If you followed along with part 1, you may already have it installed. If not, open your terminal and run the following command to install ts-node globally on your machine:
npm install -g ts-node
If you do not have TypeScript, install it by running the following:
npm install -g typescript
Extractors
Below is a snippet taken from extractors.ts in the English folder of the number recognizer package.
import { BaseNumberExtractor, RegExpValue, BasePercentageExtractor } from "../extractors";
import { Constants } from "../constants";
import { NumberMode, LongFormatType } from "../models";
import { EnglishNumeric } from "../../resources/englishNumeric";
import { RegExpUtility } from "@microsoft/recognizers-text";

export class EnglishNumberExtractor extends BaseNumberExtractor {
    protected extractType: string = Constants.SYS_NUM;

    constructor(mode: NumberMode = NumberMode.Default) {
        ...
    }
}

export class EnglishCardinalExtractor extends BaseNumberExtractor {
    ...
}

export class EnglishIntegerExtractor extends BaseNumberExtractor {
    ...
}

export class EnglishPercentageExtractor extends BasePercentageExtractor {
    ...
}
As the snippet above shows, all of the extractors for the target language are consolidated in a single TypeScript file. Each extractor builds a different array of regex definitions for its target entity.
It is up to the developer to decide how many regex definitions an extractor needs and what those definitions are. At an abstract level, the developer needs to think carefully about how to express the entity in question (say, a number) as high-level, re-usable abstractions that fit the needs of the language. Note that this is the most difficult part of creating your language model, and you may often find that you need to edit your language definitions and re-generate the regex definitions iteratively.
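The idea of a shared base class driven by a language-specific array of regex definitions can be sketched as follows. This is a simplified illustration, not the library's BaseNumberExtractor: `SketchBaseExtractor`, `SketchEnglishIntegerExtractor`, `SketchEnglishDefinitions` and the two patterns are hypothetical names and values chosen for the example.

```typescript
// Hypothetical stand-in for a generated definitions file.
const SketchEnglishDefinitions = {
  DigitsPattern: "\\b\\d+\\b",
  NumberWordPattern: "\\b(?:one|two|three|four|five|six|seven|eight|nine|ten)\\b",
};

interface SketchRegex {
  regExp: RegExp; // must carry the "g" flag so exec() can walk the string
  label: string;  // which definition produced the match
}

interface SketchExtraction {
  text: string;
  start: number;
  label: string;
}

// The base class owns the matching loop; subclasses only supply patterns.
abstract class SketchBaseExtractor {
  protected abstract regexes: SketchRegex[];

  extract(source: string): SketchExtraction[] {
    const results: SketchExtraction[] = [];
    for (const { regExp, label } of this.regexes) {
      regExp.lastIndex = 0; // reset state left over from a previous call
      let match: RegExpExecArray | null;
      while ((match = regExp.exec(source)) !== null) {
        results.push({ text: match[0], start: match.index, label });
      }
    }
    // Report matches in the order they appear in the query.
    return results.sort((a, b) => a.start - b.start);
  }
}

// A language-specific extractor is then little more than a list of patterns.
class SketchEnglishIntegerExtractor extends SketchBaseExtractor {
  protected regexes: SketchRegex[] = [
    { regExp: new RegExp(SketchEnglishDefinitions.DigitsPattern, "g"), label: "digits" },
    { regExp: new RegExp(SketchEnglishDefinitions.NumberWordPattern, "gi"), label: "word" },
  ];
}

const extractions = new SketchEnglishIntegerExtractor().extract("I have 3 apples and two pears");
console.log(extractions);
```

In this sketch, supporting another language would mean writing a second subclass that points at a different definitions object, while the matching loop stays untouched.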
Parsers
Similarly, a language-specific parser is defined in TypeScript and implements the entity-specific base parser. Regex definitions from the generated files are used to populate the static properties of the parser; see parserConfiguration.ts for an example.
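As a rough illustration of that split between a generic parser and a language-specific configuration (this is not the contents of parserConfiguration.ts), consider the sketch below. The property names loosely mirror those in the .NET snippet earlier in this post, but every type and class here, including the second configuration with swapped separators, is hypothetical.

```typescript
// Hypothetical configuration contract: everything language-specific lives here.
interface SketchParserConfiguration {
  readonly decimalSeparatorChar: string;
  readonly nonDecimalSeparatorChar: string;
  readonly cardinalNumberMap: { [word: string]: number };
}

class SketchEnglishParserConfiguration implements SketchParserConfiguration {
  readonly decimalSeparatorChar = ".";
  readonly nonDecimalSeparatorChar = ",";
  readonly cardinalNumberMap: { [word: string]: number } = { one: 1, two: 2, three: 3 };
}

// A made-up second configuration that swaps the separators, to show reuse.
class SketchSwappedSeparatorConfiguration implements SketchParserConfiguration {
  readonly decimalSeparatorChar = ",";
  readonly nonDecimalSeparatorChar = ".";
  readonly cardinalNumberMap: { [word: string]: number } = {};
}

// The parser itself contains no language-specific knowledge.
class SketchNumberParser {
  constructor(private readonly config: SketchParserConfiguration) {}

  parse(text: string): number {
    const key = text.trim().toLowerCase();
    const map = this.config.cardinalNumberMap;
    if (Object.prototype.hasOwnProperty.call(map, key)) {
      return map[key];
    }
    // Drop grouping separators, then normalise the decimal separator to ".".
    const normalized = key
      .split(this.config.nonDecimalSeparatorChar).join("")
      .split(this.config.decimalSeparatorChar).join(".");
    const value = Number(normalized);
    if (normalized === "" || isNaN(value)) {
      throw new Error(`Cannot resolve "${text}"`);
    }
    return value;
  }
}

const englishParser = new SketchNumberParser(new SketchEnglishParserConfiguration());
const swappedParser = new SketchNumberParser(new SketchSwappedSeparatorConfiguration());
console.log(englishParser.parse("two"));     // 2
console.log(englishParser.parse("1,234.5")); // 1234.5
console.log(swappedParser.parse("1.234,5")); // 1234.5
```

Passing the configuration into the constructor keeps the parsing logic identical across languages; only the data it consults changes.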
Registering the language model
After the new language-specific extractors and parsers are defined, the new language model must be registered with the project. In recognize-text-number.ts, import the relevant extractors and parsers using TypeScript’s import syntax, and add the new language’s configuration to the class constructor.
Compiling the TypeScript parsers and extractors to JavaScript
From your terminal, cd into the root directory of the JavaScript project and run the following command:
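Conceptually, registration amounts to mapping a culture code to something that can build the model for that language. The sketch below shows that pattern in isolation; `SketchNumberRecognizer`, `registerModel`, `getModel` and `SketchEnglishNumberModel` are hypothetical names, and the real recognizer class in the project is organised differently and should be followed as-is.

```typescript
// Hypothetical model contract: turn a query into resolved number values.
interface SketchModel {
  resolve(query: string): number[];
}

// A toy English model, standing in for a real extractor plus parser pair.
class SketchEnglishNumberModel implements SketchModel {
  private readonly words: { [word: string]: number } = { one: 1, two: 2, three: 3 };

  resolve(query: string): number[] {
    const values: number[] = [];
    for (const token of query.toLowerCase().split(/[^a-z]+/)) {
      if (Object.prototype.hasOwnProperty.call(this.words, token)) {
        values.push(this.words[token]);
      }
    }
    return values;
  }
}

class SketchNumberRecognizer {
  private readonly factories: { [culture: string]: () => SketchModel } = {};

  constructor() {
    // Existing languages are registered in the constructor.
    this.registerModel("en-us", () => new SketchEnglishNumberModel());
    // A new language model would be registered here in the same way.
  }

  registerModel(culture: string, factory: () => SketchModel): void {
    this.factories[culture.toLowerCase()] = factory;
  }

  getModel(culture: string): SketchModel {
    const factory = this.factories[culture.toLowerCase()];
    if (!factory) {
      throw new Error(`No model registered for culture "${culture}"`);
    }
    return factory();
  }
}

const sketchRecognizer = new SketchNumberRecognizer();
console.log(sketchRecognizer.getModel("en-us").resolve("I see two cars and three bikes"));
```

Registering factories rather than ready-made instances means a model is only constructed when its culture is actually requested.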
npm run build
This builds the recognizers in the Recognizers-Text/JavaScript/packages folder, compiling all of the TypeScript files in each. Each recognizer in the packages folder also contains its own TypeScript configuration. Using the recognizers-number package as an example again, the configuration is as follows:
{
    "compilerOptions": {
        "module": "commonjs",
        "target": "es2015",
        "outDir": "compiled",
        "sourceMap": true,
        "rootDir": "src",
        "moduleResolution": "node",
        "declaration": true,
        "declarationDir": "dist/types",
        "allowSyntheticDefaultImports": true,
        "typeRoots": [
            "node_modules/@types"
        ]
    },
    "include": [
        "src"
    ]
}
The compiled .js files are written to a folder called ‘compiled’, while the TypeScript declaration (.d.ts) files are written to a directory called ‘dist/types’. For this package, both can be found under Recognizers-Text/JavaScript/packages/recognizers-number.
And that’s it! You should now have a reasonable road map for creating extractors and parsers for Recognizers-Text in JavaScript. Language-specific extractors and parsers can take advantage of the abstract base classes already provided by the library, and can be adjusted to consume different regex definition files for different languages.
Summary
Recognizers-Text is an open source library the LUIS team released last year, which provides robust recognition and resolution for common entities such as numbers, units, and dates and times. The library itself serves as a powerful core that performs all of the heavy lifting of extracting and parsing entities. When it was originally released, only three languages were supported: English, Spanish, and Chinese (Mandarin). Thanks to open source contributions, French and Portuguese have also been added, with more currently in development.
In the final post of this series, we’ll discuss how to verify a new language model by creating and running unit tests, and how you can debug the process.
Thanks for tuning in again! We hope these posts invite and encourage you to create your own open source contributions to this project.
Happy Making!
Matthew Shim from the Bot Framework Team.