The Bot Framework now supports speech as a method of interacting with the bot across Webchat, the DirectLine channel, and Cortana. In this article we’ll go over the new capabilities, speech recognition priming using LUIS, and a new NuGet package we’ve released which supports speech recognition and synthesis on the DirectLine channel.
Web Chat now includes speech capabilities
If you’ve ever created and registered a bot on the Microsoft Bot Framework, you may already be familiar with the Web Chat channel. This channel is automatically configured for your bot when it is registered, and enables users to interact with your bot via a web chat control using text. Developers can also easily embed the web chat control in a website using the <iframe> HTML element, passing in the bot’s credentials.
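As a rough sketch, a minimal embed might look like the following (YOUR_BOT_ID and YOUR_SECRET are placeholders for the values from your bot’s registration page):

```html
<!-- Hypothetical embed snippet; substitute your bot's ID and Web Chat secret -->
<iframe src="https://webchat.botframework.com/embed/YOUR_BOT_ID?s=YOUR_SECRET"
        style="width: 400px; height: 500px;"></iframe>
```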
Below is a screenshot of a web chat control connected to our sample trivia bot. Notice anything different?
There’s a little microphone icon in the bottom right-hand corner, which the user can use to initiate a conversation using speech; the bot’s response can also include a spoken utterance. Users can now easily have spoken conversations with bots, in addition to typing text or selecting from touch UI menus.
To give this a try, check out the WebChat repo on GitHub and the sample at samples/speech/index.html. We’ll soon be updating the Embed Code for the Web Chat channel so you can opt in to speech without having to build the WebChat from the GitHub repo.
Updated Message Activity
For a bot to have speech-enabled interactions, there are several things it must be able to do:

1) Understand a user’s speech
2) Speak back to the user
3) Automatically listen after it asks the user a question
4) Stop listening or speaking if the user begins interacting in another way (text, touch)
Within the Bot Framework, we’ve made it very simple to implement speech-enabled conversations for bots using two new fields in IMessageActivity: Speak and InputHint.
The Speak field accommodates both plain text and SSML (Speech Synthesis Markup Language), which the bot uses to specify how a client such as Web Chat should synthesize audio to speak the response back to the user. Here’s an example from the sample trivia bot connected to the Bot Framework Emulator:
In the ‘Details’ viewer in the emulator, we can view the JSON payload of a message. You can see SSML passed into the speak property of the JSON response message, and within the SSML the phrase to speak. Note that SSML also supports pre-recorded audio via the audio element, so you can even include sound effects or pre-recorded phrases for deep customization.
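As an illustrative sketch (the question text and exact SSML here are made up, but the field names follow the Bot Framework activity schema), such a payload might look like:

```typescript
// Hypothetical response activity from a trivia bot (illustrative values).
// "text" is what the client displays; "speak" carries SSML for the client to synthesize.
const questionActivity = {
  type: "message",
  text: "Which category would you like: Geography, Science, or Sports?",
  speak:
    '<speak version="1.0" xml:lang="en-US">' +
    "Which category would you like: Geography, Science, or Sports?" +
    "</speak>",
  inputHint: "expectingInput", // the bot just asked a question and is awaiting a reply
};

console.log(questionActivity.speak);
```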
You’ll also note the inputHint property in the response, which the bot can use to specify its current interaction mode, i.e. whether it is accepting, expecting, or ignoring the user’s input. In this example, the bot’s response asks a question and inputHint is set to expectingInput, meaning the bot is awaiting input from the user. On many channels, this would open the microphone and enable the client’s input box, allowing a natural multi-turn conversation without the user needing to manually initiate speech recognition by clicking or tapping.
Now let’s take a look at the next response from the Bot after the user has answered the question:
The bot informs us that the answer was incorrect, and the inputHint field here is set to ignoringInput. As the name suggests, in this message the bot is ignoring user input (because it’s busy picking the next question to ask) and intends to send us subsequent messages. Depending on the client/channel, this may cause the client’s input box and microphone to be disabled.
Lastly, here’s an example of acceptingInput after we say something the bot does not understand. The difference from expectingInput is that the bot is not asking a question, but is passively ready for input in case the user responds. Speech-enabled clients would not automatically start listening, but would allow the user to tap the microphone button or type to initiate a response.
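To summarize the three hints, here’s a small sketch of how a speech-enabled client might react to each value (exact behavior varies by channel; this mapping is illustrative, not the Web Chat implementation):

```typescript
// Illustrative mapping from inputHint values to client behavior.
type InputHint = "expectingInput" | "acceptingInput" | "ignoringInput";

function microphoneBehavior(hint: InputHint): string {
  switch (hint) {
    case "expectingInput":
      // The bot asked a question: open the mic and enable the input box.
      return "open microphone automatically";
    case "acceptingInput":
      // The bot is passively ready: let the user tap the mic or type.
      return "stay idle; user may tap the mic or type";
    case "ignoringInput":
      // The bot is busy and more messages are coming: disable input.
      return "disable input box and microphone";
  }
}

console.log(microphoneBehavior("expectingInput"));
```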
As a developer, setting inputHint is optional; the Bot Builder SDK will automatically implement the logic for you if you’re using system-provided dialogs such as Say. If you’re creating custom responses, you should set inputHint yourself.
If you are building a bot on a speech-enabled channel such as Cortana, these are the only two properties you need to implement to construct messages which will be spoken by your bot. Told you we’ve made it simple!
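For a custom response, the sketch below shows the idea: wrap the reply text in SSML for speak and set inputHint explicitly (spokenReply is a hypothetical helper for illustration, not an SDK function):

```typescript
// Hypothetical helper that builds a message activity with both speech properties set.
type InputHint = "expectingInput" | "acceptingInput" | "ignoringInput";

function spokenReply(text: string, inputHint: InputHint) {
  return {
    type: "message",
    text,
    // Wrap the display text in minimal SSML so a speech-enabled client can synthesize it.
    speak: `<speak version="1.0" xml:lang="en-US">${text}</speak>`,
    inputHint,
  };
}

// The bot is about to send a follow-up question, so it ignores input for now.
const reply = spokenReply("Sorry, that's incorrect! Picking the next question...", "ignoringInput");
console.log(reply.inputHint);
```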
See the documentation for an overview of how to add speech to messages.
Intent-based Speech Priming for natural language
Without context, speech can often be misinterpreted. Different people can hear the exact same thing and interpret and understand it completely differently. Bots can misinterpret in the same way, and this often leads to unpleasant user experiences. For example, in a chess scenario, a user might say:
“Move knight to A 7”
Without context for the user’s intent, the utterance might be recognized as:
“Move night 287”
We now support speech recognition priming, which allows you to provide context via your bot, to ensure that speech relevant to your scenario is recognized accurately. Many bot developers already use LUIS to extract the meaning behind the user’s text-based input. LUIS is able to do this since it’s trained using example utterances to capture what the user is likely to say as well as the context. Speech recognition priming uses the utterances and entity tags in your LUIS models to improve accuracy and relevance while converting audio to text.
In our chess scenario sample, we created an intent called MakeChessMove, and created two custom entities:
Then, we provide sample utterances that a user might say, and train and publish our model. How do we configure this model for speech? Good news, that’s the easy part!
When you register a new bot using a LUIS model, behind the scenes the Bot Framework will automatically leverage the contents of the LUIS model to create a speech model. After you create a bot, to enable speech recognition priming, simply go to your bot’s settings page, where there is a new field for speech recognition options, as shown below:
From here you only need to check the relevant LUIS models you want to associate with your bot to enable speech recognition priming. If you associate your LUIS app with your bot in this way, speech priming is enabled in Cortana, Bot Framework Emulator, Speech enabled webchat control, and DirectLine via the Microsoft.Bot.Client NuGet package. This speech model is automatically updated any time you train and publish your LUIS model.
Things to note:
- Speech recognition priming already supports LUIS built-in entities.
- For custom entities, we use the tags associated with the entity definition in your LUIS app.
- LUIS phrase list features (not to be confused with a closed list entity) are currently not used in priming speech.
If you find that a specific spoken phrase isn’t being recognized correctly, please add it as an utterance in your LUIS model. Similarly, if an entity value isn’t being recognized correctly, make sure you have an example utterance for that entity value and that the appropriate words are tagged as the entity.
Cross platform speech support in your app using the DirectLine channel
We’ve released a new NuGet package, Microsoft.Bot.Client, which allows you to embed your bot in your applications and supports both speech recognition and speech synthesis. The library supports both UWP and Xamarin C# applications, allowing developers to include speech-enabled conversations with bots across different platforms (native iOS, Android, Windows).
NOTE: You will need to enable the DirectLine channel on your bot to use this package.
NOTE: Speech recognition is supported across platforms. However, intent based speech priming is currently only supported for Windows clients.
From our sample UWP application, to connect to the Trivia Bot, we create a new instance of the client with the proper authorization to access the bot’s DirectLine channel.
This client can be used to register for events related to speech recognition, speech synthesis, updating UI state, starting and ending conversation, etc.
Why would a developer want to use this NuGet? Well, it leverages the flexibility of the DirectLine channel, and allows you to develop custom UI for your application.
What is the DirectLine Channel anyway?
The DirectLine channel allows developers to connect to their bot from anywhere. This means that you can build a completely custom client and maintain complete control over the end-to-end experience. When you connect a bot to other channels (say Facebook, Skype, Cortana, or Slack), you are, so to speak, ‘locked in’ to developing specifically to accommodate those clients.
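Assuming the DirectLine v3 REST flow (start a conversation, then post activities to it), a custom client’s requests can be sketched as below. No network call is made here; we only build request descriptors, and DIRECT_LINE_SECRET is a placeholder:

```typescript
// Sketch of the two DirectLine v3 REST calls a custom client makes.
// No network I/O here: we only build request descriptors for illustration.
const BASE = "https://directline.botframework.com/v3/directline";

function startConversationRequest(secret: string) {
  // POST /conversations with the channel secret opens a new conversation.
  return {
    method: "POST",
    url: `${BASE}/conversations`,
    headers: { Authorization: `Bearer ${secret}` },
  };
}

function sendActivityRequest(secret: string, conversationId: string, text: string) {
  // POST an activity (a user message) into an existing conversation.
  return {
    method: "POST",
    url: `${BASE}/conversations/${conversationId}/activities`,
    headers: { Authorization: `Bearer ${secret}`, "Content-Type": "application/json" },
    body: JSON.stringify({ type: "message", from: { id: "user1" }, text: text }),
  };
}

const start = startConversationRequest("DIRECT_LINE_SECRET");
const send = sendActivityRequest("DIRECT_LINE_SECRET", "abc123", "start trivia");
console.log(start.url, send.url);
```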
In fact, the Bot Emulator, which we commonly use to test and debug bots, is a web chat control instance using the DirectLine channel.
Using DirectLine, not only can you define your own UI with your own features, but you can also connect multiple applications to a registered bot simultaneously. First, let’s take a look at the custom UI sample we built for the Trivia Bot using the new Microsoft.Bot.Client NuGet package.
Using the NuGet package within our sample trivia app, we have full freedom to design a new UX to interact with a bot. In the above screenshot, we added a UI meant to bring the feel of a game show, where the bot is the host and the user is standing behind a podium. We also bound button clicks in the UI, making the whole application feel like a custom experience as opposed to a standard chat interface. For example, the user could switch to the “Geography” category by saying something like “switch to the geography category” OR by just clicking the “Geography” category button on the podium. The UI also keeps track of the user’s score as the game progresses.
You can now type, tap and also talk to interact with your bot. We’ve made it easy for you to add speech recognition capabilities to your own applications that can connect to your Bot, and the speech recognition is primed and specialized to your bot using the LUIS models you already have!
We’ve released a new NuGet package, Microsoft.Bot.Client, which enables voice communication for UWP and cross-platform (via Xamarin) applications on top of the DirectLine REST API; and we’ve also updated the Web Chat control to support speech interactions.
You can also check out our samples on GitHub.
If you missed it at Build 2017, check out the Bot conversations for apps presentation which showcases all of the concepts we discussed above.
Khuram Shahid and Matthew Shim from the Bot Framework Team