The stack of conversational UI
The building blocks required to develop a modern and interactive conversational application include:
- Speech recognition (for voicebots)
- NLU
- Conversational level:
- Dictionary/samples
- Context
- Business logic
In this section, we will walk through the "journey" of a conversational interaction along the conversational stack.
Voice recognition technology
Voice recognition (also known as speech recognition or speech-to-text) transcribes voice into text. The computer captures our voice with a microphone and provides a text transcription of the words. Using a simple level of text processing, we can develop a voice control feature with simple commands, such as "turn left" or "call John." Leading providers of speech recognition today include Nuance, Amazon, IBM Watson, Google, Microsoft, and Apple.
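As a minimal sketch of the "simple level of text processing" described above, the following Python snippet maps an already-transcribed utterance to a command. It assumes the speech recognition step has produced a text string; the `COMMANDS` table and the `parse_command` function are hypothetical names used only for illustration, not part of any provider's API.

```python
import re

# Hypothetical command patterns: (regular expression, action name).
COMMANDS = [
    (re.compile(r"^turn (left|right)$"), "turn"),
    (re.compile(r"^call (\w+)$"), "call"),
]


def parse_command(transcript):
    """Map a transcribed utterance to (action, argument), or None if unknown."""
    # Normalize case and drop surrounding whitespace and trailing punctuation.
    text = transcript.lower().strip().rstrip(".!?")
    for pattern, action in COMMANDS:
        match = pattern.match(text)
        if match:
            return action, match.group(1)
    return None


print(parse_command("Turn left"))
print(parse_command("Call John."))
print(parse_command("Please book a flight"))
```

Anything that does not match one of the fixed patterns is rejected, which is why this approach works for voice control but not for free-form requests.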
NLU
To achieve a higher level of understanding, beyond simple commands, we must include a layer of NLU. NLU fulfills the task of reading comprehension. The computer "reads the text" (in a voicebot, it will be the transcribed text from the speech recognition) and then tries to grasp the user's intent behind it and translate it into concrete steps.
Let's take a look at a travel bot as an example. The system identifies two individual intents:
- Flight booking – BookFlight
- Hotel booking – BookHotel
When a user asks to book a flight, the NLU layer is what helps the bot to understand that the intent behind the user's request is BookFlight. However, since people don't talk like computers, and since our goal is to create a humanized experience (and not a computerized one), the NLU layer should understand or be able to connect various requests to a specific intent.
Another example is when a user says, "I need to fly to NYC." The NLU layer is expected to understand that the user's intent is to book a flight. A more complex request for our NLU to understand would be when a user says, "I'm travelling again."
Here, too, the NLU should connect the user's sentence to the BookFlight intent. This is a much harder task, since the sentence contains neither the word "flight" nor a destination that the bot could match against a list of cities or states.
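To make the difficulty concrete, here is a deliberately naive sketch of intent detection by keyword lookup. It is not how a real NLU provider works; the `INTENT_KEYWORDS` table and `detect_intent` function are assumptions made for this illustration. It handles "I need to fly to NYC" but has nothing to hold on to in "I'm travelling again."

```python
import re

# Hypothetical keyword lists for the two intents of the travel bot.
INTENT_KEYWORDS = {
    "BookFlight": {"flight", "fly", "flying"},
    "BookHotel": {"hotel", "room"},
}


def detect_intent(utterance):
    """Return the first intent whose keywords appear in the utterance, else None."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return None


print(detect_intent("I need to fly to NYC"))
print(detect_intent("I'm travelling again"))
```

The second call returns `None`: no keyword is present, so a keyword-only bot cannot connect the sentence to BookFlight.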
Computer science considers NLU to be a "hard AI problem" (Turing Test as a Defining Feature of AI-Completeness in Artificial Intelligence, Evolutionary Computation and Metaheuristics (AIECM), Roman V. Yampolskiy), meaning that even with AI (powered by deep learning), developers are still struggling to provide a high-quality solution. To call a problem AI-hard means that it cannot be solved by a simple, specific algorithm; solving it means dealing with unexpected circumstances, as with any real-world problem. In NLU, those unexpected circumstances are the various configurations of words and sentences in an endless number of languages and dialects. Some leading providers of NLU are Dialogflow (previously api.ai, acquired by Google), wit.ai (acquired by Facebook), Amazon, IBM Watson, and Microsoft.
Dictionaries/samples
To build a good NLU layer that can understand people, we must provide a broad and comprehensive sample set of concepts and categories in a subject area or domain. Simply put, we need to provide a list of associated samples or, even better, a collection of possible sentences for each single intent (request) that a user can activate on our bot. If we go back to our travel example, we would need to build a comprehensive dictionary, as you can see in the following table:
Building these dictionaries, or sets of samples, can be a tough and Sisyphean task. It is domain-specific and language-specific and, as such, requires different configurations and tweaks from one use case to another, and from one language to another. Unlike the GUI, where the user is restricted to the choices presented on the screen, the conversational UI lets the user say anything. For the same reason, it is very difficult to pre-configure to a level of perfection (see the AI-hard problem above). Therefore, the more samples we provide, the better the bot's NLU layer will be able to understand different requests from a user. Beware of the catch-22 in this case: the more intents we build, the more samples are required, and all those samples can easily lead to overlapping intents. For example, when a user says, "I need help," they might mean they want to contact support, but they might also need help with how to use the app.
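The following sketch illustrates sample-based matching and the overlap problem. It scores an utterance against each intent's samples using simple word overlap (Jaccard similarity), which is a stand-in chosen for brevity, not the technique any particular NLU provider uses. The `SAMPLES` data, the `ContactSupport` and `AppHelp` intent names, and the threshold are all assumptions for illustration.

```python
import re

# Hypothetical sample sentences per intent. "I need help" is listed twice on purpose.
SAMPLES = {
    "BookFlight": ["I need to fly to NYC", "book a flight", "I'm travelling again"],
    "BookHotel": ["book a hotel", "I need a room"],
    "ContactSupport": ["I need help", "talk to support"],
    "AppHelp": ["I need help", "how do I use the app"],
}


def tokens(text):
    return set(re.findall(r"[a-z']+", text.lower()))


def similarity(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def classify(utterance, threshold=0.5):
    """Return every intent tied for the best score; an empty list means no match."""
    scores = {
        intent: max(similarity(utterance, sample) for sample in samples)
        for intent, samples in SAMPLES.items()
    }
    best = max(scores.values())
    if best < threshold:
        return []
    return [intent for intent, score in scores.items() if score == best]


print(classify("I'm travelling again"))
print(classify("I need help"))
```

With a matching sample in place, "I'm travelling again" resolves to a single intent, while "I need help" comes back with two candidates. A real bot would have to disambiguate, for example by asking a follow-up question.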
Context
Contextual conversation is one of the toughest challenges in conversational interaction. Being able to understand context is what makes a bot's interaction a humanized one. As mentioned previously, at its minimum, conversational UI is a series of questions and answers. However, adding a contextual aspect to it is what makes it a "true" conversational experience. By enabling context understanding, the bot can keep track of the conversation through its different stages and connect different requests to one another. The entire flow of the conversation is taken into consideration, not just the last request.
In every conversational bot we build, whether a chatbot or a voicebot, the interaction will have two sides:
The end user will ask, "Can I book a flight?"
The bot will respond, "Yes." The bot might also add, "Do you want to fly international?"
The end user can then approve this or respond by saying, "No, domestic."
A contextual conversation is very different from a simple Q&A. In the preceding scenario, the user could have responded in several different ways, and the bot must be able to deal with all of those flows.
State machine
One methodology for dealing with different flows is to use a state machine. This popular and simple way to describe context connects each state (phase) of the conversation to the next state, depending on the user's reaction.
However, the advantage of a state machine is also its disadvantage. This methodology forces us to map every possible conversational flow in advance. While it is very easy to use for building simple use cases, it is extremely difficult to understand and maintain over time, and it's impossible to use for more complicated flows (flight booking, for example, is a complex flow that can't be supported using a state machine). Another problem with the state machine method is that, even for simple use cases, supporting multiple use cases that share the same response still requires us to duplicate much of the work.
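A minimal sketch of the state machine idea, based on the short flight dialogue earlier in this section, might look like the following. The state names, the reaction labels, and the `TRANSITIONS` table are hypothetical; the reactions are assumed to have already been recognized by the NLU layer.

```python
# Each (current state, user reaction) pair maps to the next state.
# Every path the conversation may take has to be listed here in advance.
TRANSITIONS = {
    ("start", "book_flight"): "ask_international",
    ("ask_international", "yes"): "international_booking",
    ("ask_international", "no_domestic"): "domestic_booking",
}


def next_state(state, reaction):
    """Follow a mapped transition; stay in place if the reaction was not anticipated."""
    return TRANSITIONS.get((state, reaction), state)


def run(reactions, state="start"):
    for reaction in reactions:
        state = next_state(state, reaction)
    return state


print(run(["book_flight", "no_domestic"]))
print(run(["no_domestic"]))
```

The second call shows the limitation: a user who opens with "domestic" before being asked falls outside the mapped flow, so the machine does not move. Supporting that order would require adding more states and transitions by hand.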
Event-driven contextual approach
The event-driven contextual approach is a more suitable method for today's conversational UI. It lets users express themselves freely and doesn't force them through a specific flow. Since it's impossible to map the entire conversational flow in advance, the event-driven contextual approach focuses on the context of the user's request and gathers all the information it needs in an unstructured way, narrowing down the remaining options as it goes.
Using this methodology, the user leads the conversation, and the machine analyzes the data and completes the flow in the background. This method allows us to depart from the restricting GUI state machine flow and provide human-level interaction.
In this example, the machine knows that it needs the following parameters to complete a flight booking:
- Departure location
- Destination
- Date
- Airline
The user in this case can freely say, "I want to book a flight to NYC," or "I want to fly from SF to NYC tomorrow," or "I want to fly with Delta."
For each of these flows, the machine will return to the user to collect the missing information:
By building a conversational flow in an event-driven contextual approach, we succeed in mimicking our interaction with a human agent. When booking a flight with a travel agent, I start the conversation and provide the details that I know. The agent, in return, will ask me only for the missing details and won't force me to state each detail at a certain time.
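The approach described above can be sketched as a small slot-filling loop: each utterance is treated as an event, whatever parameters it contains are stored in the conversation context, and the bot asks only for what is still missing. The extraction below uses tiny hard-coded vocabularies and regular expressions purely for illustration; `CITIES`, `AIRLINES`, `DATES`, `extract_slots`, and `BookingContext` are all hypothetical names, and a real bot would rely on its NLU layer instead.

```python
import re

REQUIRED_SLOTS = ["departure", "destination", "date", "airline"]

# Tiny illustrative vocabularies; a real system would get these from the NLU layer.
CITIES = {"sf", "nyc"}
AIRLINES = {"delta"}
DATES = {"today", "tomorrow"}


def _first_known(pattern, text, vocabulary):
    """Return the first regex capture that belongs to the vocabulary, else None."""
    for candidate in re.findall(pattern, text):
        if candidate in vocabulary:
            return candidate
    return None


def extract_slots(utterance):
    """Pull whatever booking parameters the utterance happens to contain."""
    text = utterance.lower()
    words = re.findall(r"\w+", text)
    found = {
        "departure": _first_known(r"\bfrom (\w+)", text, CITIES),
        "destination": _first_known(r"\bto (\w+)", text, CITIES),
        "date": next((w for w in words if w in DATES), None),
        "airline": next((w for w in words if w in AIRLINES), None),
    }
    return {slot: value for slot, value in found.items() if value is not None}


class BookingContext:
    """Accumulates slots across turns, in whatever order the user provides them."""

    def __init__(self):
        self.slots = {}

    def handle(self, utterance):
        """Store new information and return the slots that are still missing."""
        self.slots.update(extract_slots(utterance))
        return [slot for slot in REQUIRED_SLOTS if slot not in self.slots]


context = BookingContext()
print(context.handle("I want to book a flight to NYC"))
print(context.handle("From SF tomorrow"))
print(context.handle("I want to fly with Delta"))
```

Because the context only tracks which parameters are filled, the same code handles any of the three opening sentences from the example, in any order, without a transition being defined for each path.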
Business logic/dynamic data
At this stage, I think we can agree that building a conversational UI is not an easy task. In fact, many bots today don't use NLU and avoid free-speech interaction. We had great expectations of chatbots and with those high expectations came a great disappointment. This is why many chatbots and voicebots today provide mostly simple Q&A flows.
Most of those bots have a limited offering, and the business logic is connected to two or three specific use cases, such as opening hours or a phone number, no matter what the user is asking for. In other very popular chat interfaces, bots still lean on the GUI, offering a menu selection and eliminating free text.
However, if we are building true conversational communication between our bot and our users, we must make sure that we connect it to dynamic business logic. So, after we have enabled speech recognition, worked on our NLU, built samples, and developed an event-driven contextual flow, it is time to connect our bot to dynamic data. To reach real-time data, and to be able to run transactions, our bot needs to connect to the business logic of our application. This can be done through APIs to your backend systems.
Going back to our flight booking bot, we would need to retrieve real-time data on when the next flight from SF to NYC is, what seats are still available, and what the price is for the flight. Our APIs can also help us to complete the order and approve a payment. If you are lacking APIs for some of the needed data and functions, you can develop new ones or use screen-scraping techniques to avoid complex development work.
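As a rough sketch of this last layer, the snippet below shows a bot response built from backend data rather than from a fixed answer. The backend is replaced by an in-memory stand-in so the example runs on its own; `Flight`, `InMemoryFlightApi`, `answer_next_flight`, the flight numbers, and the prices are all made up for illustration. In a real bot, `next_flight` would call your own backend API.

```python
from dataclasses import dataclass


@dataclass
class Flight:
    number: str
    departs: str      # "HH:MM", zero-padded so string comparison sorts by time
    seats_left: int
    price_usd: float


class InMemoryFlightApi:
    """Stand-in for a real backend API; the data it holds is invented."""

    def __init__(self, flights):
        self._flights = flights

    def next_flight(self, origin, destination):
        """Return the earliest flight on the route that still has seats, or None."""
        options = [
            flight
            for flight in self._flights.get((origin, destination), [])
            if flight.seats_left > 0
        ]
        return min(options, key=lambda flight: flight.departs) if options else None


def answer_next_flight(api, origin, destination):
    """Turn live backend data into the sentence the bot says to the user."""
    flight = api.next_flight(origin, destination)
    if flight is None:
        return f"Sorry, I found no available flights from {origin} to {destination}."
    return (
        f"The next flight from {origin} to {destination} is {flight.number} "
        f"at {flight.departs}, with {flight.seats_left} seats left, "
        f"for ${flight.price_usd:.2f}."
    )


api = InMemoryFlightApi({
    ("SF", "NYC"): [
        Flight("XX100", "09:30", 0, 250.0),
        Flight("XX200", "14:15", 12, 310.0),
    ],
})
print(answer_next_flight(api, "SF", "NYC"))
```

Keeping the backend behind a small interface like this means the conversational layers do not change when the stand-in is swapped for real API calls.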