Architecture of a conversational user interface
In this section, let's take a look at the basic architecture of a conversational interface.
The core module of a conversational interface is the conversation manager. This module controls the flow of the conversation. It takes the semantic representation of what the user says as input and decides what the system's response should be. It maintains a representation of the conversational context in some form, say a set of key-value pairs, so that it can carry the conversation meaningfully over several turns between the user and the system.
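To make the role of the conversation manager concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class name `ConversationManager`, the `book_ticket` intent, and the `destination` and `date` slots are all hypothetical, and a real manager would use a far richer dialogue policy.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationManager:
    """Toy conversation manager: tracks context and picks the next response."""

    # Conversational context kept as key-value pairs across turns.
    context: dict = field(default_factory=dict)

    def handle(self, intent: str, slots: dict) -> str:
        # Merge the slot values from this turn into the running context.
        self.context.update(slots)

        if intent == "book_ticket":  # hypothetical intent name
            # Ask for whichever required detail is still missing.
            for name in ("destination", "date"):
                if name not in self.context:
                    return f"What is the {name}?"
            return (
                f"Booking a ticket to {self.context['destination']} "
                f"on {self.context['date']}."
            )
        return "Sorry, I did not understand that."


manager = ConversationManager()
print(manager.handle("book_ticket", {"destination": "Paris"}))  # What is the date?
print(manager.handle("book_ticket", {"date": "Friday"}))  # Booking a ticket to Paris on Friday.
```

Because the destination from the first turn stays in the context, the second turn only needs to supply the date. That is the multi-turn behavior the stored context makes possible.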
The semantic representation of the user input can be fed directly from button pushes. In systems that can understand language, a natural language understanding module translates user utterances into a semantic representation consisting of user intents and parameters (slots and entities). This module may need to be trained beforehand to understand the set of user intents that the developer has identified for the conversational tasks at hand.
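As a rough illustration of what such a semantic representation might look like, the sketch below maps an utterance to an intent and slots. It uses simple keyword matching as a stand-in for a trained model, and the intent names, keyword lists, and city list are assumptions made up for this example.

```python
import re

# Hypothetical intents with trigger keywords (a stand-in for a trained model).
INTENT_KEYWORDS = {
    "book_ticket": ("book", "ticket"),
    "tv_schedule": ("tv", "schedule"),
}

# Hypothetical entity values that this toy parser can recognize.
KNOWN_CITIES = ("paris", "london", "tokyo")


def understand(utterance: str) -> dict:
    """Map a text utterance to a semantic representation (intent plus slots)."""
    words = re.findall(r"[a-z]+", utterance.lower())

    # Pick the first intent that has any of its keywords in the utterance.
    intent = None
    for name, keywords in INTENT_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            intent = name
            break

    # Fill the destination slot if a known city is mentioned.
    slots = {}
    for word in words:
        if word in KNOWN_CITIES:
            slots["destination"] = word.capitalize()

    return {"intent": intent, "slots": slots}


print(understand("Book a ticket to Paris"))
```

A button push can produce the same kind of dictionary directly, without any language understanding, which is why both input paths can feed the same conversation manager.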
Voice-enabled interfaces that accept the user's speech input also need a speech recognition module that transcribes speech into text before feeding it into the natural language understanding module. Symmetrically, on the output side, a speech synthesizer (or text-to-speech engine) module is needed to convert the system's text response into speech.
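The symmetry between the input and output sides can be sketched as a small pipeline. The functions below are stand-in stubs, not real speech engines: they simply treat the "audio" as UTF-8 encoded text so that the flow can be run end to end. All names are assumptions for illustration.

```python
from typing import Callable


def voice_turn(
    audio: bytes,
    recognize: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one voice turn: speech -> text -> response text -> speech."""
    text = recognize(audio)   # speech recognition
    reply = respond(text)     # language understanding plus conversation manager
    return synthesize(reply)  # text-to-speech


# Stand-in stubs (not real engines): the "audio" is just UTF-8 encoded text.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


print(voice_turn(b"hello", fake_recognize, fake_respond, fake_synthesize))
```

Passing the three stages in as functions keeps the text-based core unchanged. A text-only chatbot would call the middle stage alone.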
The conversation manager also interacts with backend modules. A backend can be a database or an online data source that is queried to answer a user's question (for example, a TV schedule), or an online service that carries out a user's instruction (for example, booking a ticket).
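The two kinds of backend described above, one that answers questions and one that carries out instructions, can be sketched as follows. The schedule data, function name, and `BookingService` class are hypothetical placeholders. A real system would call a database or a remote API instead.

```python
# Hypothetical in-memory data standing in for a database or online source.
TV_SCHEDULE = {"20:00": "Evening News", "21:00": "Nature Documentary"}


def query_schedule(time: str) -> str:
    """Answer a user's question by querying the (toy) data source."""
    show = TV_SCHEDULE.get(time)
    if show is None:
        return f"Nothing is listed at {time}."
    return f"{show} is on at {time}."


class BookingService:
    """Stand-in for an online service that carries out a user's instruction."""

    def __init__(self):
        self.bookings = []

    def book(self, destination: str, date: str) -> int:
        # Record the booking and return a simple reference number.
        self.bookings.append({"destination": destination, "date": date})
        return len(self.bookings)


print(query_schedule("20:00"))  # Evening News is on at 20:00.
service = BookingService()
print(service.book("Paris", "Friday"))  # 1
```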
The channel is where the chatbot actually meets the user. Depending on the channel, there may be one or more modules that make up this layer. For instance, if the chatbot is on Facebook Messenger, this layer consists of a Facebook Page and a Facebook App that connects to the rest of the chatbot modules wrapped as a web app.
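One way to picture the channel layer is as a thin adapter that unwraps an incoming channel message, hands the text to the rest of the chatbot, and wraps the reply for delivery. The sketch below uses a made-up JSON payload with `sender` and `text` fields. This is not the actual Facebook Messenger format, and each real channel defines its own schema.

```python
import json
from typing import Callable


def handle_channel_request(body: str, reply_fn: Callable[[str], str]) -> str:
    """Unwrap a channel payload, get the chatbot's reply, and wrap it again."""
    # Hypothetical payload shape: {"sender": ..., "text": ...}.
    event = json.loads(body)
    reply = reply_fn(event["text"])
    return json.dumps({"recipient": event["sender"], "text": reply})


incoming = json.dumps({"sender": "user-1", "text": "hi"})
print(handle_channel_request(incoming, lambda text: f"You said: {text}"))
```

Keeping the channel-specific wrapping in one place means the same chatbot modules could, in principle, sit behind several channels.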