Finally forcing computers to think like humans

On 17th June 2020 I attended my first AWS summit. As I’m new to the industry, it was one of the first tech conferences I’d been to (other than the Women in Development conference earlier this year, back in that beautiful time when people were allowed to meet in groups…) and certainly the first virtual conference I’ve attended. I didn’t quite know what to expect, but it ended up being an intense, engaging half-day where I learnt a LOT.

The agenda for the day was jam-packed and very diverse – 5 sessions over the course of the morning, with 11 different speakers to choose between for each one. The sessions were helpfully labelled with things like “I’m a data scientist”, “I work with emerging tech” or even “I’m new to the cloud” to help guide you to the right sessions for you, but really there were no limits and we were encouraged to branch out and try new things.

The introductory keynote, given by Dr Werner Vogels, CTO of Amazon.com, set the excited tone for the rest of the day. Dressed in a black BATTLEBOTS t-shirt, he was full of child-like passion and enthusiasm, and his talk was very relevant to these strange times we’re living through, with interesting references to increased media consumption during lockdown, and projects that AWS is involved in to help keep isolated people and communities connected.

Before becoming a software developer this year, I worked in the charity industry for 6 years. Since my career change I have learnt that the world is not, as I had naïvely thought, split into worthwhile causes and evil corporations. The summit was another reminder that there are so many opportunities to use tech for good in the world. The keynote talk introduced amazing organisations like Wefarm and Nextdoor that are using AWS to make their work possible. There was also a fascinating (though all-too-short) session from Ana Visneski, Principal Technical Program Manager of AWS Disaster Response, where I was delighted to learn not only that AWS has a Disaster Response team, but that they have been using technology to respond to disasters like Hurricane Dorian in 2019, the Mt. Kilauea volcano eruption in Hawaii in 2018, and of course, most recently and relevantly, the COVID-19 pandemic.

I did attend a couple of “Introduction to…” sessions, but unfortunately, in my experience, “introductory” sessions shorter than an hour rarely seem to be appropriate for a beginner audience. Speakers can, in the same breath, go from explaining what the service even does to detailing price plans in terms of GHz, TB and RAM. (That said, I did once attend an excellent 2-hour introduction session to S3, so it is possible!)

The highlight of my day (appealing to my geeky linguist background) was a session called “Build and deploy an Amazon Lex chatbot and convert it to an Alexa Skill”, expertly delivered by Sohan Maheshwar, Developer Advocate at AWS.

As Sohan said, when you think of computing devices, you might think of PCs, laptops, mobile phones, even thermostat displays or TVs with remotes. With the exception of the introduction of touch screen devices in the 2000s, user interfaces haven’t changed much since keyboards from the 70s and mice from the 80s. Recent advances in machine learning and natural language processing have for the first time made “conversational interfaces” possible, such as Amazon Alexa or Google Home. Sohan pointed out that this is a game-changer in the industry for 2 main reasons: firstly that it lowers the barrier into technology, and secondly – and the most powerful statement of the day for me – because “so far, we as humans have been forced to think like computers… but with conversational interfaces, computers are being forced to think like humans.”

Amazon Lex is the AWS service for building conversational interfaces using both voice and text. It is a complete service, so you don’t need any machine learning experience, just a little bit of code. Lex handles speech recognition, speech-to-text conversion, dialogue management and deployment to multiple platforms, and even provides analytics to help improve your chatbot.

If a user interacts with a Lex chatbot via speech, Lex uses ASR (Automatic Speech Recognition, the same speech API as Alexa) to convert speech to text, with a lot of language-specific context for accuracy. Once the speech is converted to text, or if a user interacts with the chatbot via text, Lex uses NLU (Natural Language Understanding, the same language API as Alexa) to convert natural, human conversation into a structure that computers can understand. The Lex chatbot sends this structure to the back-end of your application, where your code can process it and return a response, which the Lex chatbot then presents to the user.
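
As a rough illustration of that last step (this is not code from the session), the back-end could be an AWS Lambda function written in Python. The sketch below assumes the event and response shapes of the original (V1) Lex Lambda integration, where the matched intent and its slot values arrive under `currentIntent`; intents and slots are explained further down. The slot names `PickupDate` and `PickupTime` and the wording of the reply are hypothetical.

```python
def lambda_handler(event, context):
    """Fulfil an order once Lex has collected every slot.

    Assumes the Lex V1 Lambda event format, where the matched intent and
    its slot values arrive under event["currentIntent"].
    """
    intent = event["currentIntent"]
    slots = intent["slots"]

    if intent["name"] == "OrderFlowers":
        # PickupDate and PickupTime are hypothetical slot names.
        reply = (
            f"Thanks, your {slots['FlowerType']} will be ready for collection "
            f"at {slots['PickupTime']} on {slots['PickupDate']}."
        )
        state = "Fulfilled"
    else:
        reply = "Sorry, I can't help with that yet."
        state = "Failed"

    # "Close" tells Lex that this intent is finished and hands it the
    # message to present to the user.
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": state,
            "message": {"contentType": "PlainText", "content": reply},
        }
    }


# Local try-out with a hand-written event (no AWS account needed).
sample_event = {
    "currentIntent": {
        "name": "OrderFlowers",
        "slots": {
            "FlowerType": "tulips",
            "PickupDate": "2020-06-18",
            "PickupTime": "09:00",
        },
    }
}
print(lambda_handler(sample_event, None)["dialogAction"]["message"]["content"])
```

The handler never sees audio or raw sentences, only the structured result, which is why the same back-end can serve both voice and text users.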

The structure that computers use to understand human conversation is built up of ‘utterances’, ‘intents’ and ‘slots’. In order to demonstrate what these are, Sohan gave this simple example of a chatbot program to order flowers:

The intent is the appropriate action in response to the user’s input. I like to think of it as the function you want the chatbot to call. For example, in the interaction above, the intent might be called `OrderFlowers`.

The utterances are the spoken or typed phrases from the user that trigger this intent. In the interaction above, “I would like to buy some flowers” is the utterance. Every utterance needs to be mapped to an intent, but there can also be MANY different utterances for the same intent, such as “I would like to buy some flowers”, “I want to buy some flowers”, “I’d like to order some flowers”, “I want to purchase some flowers”, etc. When building your chatbot you need to try to imagine all the utterances your users might come up with, and map all of them to your intent.

The slots are all the pieces of information or data that the chatbot needs in order to complete the action. You can think of them as the parameters to the intent function, or simply all the information needed to complete the order or request. In the example above, the slots are the type of flower, the date of flower collection and the time of flower collection. For the type of flower, for instance, the slot would be `FlowerType` and the slot value would be `Tulips`. Slot values will vary from user to user.
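
To see how the three pieces fit together, here is a minimal Python sketch of the idea. It is only a mental model, not how Lex is implemented or configured: the `Intent` class and `match_intent` function are made up for illustration, exact string matching stands in for the real NLU step, and the slot names other than `FlowerType` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Intent:
    """One action the bot can perform, e.g. ordering flowers."""
    name: str
    utterances: List[str]  # phrases that should trigger this intent
    slots: List[str]       # names of the pieces of data the intent needs


order_flowers = Intent(
    name="OrderFlowers",
    utterances=[
        "I would like to buy some flowers",
        "I want to buy some flowers",
        "I'd like to order some flowers",
        "I want to purchase some flowers",
    ],
    # FlowerType comes from the talk; the other two names are hypothetical.
    slots=["FlowerType", "PickupDate", "PickupTime"],
)


def match_intent(user_input: str, intents: List[Intent]) -> Optional[Intent]:
    """Return the intent whose utterance list contains the input, if any.

    Exact matching (ignoring case and surrounding spaces) is a stand-in
    for real natural language understanding.
    """
    normalised = user_input.strip().lower()
    for intent in intents:
        if any(normalised == u.lower() for u in intent.utterances):
            return intent
    return None


matched = match_intent("i want to buy some flowers", [order_flowers])
print(matched.name if matched else "No matching intent")
```

Many utterances point at one intent, and the intent in turn lists the slots it needs before it can be carried out.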

It is very unlikely that a user will give the bot all the slots it needs to complete the action in one go. Humans tend to prefer back-and-forth conversation like the one above, rather than “I would like to order some tulips to be collected at 9am tomorrow.” To obtain all the slots of information they need, Lex bots use prompts, which can be either spoken or typed. Each slot is mapped to a prompt so that if the user doesn’t provide all the information, the bot can ask the prompt for each missing slot until it has enough information to fulfil the intent. For example, in the conversation above, the `FlowerType` slot is mapped to the prompt “What type of flowers would you like to order?”.
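
The slot-filling behaviour described here can be sketched in a few lines of Python. Lex does this for you, so the code below is purely illustrative: `elicit_slots` is a made-up helper, and only the `FlowerType` prompt is quoted from the talk, while the other slot names and prompts are hypothetical.

```python
from typing import Callable, Dict, Optional

# Each slot is mapped to the prompt used to ask for it. The first prompt is
# quoted in the post; the other two slot names and prompts are hypothetical.
PROMPTS: Dict[str, str] = {
    "FlowerType": "What type of flowers would you like to order?",
    "PickupDate": "What day would you like to collect them?",
    "PickupTime": "What time would you like to collect them?",
}


def elicit_slots(
    slots: Dict[str, Optional[str]],
    ask: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Ask the prompt for every empty slot until all slots have a value."""
    filled = dict(slots)
    for slot_name, prompt in PROMPTS.items():
        # Keep asking while the slot is missing or the answer was empty.
        while not filled.get(slot_name):
            filled[slot_name] = ask(prompt).strip()
    return filled


# The user has already said "tulips", so only the date and time are asked for.
canned_answers = iter(["tomorrow", "9am"])
order = elicit_slots(
    {"FlowerType": "Tulips", "PickupDate": None, "PickupTime": None},
    ask=lambda prompt: next(canned_answers),
)
print(order)
```

Passing `ask` in as a function means the same loop would work whether the prompt is typed into a chat window or spoken aloud; here canned answers stand in for a real user.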

Alexa ‘Skills’ are the equivalent of apps in the Apple App Store or Google Play Store. There are over 100,000 Skills, and anyone can build a voice-based bot for free and upload it to the Alexa Skill store. Sohan logged into Amazon Lex to demonstrate that it is very easy to export the front-end of a Lex bot (that is, its intents, prompts and utterances) as an Alexa Skill to then upload to the Skill store.

Finally, Sohan mentioned some interesting key features to bear in mind when designing apps for text or for voice.

Apps designed for text are designed to be written and read. Therefore you might want to personalise your app’s messages with images and emojis. You can afford to give the user more information, as they have time to process and re-read if necessary. You can also present multiple options, for example: “Would you like fries or salad?”

Apps designed for voice, however, are designed to be heard and spoken to. Therefore you could personalise with speech and sound effects. You should keep the information brief, as users cannot concentrate to the same extent as when they’re reading. Amazon even has a “one breath” rule: if you can’t say your content in a single breath, then it’s probably too long! You should also present definitive choices, because if you ask a user verbally “Would you like fries or salad?”, they might simply reply “Yes”! It’s better to phrase it as “Which would you like? Fries or salad?”
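
The “definitive choice” phrasing is easy to bake into a small helper. This is a minimal sketch; `voice_choice_prompt` is a hypothetical function of my own naming, not part of Lex or the Alexa tooling.

```python
from typing import List


def voice_choice_prompt(options: List[str]) -> str:
    """Phrase a choice so that "Yes" is not a sensible answer."""
    if len(options) < 2:
        raise ValueError("Need at least two options to offer a choice")
    *rest, last = options
    listed = f"{', '.join(rest)} or {last}"
    # Capitalise only the first letter so the options read as a sentence.
    return f"Which would you like? {listed[0].upper()}{listed[1:]}?"


print(voice_choice_prompt(["fries", "salad"]))
```
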

If you’re interested in finding out more about building your own Lex bot and/or deploying it as an Alexa Skill, there is bountiful documentation available on AWS here: https://docs.aws.amazon.com/lex/latest/dg/what-is.html

I know I’ve certainly started imagining what ‘utterances’ our customers might use if they were going to submit a Resolver complaint through Alexa, which ‘slots’ of information we’d need in order to submit their complaint for them, and what ‘prompts’ we might choose to gather those slots… Watch this space, a Resolver chatbot might be available on the Alexa Skill store before you know it!

By Elise Aston