
The Future of Voice — why the marriage of voice and screen is critical to its development.

My first foray into voice was an app that asked users to list whatever ingredients they’d like to cook with and gave them recipes they could make with what they had on hand. To make the experience smoother, I assumed they already had common staples like salt, flour, milk, and eggs. While the app didn’t catch on quite as well as I would have liked, it proved to be an incredibly effective education in how to design for voice, what to take into consideration when building a voice app, and how to create user experiences compelling enough to keep people coming back.

To save some space in this particular article, I’ll leave the details for a future piece, but suffice it to say, a few things proved absolutely key. The first: accuracy can be sacrificed in favor of reliability. Unless you’re building for a very niche application (medicine, note taking), 100% accuracy in understanding users isn’t something that needs to be prioritized too heavily. That degree of accuracy is, of course, nice to have, but it won’t make or break a voice product. What’s far more important is being able to handle a wide range of user interactions and give people something back no matter what they ask. There are an almost infinite number of ways a person can ask what they can make with spinach. Or chicken. Or beets. However, all of these variations of asking what one can make with *insert whatever you have in your fridge here* lead to one result: an algorithm finding the closest matching recipe to what you have. Vagueness into specificity.
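To make that idea concrete, here’s a minimal sketch of “vagueness into specificity.” The recipe data, the pantry-staple list, and the overlap score are all illustrative placeholders, not the actual matching logic the app used:

```typescript
// Minimal sketch: however the user phrases the question, the answer is
// whichever recipe their ingredients cover best.
interface Recipe {
  name: string;
  ingredients: string[];
}

// Staples the skill assumes every user already has on hand.
const PANTRY_STAPLES = ["salt", "flour", "milk", "eggs", "butter", "oil"];

const RECIPES: Recipe[] = [
  { name: "Spinach omelette", ingredients: ["spinach", "eggs", "butter", "salt"] },
  { name: "Roast chicken", ingredients: ["chicken", "oil", "salt"] },
  { name: "Beet salad", ingredients: ["beets", "oil", "salt"] },
];

// Score a recipe by the fraction of its ingredients the user can cover.
function coverage(recipe: Recipe, userIngredients: string[]): number {
  const available = new Set([
    ...PANTRY_STAPLES,
    ...userIngredients.map((i) => i.toLowerCase()),
  ]);
  const covered = recipe.ingredients.filter((i) => available.has(i)).length;
  return covered / recipe.ingredients.length;
}

// Every phrasing of "what can I make with X?" funnels into this one lookup.
function closestRecipe(userIngredients: string[]): Recipe {
  return RECIPES.reduce((best, candidate) =>
    coverage(candidate, userIngredients) > coverage(best, userIngredients)
      ? candidate
      : best
  );
}

console.log(closestRecipe(["spinach"]).name); // "Spinach omelette"
```

The point isn’t the scoring function; it’s that the skill never has to care exactly how the question was asked, only what it can do with the answer.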

In almost every case, it’s better for a voice-first product to say “I don’t know how to do that” than “Sorry, I couldn’t understand what you were saying.” Any situation where users have to repeat themselves is one in which they get frustrated and quickly stop using the platform. It’s early days, and everyone knows a voice AI won’t be able to do everything. People are OK with this. What they’re not OK with is having to repeat questions, phrases, and sentences.
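As a rough illustration of that principle, here’s how a catch-all handler might look in an Alexa skill built with the ASK SDK for Node.js. The response wording is the point; the specific prompts and the recipe framing are assumptions made for this example:

```typescript
import {
  HandlerInput,
  RequestHandler,
  getRequestType,
  getIntentName,
} from "ask-sdk-core";

// Catch anything the skill doesn't explicitly handle and respond gracefully,
// rather than asking the user to repeat themselves.
const FallbackHandler: RequestHandler = {
  canHandle(handlerInput: HandlerInput): boolean {
    return (
      getRequestType(handlerInput.requestEnvelope) === "IntentRequest" &&
      getIntentName(handlerInput.requestEnvelope) === "AMAZON.FallbackIntent"
    );
  },
  handle(handlerInput: HandlerInput) {
    return handlerInput.responseBuilder
      .speak(
        "I don't know how to do that yet, but I can suggest a recipe if you tell me an ingredient."
      )
      .reprompt("What ingredient would you like to cook with?")
      .getResponse();
  },
};
```

The fallback admits a limitation and immediately offers something the skill *can* do, instead of bouncing the user back into a re-prompt loop.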

The second important thing I learned was that voice alone simply wasn’t enough. While voice is an incredible way of interacting with technology, it needs to be tied into and used in conjunction with other mediums, namely screens. Getting a summary of your daily schedule, checking the weather, or skipping through songs can easily be done entirely through voice, but anything that requires a higher degree of interactivity, or presents a larger amount of information, needs to be made available on a screen.

My recipe app was a great example of this. Listing what ingredients you had, naming the type of meal you wanted to make, and choosing a recipe were perfectly suited to a voice-only interface. However, reading the actual recipe, and continually referencing it through voice alone, was a subpar experience. It worked, but it wasn’t compelling enough to make people come back on a regular basis. At the time, the only voice-first devices I could build for were voice only, so there weren’t many options for revamping the experience. I did, however, create a large number of mockups and simulated experiences to prototype what a voice-plus-screen experience would feel like. Suffice it to say, it was a good one. All interactions were done conversationally and through voice, but critical and long-form information was also presented to the user on a device screen, which gave them a more tangible way of navigating the app and kept far more content visible than they could have remembered if someone had simply said it to them.

I also worked at a voice-first startup for a little while, building a voice-only web browser and a few products aimed at the dental and medical communities. It became obvious, though, that any voice product without an accompanying screen was ultimately going to fail unless great care was taken to make the voice-only experience succinct, fast, and clear enough to avoid user frustration.

Voice is an incredible medium and offers amazing flexibility in designing user experiences, but it needs to be paired with an additional interface to offer the most comprehensive usability. This is where multimodal design comes into play, and it’s where Alexa’s recent developments, both in product and in developer tools, really shine. Echo Show devices have been around for a little while, but it was only in November 2018 that developers got the ability to design truly rich experiences for screen-enabled Alexa devices.

These features, collectively called APL (Alexa Presentation Language), allow developers to create a seamless integration between the voice interface users talk to and the display that shows them what they want to see. In a sense, it’s analogous to designing voice-first web pages: all the navigation, pagination, and design are oriented towards supplementing a voice interaction with relevant content such as video, text, and graphics.
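To give a sense of what building with APL looks like, here’s a hedged sketch of a handler that speaks a short confirmation while rendering long-form content on screen via a RenderDocument directive. The APL document and data source are deliberately trivial placeholders, not a design you’d actually ship:

```typescript
import { HandlerInput } from "ask-sdk-core";

// A deliberately tiny APL document: a single full-screen text component.
// A real skill would use a much richer layout authored for the device.
const recipeDocument = {
  type: "APL",
  version: "1.0",
  mainTemplate: {
    parameters: ["payload"],
    items: [
      {
        type: "Text",
        text: "${payload.recipe.title}",
        fontSize: "40dp",
        textAlign: "center",
      },
    ],
  },
};

// Hypothetical handler body: keep the spoken response short and push the
// long-form content to the screen with an APL RenderDocument directive.
function buildRecipeResponse(handlerInput: HandlerInput, recipeTitle: string) {
  return handlerInput.responseBuilder
    .speak(`Here's ${recipeTitle}. The full recipe is on your screen.`)
    .addDirective({
      type: "Alexa.Presentation.APL.RenderDocument",
      token: "recipeToken",
      document: recipeDocument,
      datasources: {
        payload: { recipe: { title: recipeTitle } },
      },
    })
    .getResponse();
}
```

The voice response stays conversational, while the screen carries the content a user couldn’t reasonably hold in memory from speech alone.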

As soon as these features became available, I started rebuilding the Yada platform to take advantage of them, and in early testing of the changes we immediately saw the kind of impact a truly multimodal experience could have. Those results reaffirmed the assumption that the best way to build for voice isn’t to limit yourself to voice, but to use voice and conversation in native integration with content better suited to screens.

Of course, building for voice is a continually evolving process, and it may be a while before best practices are truly solidified and understood, but I’m confident that prioritizing a multimodal product is the right thing to do, and the best way to give users as much functionality as possible.