Voice Controlled Games: The Rise of Speech Technology in Gaming

Last Updated June 22, 2017

The steadily growing speech technology industry has indeed found a comfortable home in, well, our homes.

Serving as research assistant, DJ and chief technology officer, speech technology connects all the smart devices in your house, from thermostats to security systems.

And that’s not to mention its debut in our smartphones, voice technology inside our cars and even speech-enabled workplace assistants.

Needless to say, unique voice-first experiences are already taking the world by storm.

However, beyond these use cases, where speech recognition is implemented to simplify our lives, it’s also making strides in other areas.

Enter: the rise of speech technology in gaming.

Goals and Challenges

Creating a video game is already extraordinarily difficult.

It takes years to flesh out the plot, gameplay, character development, customizable gear, lottery systems, worlds and game mechanics, decide which languages will be available, record voice-overs with actors, choose between designing an FPS (first-person shooter) or an RPG (role-playing game), and more.

Hundreds of moving parts have to come together seamlessly.

Not only that, but the game has to be able to change and adapt based on each player’s actions.

How should your interaction with a weapon-maker change based on the items in your bag? Or based on whether you spoke to a townsman before or after receiving your mission?

In open-world games in particular, developers have to run through, test and create content for an enormous number of scenarios, all of which need to adapt to your character’s actions in-game.

Now, just imagine adding another level to gaming through speech recognition technology.

Many of the companies championing this idea aim to make gaming more accessible for visually and/or physically impaired people, and to let players immerse themselves further in gameplay by adding yet another layer of interaction.

Voice control could also lower the learning curve for beginners, since less importance would be placed on figuring out controls; players can “just” start talking right away.

These ideas are sure to have avid gamers sitting on the edges of their seats in anticipation.

However, it’ll also be extremely challenging for game developers, who will have to account for hundreds (if not thousands) of hours of voice data collection, speech technology integration, testing and coding in order to retain their international audiences.

Developers have to take into account accents, dialects and whole languages on top of baseline video game localization for players in different cultures.

Then there’s the work of gathering all the different phrases a player might say during the game or use to command their character.

Recording and implementing only a handful of expected phrases, without accounting for natural language utterances, means the player might never say the “correct” phrase to trigger a response.
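To make the “correct phrase” problem concrete, here is a toy sketch contrasting a rigid exact-phrase lookup with a slightly looser keyword match. It is not how any game mentioned here works; the phrase table, keyword sets and intent names (`open_door`, `switch_weapon`) are hypothetical, and real natural language understanding relies on far larger, localized data and statistical models.

```python
import re
from typing import Optional

# Hypothetical scripted phrases: the rigid approach only reacts to these exact strings.
EXACT_PHRASES = {
    "open the door": "open_door",
    "switch weapon": "switch_weapon",
}

# Hypothetical keyword sets: an intent fires when all of its keywords appear,
# in any order, surrounded by any filler words.
INTENT_KEYWORDS = {
    "open_door": {"open", "door"},
    "switch_weapon": {"switch", "weapon"},
}


def match_exact(utterance: str) -> Optional[str]:
    """Return an intent only if the utterance is exactly one of the scripted phrases."""
    return EXACT_PHRASES.get(utterance.lower().strip())


def match_keywords(utterance: str) -> Optional[str]:
    """Return the first intent whose keywords all occur in the utterance."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if keywords <= words:  # subset test: every keyword was spoken
            return intent
    return None


if __name__ == "__main__":
    for said in ["Open the door", "Could you open that door for me?", "Grab my shotgun"]:
        print(f"{said!r}: exact={match_exact(said)}, keywords={match_keywords(said)}")
```

Only the first sentence triggers the rigid matcher; the keyword matcher also catches the polite paraphrase, yet both miss “Grab my shotgun”. Every new way of phrasing a command has to be anticipated somewhere, which is why voice data collection matters so much.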

A lot more than one might realize goes into making speech recognition work smoothly.

After all, there’s nothing more frustrating than having to repeat yourself over and over again, only to be met with your character’s impassive-yet-expectant stare.

So, let’s walk through the rise of speech technology in gaming: where we’ve been, where we are, and what we expect to see more of.

The Rise: Early Stages

It’s fair to assume that, as with any new technology, those who pioneer change will also be the ones dealing with the challenges and bumps in the road that come with uncharted territory.

Bot Colony

North Side Inc. was one of these pioneers. The company saw the potential of speech recognition long before Siri’s debut on the Apple iPhone popularized it, and took on the challenge of building a full-fledged 3D video game centered around its speech recognition technology. The game was named Bot Colony: a sci-fi story in which players find themselves in a technology-driven future where robots have become ubiquitous helpers in human society.

North Side’s technology, built around “Natural Language Understanding” (NLU) and processing, is meant to recognize natural speech patterns rather than a limited set of rigid, static command phrases, so players can command the robots remotely using speech. At least, that’s the idea behind it. Actually playing the game is an entirely different experience.

“The reality of Bot Colony is that it’s as frustrating as it is inspiring,” says Josh Tolentino of Game Critic. “Even after multiple minutes-long ‘training sessions’ designed to get the computer to accurately recognize the player’s voice… questions and commands will need to be stated and restated multiple times to get [the bots] to respond properly.” He ultimately concluded that the game was much more efficient to play with the microphone muted, typing questions and commands on the keyboard instead.

Though Bot Colony isn’t quite the success story North Side had hoped for, Tolentino closed by acknowledging the “undeniably attractive… prospect of being able to simply ‘talk’ to a machine in the way one talks normally, and have it respond in a (somewhat) intelligent manner”.

It’s this attraction, this magical moment, that keeps developers chugging along in search of a better way to bring speech technology to video games.

Seaman

In the late 1990s, Leonard Nimoy (best known for playing Spock in various incarnations of Star Trek) provided voice-over narration for what Matt Vella of Time Magazine called “one of the strangest, most wonderful experiments in video game history: Seaman”. Released for Sega’s Dreamcast console, the Japanese game put players in charge of a virtual pet, feeding, nurturing and guiding its evolution from sea to land. Since the game first came out in 1999, its speech recognition was understandably a little lackluster (players spoke to the creature through a microphone accessory plugged into their controller), which made for tedious gameplay and controls. Even so, it was lauded for its creativity and even received an Excellence Award for Interactive Art at the Japan Media Arts Festival.

Perhaps it was the wonder of Nimoy’s voice that elevated the game’s success, or perhaps it was the wonder of having a fish with a man’s face respond to your questions with witty replies of its own. Either way, players were hooked on being able not just to control their characters, but to communicate with them.

The Rise: Speech Recognition Meets New Technology

Speech recognition isn’t the only “new” technology entering the spotlight in the gaming industry. Classic controllers made way for motion-sensing input devices such as the Xbox 360’s Kinect, which have since made way for full-blown virtual reality such as the HTC Vive. Many of these same early adopters have also jumped on board with speech recognition.

Mass Effect 3

Back in 2012, Microsoft was struggling to convince the gaming world that motion-sensing controllers would be the next big thing, and that they would improve a player’s gaming experience rather than hamper it. Then came Mass Effect 3, one of the most anticipated games of the year and a Kinect title to boot. Aimed at making gameplay as easy and intuitive to pick up as possible, Mass Effect 3’s Kinect integration was drop-dead simple: it was voice-controlled. Players shout commands to their squad mates, and the squad follows orders. Simply tell your character to switch to your weapon of choice, open a door or activate a skill, and they will.

Players can also use the Kinect microphone in dialogue sequences. In traditional RPGs, you use your controller buttons to pick phrases from the game’s dialogue wheel to direct the flow of the protagonist’s responses. With speech recognition added, players can do this with their voice instead, either reading the phrase out loud or simply telling their character to be a jerk to steer toward the more jerk-like dialogue option.

In games such as Mass Effect 3, speech recognition technology radically simplifies the game’s controls while simultaneously allowing more complex gameplay action. It’s more immersive – not only because your yelling actually results in in-game action, but also because you never have to break the cinematic experience to pause your gameplay and jump into an abstract menu screen.

Pairing Kinect with speech recognition technology means that you’ll not only find players jumping in front of their screens acting the movements out; you’ll also find them feverishly yelling commands at their squad mates as though they were on the battlegrounds themselves.

Oculus Rift

Oculus Rift debuted a voice-controlled search feature early in 2017, appropriately dubbed “Oculus Voice”. The Oculus Voice overlay lets users speak commands to navigate the Oculus Home menu system, a previously onerous task that forced them to scroll through dozens of apps, one page at a time. With the overlay, the process is more intuitive and more efficient. It’s a small but meaningful improvement to the Oculus user experience, thanks to speech recognition technology.

Nevermind

At its core, Flying Mollusk’s Nevermind is an adventure-horror game in which you dive into the minds of psychological trauma patients for whom traditional treatment methods have proven ineffective. In-game, you are a “Neuroprober”: a physician who, using cutting-edge technology, can enter the surreal subconscious of these patients, solve the puzzles of their minds and uncover the root of their trauma.

Speaking of cutting-edge technology, the Windows and Mac versions of Nevermind use biofeedback technology to detect the player’s feelings of stress and excitement while playing. “When you start to become scared or anxious, the game will dynamically respond to those feelings, which in turn directly affects gameplay,” explains the game’s website. While the biofeedback functionality is not essential to play Nevermind, if you have an Apple Watch, a Garmin, or one of a handful of other heart rate monitors and cameras at your disposal, you can plug in and experience the tech for yourself.

If this all sounds familiar, it might be because the wildly popular anthology series Black Mirror recently released “Playtest,” a terrifying episode exploring biofeedback technology and virtual reality. Though Nevermind was originally developed to help players “become more mindful of their feelings of stress and anxiety levels and help them practice managing those feelings,” the dystopian implications of such technology are clearly at the forefront of everyone’s minds.

Nevermind’s biofeedback technology, first-person viewpoint and, you guessed it, voice-activated controls take it to a whole new level of immersive, smart gameplay. Flying Mollusk even tackled the challenge of bringing Nevermind to an international audience: Globalme helped conduct video game localization and localization testing for Nevermind into ten languages, including German, Spanish, Japanese, Korean, Russian, and Simplified and Traditional Chinese.

Supported by Intel’s RealSense speech recognition technology and localized by the one and only Globalme, Nevermind’s voice control is a natural addition to an already pimped-out, technologically forward video game.

Voice-Activation Meets the App Store

In a recent study by VoiceLabs, 30% of respondents noted games and entertainment as their primary reason for investing in an Amazon Echo or Google Home. And, according to eMarketer, 35.6 million Americans used a voice-assistant device like Amazon Echo at least once a month in 2017, while 60.5 million Americans used some kind of virtual voice assistant such as Siri in 2017.

Alexa Skills

Let’s take a closer look at one of the major game-changers for speech tech: Amazon. Besides creating Alexa, one of today’s best-known smart speech recognition systems, Amazon invests in voice-activated game development through accelerator programs such as Techstars x Amazon’s Alexa Accelerator, which accepted its first batch of incubator companies in mid-2017. But what has really propelled Alexa forward as a bona fide platform (not just the intelligent software behind a few connected devices) is the Alexa Skills Kit (ASK). ASK allows third-party developers to create apps, called skills, that tap into the power of Alexa without ever needing native support. In other words, video game designers can put the speech data Amazon has already collected and built into Alexa to work for them, rather than collecting their own.
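To illustrate what “tapping into Alexa” means for a game developer, here is a simplified sketch of a skill’s request handler written against the raw JSON rather than the official ASK SDK. By the time a request reaches the skill, Alexa has already turned the player’s speech into a request type and an intent name; the skill only decides what to say back. The custom intent `InspectSceneIntent` and all spoken text are hypothetical, and a production skill would also verify request signatures and manage session state.

```python
from typing import Any, Dict


def build_response(text: str, end_session: bool = False) -> Dict[str, Any]:
    """Wrap spoken text in the JSON envelope an Alexa skill returns."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


def handle_request(event: Dict[str, Any]) -> Dict[str, Any]:
    """Route an incoming skill request.

    Speech recognition has already happened on Amazon's side: the skill sees
    a request type and, for intents, the name of the intent the player matched.
    """
    request = event["request"]
    if request["type"] == "LaunchRequest":
        # Sent when the player says "Alexa, open <skill name>".
        return build_response("Welcome, detective. Where would you like to begin?")
    if request["type"] == "SessionEndedRequest":
        # The session is already over, so nothing is spoken.
        return {"version": "1.0", "response": {}}
    if request["type"] == "IntentRequest":
        intent_name = request["intent"]["name"]
        if intent_name == "InspectSceneIntent":  # hypothetical custom intent
            return build_response("You search the alley and find a torn playing card.")
        if intent_name == "AMAZON.StopIntent":  # built-in stop intent
            return build_response("Goodbye.", end_session=True)
    return build_response("Sorry, I didn't catch that. Try again?")
```

The design point is the division of labor: the hard, data-hungry part (recognizing what was said, in whatever accent) stays with Alexa, while the game logic is an ordinary function from structured requests to responses.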

Alexa Skills: The Wayne Investigation

The Wayne Investigation is a skill developed by Warner Bros. to promote Batman v Superman: Dawn of Justice. Combining speech recognition with produced audio assets (namely, compelling music and sound effects), The Wayne Investigation plays almost like an old-time radio show with a modern technological twist. “Alexa, open The Wayne Investigation” are the five simple words you need to utter to get started.

The game follows a “decision tree” format, explains Alexa Blogs writer Emily Roberts. “From three starting actions, users can make up to 37 decisions, each taking the user down paths that lead the player to new and iconic Gotham characters and locations before completing the game.” During its first week, the game saw seven times more engagement (per weekly average) than all other skills combined, earning the top spot for both total time spent with the skill and average time spent per user.
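A decision-tree game like this can be modeled as a set of story nodes, each holding narration plus a map from spoken choices to the next node. The sketch below is a hypothetical three-node fragment, not the actual content or implementation of The Wayne Investigation; it assumes the player’s choice has already been recognized as text (for example, by Alexa).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoryNode:
    """One point in a branching audio story: narration plus the choices offered there."""
    narration: str
    # Maps a recognized spoken choice to the id of the node it leads to.
    choices: Dict[str, str] = field(default_factory=dict)


# Hypothetical three-node fragment; leaves have no choices and end the story.
STORY: Dict[str, StoryNode] = {
    "start": StoryNode(
        "It is night in Gotham. Visit the crime scene or the police station?",
        {"crime scene": "alley", "police station": "station"},
    ),
    "alley": StoryNode("Rain washes over the alley. The case continues elsewhere."),
    "station": StoryNode("The station is quiet. The case continues elsewhere."),
}


def advance(node_id: str, choice: str) -> str:
    """Follow the branch for `choice`, or stay put if it isn't a valid option here."""
    return STORY[node_id].choices.get(choice.lower().strip(), node_id)


def play(choices: List[str]) -> List[str]:
    """Replay a sequence of choices from the start and collect the narration heard."""
    node_id = "start"
    heard = [STORY[node_id].narration]
    for choice in choices:
        next_id = advance(node_id, choice)
        if next_id != node_id:
            heard.append(STORY[next_id].narration)
        node_id = next_id
    return heard
```

Each root-to-leaf path is one possible playthrough; a story with three starting actions and up to 37 decisions is the same structure with many more nodes. In a live skill, an unrecognized choice would trigger a reprompt rather than silently staying on the current node.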

Other game skills have followed, such as RuneScape Quests Skill: One Piercing Note, The Baker Street Experience, The Magic Door and many more.

Simplifying Voice: Chicken Scream

Clearly, a host of games already use speech technology, and no doubt more are in development. Some are built on much more complicated technologies, while others are more stripped down and simplified. We recently discovered one of these simplified voice-activated apps for ourselves: Chicken Scream.

What sets Chicken Scream apart from other games is that your voice is the only way to guide your character across dangerous moving landmasses, bridges that collapse after you cross them and floating spiked balls of doom. Stay silent to keep the chicken still, speak to move the chicken forward, and squawk loudly to make it jump.

The unique controls offer a nice change of pace from other games you find in the app store, and the removal of “actual” speech recognition (that is, the game doesn’t need to understand and interpret certain words or phrases) means the game can be used by anyone, anywhere around the world; there aren’t any language limitations or barriers for localization here. In fact, we even tested the app using chicken sounds from different languages around the world.
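The control scheme described above needs no word recognition at all, only a loudness measurement. Here is a minimal sketch, assuming the game maps the root-mean-square (RMS) level of each short microphone frame to an action; we don’t know Chicken Scream’s actual implementation, and the two threshold values are hypothetical (a real game would calibrate them per device).

```python
import math
from typing import List

# Hypothetical loudness thresholds, as RMS of samples normalized to [-1.0, 1.0].
SPEAK_THRESHOLD = 0.05
SQUAWK_THRESHOLD = 0.3


def rms(frame: List[float]) -> float:
    """Root-mean-square amplitude of one short audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def action_for(frame: List[float]) -> str:
    """Map loudness to a chicken action; no words are recognized, only volume."""
    level = rms(frame)
    if level >= SQUAWK_THRESHOLD:
        return "jump"   # loud squawk
    if level >= SPEAK_THRESHOLD:
        return "walk"   # ordinary speech
    return "stand"      # silence
```

Because the only input is a number, a French “cot-cot” and an English “cluck” produce the same behavior at the same volume, which is exactly why this style of game sidesteps localization.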

This app, and others like it, are less intimidating examples of voice-activated technology.

A New Era

Video games have always been concerned with blurring the lines between art and real life. Photorealistic 4K graphics, the dissolution of levels into vast open worlds, virtual reality placing players in the body of another person or character: the implicit goal of every technological advancement is to create something immersive and indistinguishable from real life. However, aside from a few adventurous forays into speech recognition, the technology has gone largely untapped in video games. Instead, we still favor the classic joystick/button and keyboard/mouse combinations to communicate with and control our characters on-screen.

Games using speech recognition, though praised for their innovation, have yet to truly succeed in their mission; none has moved past the novelty stage. The rise of entities and systems focused on speech data collection, processing and recognition, such as Nuance, Amazon’s Alexa, Google Assistant, Apple’s Siri and Microsoft’s Cortana, along with the growth of voice-activated smartphone games, might just bring speech recognition into the mainstream.

Just think: if hugely popular video game developers such as Bandai, Blizzard, EA or Nintendo wanted to jump on the speech recognition bandwagon with minimal effort (that is, without conducting voice data collection themselves), partnering with these dedicated voice-tech systems could be a happy medium. And if such partnerships do start to form, the future of speech recognition in video games could be remarkably accurate, localized and ubiquitous.

We’re already starting to hear about highly-rated voice-activated games entering the atmosphere – perhaps the era of the joystick is drawing to a close.

What do you think?
