Is speech recognition a viable technology for business?

How do you like the idea of speaking English and having software translate it and read it out for you in Chinese in a matter of seconds? What about controlling your intelligent house with your voice, or generating transcripts from audio recordings instantly? This and much more is possible today. What can you do with it?

Before Microsoft revealed the incredible potential of its speech-to-speech technology, which allows a speaker to have their own spoken English translated into almost perfectly structured Mandarin Chinese (video below, from 7:30), speech recognition software hadn’t been much on the radar of the average person, or even the average entrepreneur.

If the demonstration hasn’t convinced you that it’s worth a deeper look, how about this: the speech recognition software market is forecast to grow at a compound annual growth rate (CAGR) of 40 percent until at least 2024. This translates (no pun intended) into a sixfold increase in market value, from about $250 million to just over $1.5 billion.
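For the curious, here is a quick back-of-the-envelope check of how those figures fit together. It is only our own sketch: the function names are made up for illustration, and because the base year of the forecast isn’t given here, the time span is simply whatever the two dollar figures and the 40 percent rate imply.

```python
import math


def years_to_reach(start: float, target: float, cagr: float) -> float:
    """Years of compound growth at rate `cagr` needed to go from start to target."""
    return math.log(target / start) / math.log(1 + cagr)


def project(start: float, cagr: float, years: float) -> float:
    """Value after `years` of compound growth at rate `cagr`."""
    return start * (1 + cagr) ** years


if __name__ == "__main__":
    # Figures quoted in the article, in millions of dollars.
    start, target, cagr = 250.0, 1500.0, 0.40
    print(f"Implied time span: {years_to_reach(start, target, cagr):.1f} years")
    for year in range(1, 7):
        print(f"After {year} year(s): ${project(start, cagr, year):,.0f} million")
```

In other words, at 40 percent a year a sixfold increase takes a little over five years, which gives a feel for how steep that growth rate is.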

We’re going to take a look at a few really interesting use cases of speech recognition software, including one of our own. But first, let’s go back to the basics and determine what exactly speech recognition is.

Speech recognition in custom web application development

First things first: speech recognition is a vastly different concept from voice recognition. The former refers to the ability of a machine to turn speech into text (and then, optionally, to process it further into another spoken form). Voice recognition, on the other hand, is based on the ability to recognize a person’s voiceprint. By comparing it with the voiceprints available to it, a machine can verify the identity of the speaker. While that is obviously fascinating and full of interesting use cases, we’re going to focus on speech recognition software…

…the modest beginnings of which reach as far back as the early 1950s and the research conducted at Bell Laboratories, a scientific development company currently owned by Nokia. Harvey Fletcher’s research and his unexpected cooperation with the popular conductor Leopold Stokowski greatly contributed to the popularization of stereophonic sound. But even a few decades later, in the 1980s, computers still struggled to correctly understand the spoken word, aside from a very limited set of hard-coded phrases.

It wasn’t until 2006 that speech recognition research got a serious boost. Professor Geoffrey Hinton took a new approach, replacing the commonly used statistical Gaussian mixture models with artificial, “brain-like” deep neural networks (DNN) as the basis for the speech recognition model. Scientists at Microsoft Research Redmond continued to train and improve the model for years, leading to very interesting discoveries. While the DNN approach didn’t eliminate errors in machine speech recognition, the nature of those errors changed compared with the original statistical approach. Unlike before, most of the errors no longer rendered the translations useless or incomprehensible. The improved speech-to-text technology encouraged the folks at Microsoft to experiment further with speech-to-speech. And this eventually led to the powerful speech recognition software you could see in the video above.

Speech recognition software – what can you do with it?

Let’s say you have an audio recording that you badly need to transcribe as soon as possible. How long would it take you to transcribe a 5-minute recording? How about 15 minutes, or 2 hours? It’s a tedious job, especially when proper software can do it for you. Speechmatics is a startup that offers a cloud-based speech recognition service. With minimal setup cost and a simple browser registration, you get access to a tool based on technologies that have only recently become available, even to the largest organizations.

Speech recognition software is applied in many fields. In education, it assists children and people with disabilities with writing. Some research suggests that allowing children with learning disabilities, including dyslexia, to use speech-to-text software to write their assignments greatly improves the quality of their work and their ability to express their thoughts, as they temporarily don’t have to worry about spelling errors. In the video game industry, developers have been experimenting with speech recognition for quite some time. The 2004 video game Lifeline owes much of its popularity to its fully voice-controlled interface, in which simple commands are used to interact with on-screen characters. Another remarkable example is the 2008 video game Tom Clancy’s EndWar, in which the player can issue voice commands to fellow soldiers. Although the game does not depend on voice control, its creators at Ubisoft claimed that it could be played using voice commands alone.

Voice control-enhanced home automation

Home automation is another field where speech recognition can be used very effectively. It is also one much closer to our hearts: the Digital Home Management System that Sensinum developed for the Sustainable Infrastructure cluster makes great use of it to provide additional functionality for people with disabilities. At its core, the system is a KNX-based web application with an intuitive interface. The premise was to develop a system that is unique in its ability to easily collect data that can then be used to improve it. Aside from that, it’s equipped with standard features such as controlling light sources, retrieving data from wind, temperature and movement sensors, and controlling heating or air conditioning systems. But it stands out by giving people with disabilities the same control through their voice. It uses a simple pattern-matching mechanism: the user speaks a command, and the command is compared with previously recorded patterns. This simple technology is a potential life changer for many people and can be easily combined with existing home automation systems.
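To give a rough idea of what comparing a spoken command with previously recorded patterns can look like, here is a minimal sketch. It is not the code of the system described above. We assume that an audio front end (not shown) has already turned each recording into a sequence of feature vectors, and we use dynamic time warping (DTW), a classic template-matching technique, as a stand-in for the comparison step. All names, the toy feature values and the threshold are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames being aligned
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def match_command(utterance, templates, max_distance):
    """Return the name of the closest recorded pattern, or None if nothing is close enough."""
    best_name, best_dist = None, inf
    for name, template in templates.items():
        dist = dtw_distance(utterance, template)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


# Toy one-dimensional "features"; a real system would use richer vectors per audio frame.
TEMPLATES = {
    "lights_on": [(0.0,), (1.0,), (2.0,), (1.0,)],
    "heating_off": [(3.0,), (3.0,), (0.0,), (0.0,)],
}

if __name__ == "__main__":
    # The same shape as "lights_on", but spoken more slowly.
    slow_lights_on = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,)]
    print(match_command(slow_lights_on, TEMPLATES, max_distance=1.0))
```

DTW is a convenient choice for a sketch like this because it tolerates the same command being spoken faster or slower, and the distance threshold lets the system ignore sounds that match none of the recorded patterns.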

Speech recognition software is just one of many examples of how a machine learning solution as simple as pattern recognition can be used to great effect. It’s popularly used to classify documents, for example in email clients, or to make content suggestions based on the videos or articles you have already watched or read. It’s also the basis of more advanced systems such as TensorFlow, Google’s machine learning platform (the improved successor to DistBelief, released as open source), which can be used to teach machines how to analyze the contents of images and text. At Google, it powers applications such as Google Photos and Google Search as well as the Street View service, in which a human face is detected based on a set of patterns so that a computer can blur it when a person gets into the field of vision of the Street View car.
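As an illustration of the document classification idea mentioned above, here is a tiny sketch of how an email client might sort messages by the words they contain. It uses a naive Bayes classifier with add-one smoothing, which is our choice of a simple, well-known technique and not a description of how any particular email client works. The training messages and labels are invented for the example.

```python
from collections import Counter
from math import log


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = log(count / total)  # prior: how common the label is
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only.
EXAMPLES = [
    ("win a free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "inbox"),
    ("project status report attached", "inbox"),
]

if __name__ == "__main__":
    model = train(EXAMPLES)
    print(classify("free prize offer", model))
    print(classify("monday status meeting", model))
```

The same principle, learning which patterns tend to go with which label, carries over from words in an email to features extracted from audio or images.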

As Sensinum is often employed for custom web application development that requires precise and advanced data processing, we often get to work on algorithms based on pattern recognition. Be it for speech recognition software or any other field, we love to enrich our web applications with such capabilities. Do you have an idea for a web app that could use some? Let us know – we would love to talk to you about it.

Sensinum is a Polish software house that provides top-shelf software development services to companies, marketing agencies and teams. Thanks to its extensive experience working as a subcontractor and cooperating on external projects, Sensinum brings the best results to the table. Our talented developers, who work on advanced projects on a daily basis, will be happy to work with you. Don’t hesitate to ask us anything: contact us and get a free consultation on your software idea.