
How Baidu's Deep Speech 2 Is Winning The Speech Recognition Game

Joe Milazzo | October 18, 2016

Baidu is described as “the Google of China” so often that it has become a cliché. However, Baidu may have actually surpassed Google in the realm of speech-recognition technology. If you’ve ever asked your smartphone or PC’s “virtual assistant” a question, you’ve interacted with speech recognition (or intelligent voice) technology. Baidu’s nurturing of exceptional talent has certainly played a role in the company being discussed as a potential industry “game-changer”: Chief Scientist Andrew Ng created the algorithms that still power the Android OS voice search functionality. He also helped to found the online learning platform Coursera, and currently splits his time between Baidu’s Beijing campus and Stanford University. Baidu’s investment in its Deep Learning Institute has also paid significant dividends, and the company has made great strides in truly usable (rather than merely entertaining) augmented reality (AR).

Speech-recognition technology is commonly associated in the West with luxury, as evidenced by its close association with the Apple iOS and upscale lifestyle products such as smart appliances and Amazon’s Echo. Nevertheless, as with almost all innovation, necessity plays a role. How is it that China is becoming a global leader in creating advanced technology that actually understands and responds to the natural (and complex) ways we communicate with each other? To answer this question, we have to gain an appreciation for what makes both written and spoken Chinese so unique among the world’s languages.

How do you talk to a computer in Chinese?

English-speaking smartphone users know how frustrating typing on tiny virtual keyboards can be. Even composing a simple query such as “Find the closest gas station” can be a headache. Imagine trying to input the same request in Mandarin. Instead of 26 letters, 10 numerals and a discrete collection of common symbols, you have to be familiar with nearly 3,000 characters to consider yourself literate. While Mandarin can be represented using the Roman alphabet, or transliterated, via Pinyin, and while many computer interfaces will automatically convert Pinyin input into on-screen Mandarin, this method requires a secondary level of fluency. Touch screens hold out the promise of being able to input characters stroke by stroke, but, with mobile devices, whatever gains users may experience in accuracy are offset by non-trivial losses in convenience and efficiency.

Related: How to Type in Chinese

Even in Pinyin, Chinese syntax differs significantly from English, the lingua franca of the Internet. For example, verbs in Mandarin are not conjugated, and Mandarin speakers do not differentiate between past, present and future actions by using different words. Finally, as Vanessa Wong noted in 2012 on the eve of Apple’s introduction of Siri to the Chinese market, intonation is everything, and presents distinct challenges to any machine attempting to interpret the subtleties of the human voice. “The words for mother (妈 mā), scold (骂 mà), and horse (马 mǎ), for example, all sound like ‘ma’ but with different intonation. Developing a software that can understand the sentence ‘Mother scolds the horse’ (妈妈骂马 māmā mà mǎ) is no easy task.”
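To make the tone problem concrete, here is a minimal Python sketch. It uses only the three example words quoted above; the `TONED_SYLLABLES` table and both function names are hypothetical and exist purely for illustration. The sketch strips the tone marks from Pinyin and shows that, once the tone is lost, a single syllable points to several unrelated characters.

```python
import unicodedata

# Illustrative table built from the three words in the quoted example only.
TONED_SYLLABLES = {"mā": "妈", "mà": "骂", "mǎ": "马"}


def strip_tone_marks(pinyin):
    """Remove tone diacritics so that 'mā', 'mà' and 'mǎ' all become 'ma'."""
    # NFD splits each accented vowel into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", pinyin)
    # Combining marks have Unicode category "Mn"; drop them and keep the rest.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def candidates_for(toneless):
    """Return every character in the table whose syllable matches without tone."""
    return [
        char
        for syllable, char in TONED_SYLLABLES.items()
        if strip_tone_marks(syllable) == toneless
    ]


if __name__ == "__main__":
    print(strip_tone_marks("māmā mà mǎ"))  # the tones disappear
    print(candidates_for("ma"))            # three different words remain possible
```

A listener, human or machine, who hears only “ma” without its pitch contour has to choose among all of the candidates, which is why tone carries so much of the meaning.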

Deep learning and deep listening with Baidu’s Deep Speech 2

For all these reasons and more, Baidu’s Deep Speech 2 takes a different approach to speech recognition. Deep Speech 2 leverages the power of cloud computing and machine learning to create what computer scientists call a neural network. Think of a neural network as a computer simulation of an actual biological brain. These neural networks emphasize connection and communication over the computation of statistical probabilities based on established data. To put it simply, neural networks are machines that learn.

According to initial reports from MIT, the neural network behind Deep Speech 2 utilizes high-capacity “graphics processors” to mine Baidu’s massive collection of user-submitted voice data and so “runs seven times faster” than comparable systems. Deep Speech 2 scientist Jesse Engel further claims that Deep Speech 2’s capacity has allowed Baidu to “reduce the [program’s] word error rate by 40 percent.” Further, a recent experiment pitting Deep Speech 2 against data entry using the traditional QWERTY keyboard demonstrated that, in Mandarin, “speech was 2.8 times faster, with an error rate 63.4 percent lower than typing.”
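For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, insertions and deletions needed to turn a transcript into the correct sentence, divided by the number of words in the correct sentence. The Python sketch below shows that conventional calculation along with a relative-reduction helper. It is not Baidu’s evaluation code; the function names and the sample sentences and rates in the usage section are assumptions made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_rate, new_rate):
    """Fraction by which new_rate is lower than old_rate."""
    return (old_rate - new_rate) / old_rate


if __name__ == "__main__":
    # Hypothetical transcript with one wrong word out of five.
    print(word_error_rate("find the closest gas station",
                          "find the closest gas nation"))  # 0.2
    # Hypothetical rates: falling from 10% to 6% is a 40 percent relative cut.
    print(round(relative_reduction(0.10, 0.06), 2))
```

Note that a “40 percent” reduction in this sense is relative: it describes how much of the previous error was eliminated, not the error rate that remains.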

However, Deep Speech 2 is not a native Mandarin speaker. Adam Coates, who directs the US-based AI Lab where Baidu’s scientists did the bulk of their development work, has revealed that Deep Speech 2 was trained first in English and only gradually introduced to Mandarin. In Coates’s words, “because [Deep Speech 2] is all deep learning-based it mostly depends on data, so we were able to pretty quickly replace it with Mandarin data and train up a very strong Mandarin engine.” In other words, Baidu’s scientists first ensured that Deep Speech 2 had mastered a certain number of essential linguistic concepts before it learned any one specific language. And, similar to the neural network that learned to recognize a cat by looking at millions of images of cats, Deep Speech 2 learns by listening, building up both its vocabulary and a rare sensitivity to the contextual qualities of speech. With such a strong foundation, Deep Speech 2 may ultimately help scientists to realize the dream of a universal translation engine: a machine that can recognize any language almost instantaneously and simultaneously translate it into any other language.

What’s driving the growing demand for intelligent voice technology?

Deep Speech 2 is a major development in China due to its potential appeal to a huge population that speaks many different regional dialects and is characterized by a diverse array of accents. China boasts almost 1 billion smartphone users, and those users are generating an ever-growing demand for intelligent voice technology. Industry reports estimate that the market value of Chinese firms with a share of the intelligent voice sector increased by over 50% in the past year. That 50% represents over 350 million USD.

Google is clearly aware of the giant speech recognition strides being made in China. It recently overhauled its voice search platform, and has rebranded the various services users are accustomed to activating with “OK, Google” as Google Assistant. Google CEO Sundar Pichai has explained these changes by referring to the company’s push to put “AI first.” After the lukewarm reception consumers gave its now-discontinued Glass accessory, Google has ground to make up in integrating software and hardware in ways that Baidu and fellow Chinese tech leaders Alibaba and Tencent already do.

Related: Five Chinese Tech Companies And Their Global Equivalents

So Google’s first investment in China since 2010 has been with Mobvoi, developers of Chumen Wenwen, a voice search technology that integrates with WeChat, China’s most popular messaging app. Google’s hope is that Mobvoi’s expertise will help to jumpstart its efforts to rival Apple and its successful line of smartwatches and other wearables, all of which operate almost exclusively via voice commands. (via Bloomberg.)

The future of speech recognition: bright, and very competitive 

Baidu is clearly not content to limit itself to any single market, however large. On October 3, the company announced the availability of its free TalkType app for Android devices. Exclusive to the United States and designed for English speakers, TalkType uses Deep Speech 2 technology to replace a phone’s or tablet’s default keyboard entry for web searches, messaging and any other functionalities with simple dictation. As Baidu’s Bijit Halder explains: “TalkType is the first full-function Android keyboard that is ‘voice first,’ not ‘voice also.’ Unlike conventional keyboard designs, where voice is targeted for occasional use and delegated to a small icon, TalkType is designed for voice as the primary input mode.”

In the future, it should come as no surprise if Baidu and other Chinese intelligent voice developers continue their international expansion. While their innovations have been driven by the particular needs of Mandarin speakers, Baidu’s discoveries have wider applications. Regardless of the language in which it is first expressed, a good idea is a good idea. And, when it comes to high tech and the disruption that more and more rules the marketplace, the best ideas almost always win out.

Joe Milazzo

Joe Milazzo is a writer, editor, educator, designer and cultural observer. He holds an MLS from the University of North Texas and an MFA from the California Institute of the Arts. He has a long-standing interest in Chinese art and literature. You can find out more about his work by visiting http://www.joemilazzo.net/. Joe lives and works in Dallas, TX.
