
Speech Recognition Challenges in Lo Kar Lo Baat

When HUL came to us with the idea of a phone app that connects migrant workers for free, we took it on as a challenge because it was for a good cause.

We are proud of the system that resulted. When we ran a sample set of voices through different recognizers, we found that our system outperformed even the Google speech recognizer. Our success rate at recognizing complete phone numbers (all 10 digits) in Hindi and English was almost 2 times that of Google.
(Note: we ran the samples through Google's speech cloud to get their recognition rate.)
Granted, ours was a specialized system and Google's was generic, but hey, we still feel proud :)

Challenges in Lo Kar Lo Baat:

Generally, the accuracy of speech recognition depends on many factors, such as:

1) channel property, speech sampling rate, speech coding

2) deviation in pattern due to effect of dialects, speaking rate and speaking style

3) effect of background

In Lo Kar Lo Baat we faced many challenges, such as:

1) To improve the user experience and reduce human errors (e.g. typos) and machine errors (e.g. network problems, recording errors, inaccurate transmission of DTMF input, etc.), we proposed an automatic language recognizer that detects the spoken language and routes the call to the language-specific speech recognizer.
This too was challenging because of accented speech, speaking style, etc. The language recognizer (a Hindi and English digit discriminator) is susceptible to the problems mentioned above and works with an accuracy of 96 to 98%.

2) When someone evaluates a speech recognition system by word accuracy, it is easy to report figures above 90% or so. The main challenge lies in sentence recognition: for a positive recognition, a sentence has to be 100% correct, meaning every word in it must be recognized correctly. That is not an easy case. Our speech recognition system was tuned for sentence recognition. This was necessary because a phone number has 10 digits (one sentence with 10 words). Systems typically reach about 90% accuracy, but at the word level, so on average we get 9 of the 10 digits right. That is good for a speech recognizer but useless for a phone number recognizer.

3) The next challenges in speech recognition are accented pronunciation and the effect of dialect. India has nearly 27 major dialects or languages. People from each dialect group speak English or Hindi, but with different accents. Beyond that, our system has been used in many remote areas by people with little exposure to technology, and their accents are quite different from the major dialects and accents mentioned above. Our system is tuned to work in this kind of environment to some extent.

4) Our speech recognition model is trained and fine-tuned for the telephone environment, so it is susceptible to major problems such as channel noise, clipping, and the effect of noise on speech.

5) Another major issue we faced is speaking rate. When someone's speaking rate is very high, it is very hard to distinguish between speech patterns. Our system is fine-tuned to handle these kinds of patterns to some extent.

6) We used a speech filter that handles unexpected acoustic patterns and helps the speech recognizer improve its accuracy. This filter is 99% accurate. It tells the speech recognizer what to recognize and what not to.
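
The routing idea in point 1 can be sketched roughly as follows. This is only an illustration, not our production code: `route_by_language`, the `detect_language` callable, the `recognizers` mapping, and the language codes "hi" and "en" are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, Tuple

# A recognizer takes raw audio bytes and returns the recognized digit string.
Recognizer = Callable[[bytes], str]


def route_by_language(
    audio: bytes,
    detect_language: Callable[[bytes], str],
    recognizers: Dict[str, Recognizer],
    fallback: str = "hi",
) -> Tuple[str, str]:
    """Detect the language, then hand the audio to that language's recognizer.

    All names here are hypothetical. If the detector returns a language we
    have no recognizer for, the fallback language is used instead.
    """
    language = detect_language(audio)
    if language not in recognizers:
        language = fallback
    return language, recognizers[language](audio)
```

The caller no longer has to ask the user to pick a language with a key press, which is where the DTMF and typing errors mentioned in point 1 would otherwise creep in.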
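
To make the gap between word accuracy and sentence accuracy in point 2 concrete, here is a small back-of-the-envelope calculation. It assumes, purely for illustration, that each digit is recognized independently with the same accuracy; real errors are correlated, so treat the numbers as a rough guide. The function names are made up for this example.

```python
def sentence_accuracy(word_accuracy: float, num_words: int) -> float:
    """Chance that every word is right, assuming independent word errors."""
    return word_accuracy ** num_words


def required_word_accuracy(target_sentence_accuracy: float, num_words: int) -> float:
    """Per-word accuracy needed to reach a target whole-sentence accuracy."""
    return target_sentence_accuracy ** (1.0 / num_words)


if __name__ == "__main__":
    # A phone number is one "sentence" of 10 digit "words".
    for wa in (0.90, 0.95, 0.99):
        print(f"word accuracy {wa:.2f} -> number accuracy {sentence_accuracy(wa, 10):.3f}")
    print(f"word accuracy needed for 90% of numbers: {required_word_accuracy(0.90, 10):.4f}")
```

Under this assumption, 90% word accuracy gets only about 35% of 10-digit numbers completely right, and getting 90% of numbers right needs roughly 99% accuracy per digit.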
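
The role of the filter in point 6 can be pictured as a gate in front of the recognizer. The sketch below shows only the gating structure; the mean-energy rule is a simple stand-in chosen for the example and is not the filter we actually used, and `filter_segments` and `mean_energy_accept` are hypothetical names.

```python
from typing import Callable, List, Sequence

# A segment is a short run of audio samples, e.g. floats in [-1.0, 1.0].
Segment = Sequence[float]


def mean_energy_accept(threshold: float) -> Callable[[Segment], bool]:
    """Build a stand-in acceptance rule: keep segments with enough energy."""
    def accept(segment: Segment) -> bool:
        if not segment:
            return False
        energy = sum(x * x for x in segment) / len(segment)
        return energy >= threshold
    return accept


def filter_segments(
    segments: List[Segment], accept: Callable[[Segment], bool]
) -> List[Segment]:
    """Pass on only the segments the recognizer should try to recognize."""
    return [segment for segment in segments if accept(segment)]
```

Any acceptance rule with the same signature can be swapped in, so the recognizer only ever sees the segments the filter lets through.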
