Last year, our speech team at ASUS attended the Formosa Speech Recognition Challenge workshop (FSW), the first competition in Taiwan focused on Mandarin speech recognition. A total of 26 teams, including 8 industrial, 14 academic, and 5 individual participants, took part in building the first benchmark for Taiwanese ASR (Automatic Speech Recognition) systems.

We are excited to say that we stood out from the competition and won the Best Industrial System prize. Our commercial LVCSR (Large Vocabulary Continuous Speech Recognition) system reached a CER (Character Error Rate) of 8.1%, better than iFlyTek (18.8%) and Google (20.6%). Our system also achieved a 13.8% relative error rate reduction over the next-best competing system.
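
For context, CER is the character-level edit distance between the system hypothesis and the reference transcript, divided by the reference length. A minimal Python sketch of the metric (the function names and example strings are ours, not from the official scoring tool):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over characters.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if characters match)
            )
    return dp[-1]

def cer(ref, hyp):
    # Character Error Rate: edits needed to turn hyp into ref,
    # normalized by the reference length.
    return edit_distance(ref, hyp) / len(ref)

print(cer("今天天氣很好", "今天天汽很好"))  # 1 substitution / 6 chars ≈ 0.167
```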

What is the focus of FSW?

Although speech recognition is well known, most of the speech recognition products people commonly use are still cell phone voice assistants, smart speakers, or voice search. Considering the usage scenarios of these products, it is not hard to see that they share similar sentence patterns or grammars for telling the machine to perform a simple action, for example: “Turn on the radio” or “How’s the weather today?”

Many companies claim that they can build an LVCSR system that handles these scenarios well. They can make this claim because such data is easy to collect and the domain constraints make the task relatively easy. However, these tasks cover only a narrow slice of real language behavior. FSW instead provides broadcast speech as the evaluation data. Unlike command-based speech, broadcast speech is much closer to how people actually talk, and it forces a system to handle more complex factors such as multiple speakers, noisy environments, and unbounded topic domains.

Why can we do this?

The key reason is that we have a strong ability to adjust and adapt both the acoustic model and the language model to any domain we want to target.

Unseen speakers are an old problem in speech recognition. Usually we make the acoustic model robust by collecting large amounts of data from the Internet and letting the DNN model learn from it. In this competition, we also tried our feature-based speaker data augmentation method, which produced a significant improvement on the FSW final test, especially for children’s speech.
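
The post does not spell out the exact augmentation, but a common feature-based approach in this spirit is to warp the frequency axis of the features, as in vocal tract length perturbation (VTLP), so that one speaker’s data mimics speakers with different vocal tract lengths, children included. A rough sketch, with the warp range as our own assumption:

```python
import numpy as np

def vtlp_warp(spectrogram, alpha=None, rng=np.random):
    # spectrogram: (frames, freq_bins) magnitude features.
    # alpha: warp factor; values below 1 stretch toward higher
    # frequencies (more child-like), above 1 compress them.
    # The (0.9, 1.1) range is an illustrative assumption.
    if alpha is None:
        alpha = rng.uniform(0.9, 1.1)
    n_bins = spectrogram.shape[1]
    src = np.clip(np.arange(n_bins) * alpha, 0, n_bins - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = src - lo
    # Linear interpolation along the frequency axis for every frame.
    return (1 - frac) * spectrogram[:, lo] + frac * spectrogram[:, hi]

# Drawing a fresh warp factor per utterance each epoch multiplies the
# effective number of "speakers" the model sees during training.
feats = np.abs(np.random.randn(200, 257)).astype(np.float32)  # fake utterance
augmented = vtlp_warp(feats)
```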

On the other hand, we built a training flow that collects data from sources such as YouTube and TV news to avoid overfitting. This unannotated data is passed through an SAD (Speech Activity Detection) module to discard useless segments, and is then used in our semi-supervised training with a low weighting factor that balances the influence of human-annotated and machine-annotated data.
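
In outline, such a pipeline filters with SAD, self-labels the survivors, and down-weights the machine transcripts in the loss. The sketch below is ours, not the team’s actual code: `sad` and `recognizer` are hypothetical objects, and the confidence threshold and loss factor are illustrative assumptions.

```python
def build_semi_supervised_set(segments, sad, recognizer, min_confidence=0.9):
    # `sad` and `recognizer` are hypothetical stand-ins for real modules.
    # Keep only segments SAD marks as speech, transcribe them with the
    # current model, and keep hypotheses the model is confident in.
    pseudo_labeled = []
    for seg in segments:
        if not sad.is_speech(seg):
            continue  # drop music, silence, and other useless audio
        hyp, confidence = recognizer.transcribe(seg)
        if confidence >= min_confidence:
            pseudo_labeled.append((seg, hyp))
    return pseudo_labeled

def combined_loss(loss_human, loss_machine, factor=0.3):
    # A low factor keeps noisy machine transcripts from dominating the
    # human-annotated data; 0.3 is an assumption, not the team's value.
    return loss_human + factor * loss_machine
```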

“Carefully consider every situation we might meet.”

Compared with languages such as English, Mandarin is a more difficult target for speech recognition. Mandarin text has no spaces, and characters can combine almost arbitrarily into words with different meanings, so the text must first be split into words, a step called word segmentation. To handle this, the lexicon plays a key role in our language model training.
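
A minimal illustration of why the lexicon matters: a segmenter can only find words it knows. This forward-maximum-matching sketch uses a toy lexicon of our own, not the production segmenter:

```python
def segment(text, lexicon, max_word_len=4):
    # Forward maximum matching: at each position, take the longest
    # lexicon entry that matches, falling back to a single character.
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"貿易戰", "貿易", "今天", "開打"}
print(segment("今天貿易戰開打", lexicon))  # ['今天', '貿易戰', '開打']
```

If “貿易戰” were missing from the lexicon, the same input would fall apart into smaller pieces, which is exactly the failure mode a rich, up-to-date lexicon prevents.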

The ASUS speech team has worked on Mandarin speech recognition for almost 10 years. Over these years, we have collected and cleaned up our lexicon day by day. Our LVCSR system currently covers almost 230 thousand words, and we built a daily hot-word extraction system to collect the newest words, such as ‘貿易戰’ (trade war) and ‘秋行軍蟲’ (fall armyworm), from social media.
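
The post does not describe the extraction method; one common signal is a sudden frequency spike of a candidate string in recent posts. A toy sketch under that assumption, with the n-gram lengths and spike ratio as illustrative choices:

```python
from collections import Counter

def hot_words(today_texts, history_counts, lexicon, spike=5.0):
    # Count 2- to 4-character n-grams in today's posts and flag those
    # that are frequent today, rare historically, and not yet in the
    # lexicon. The spike ratio of 5.0 is an illustrative assumption.
    today = Counter()
    for text in today_texts:
        for n in range(2, 5):
            for i in range(len(text) - n + 1):
                today[text[i:i + n]] += 1
    return [
        gram for gram, count in today.items()
        if gram not in lexicon
        and count >= spike * (history_counts.get(gram, 0) + 1)
    ]
```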

“As long as you keep working on it, you will be successful in the end.”

Although FSW is an offline competition, so participants do not need to worry about decoding time, we still built our LVCSR system under a low RTF (Real Time Factor) constraint. Even knowing that a larger model would perform better at the cost of decoding speed, we tried to strike a balance between user experience and the competition criterion, and I am glad to say we did.
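
RTF is simply the wall-clock processing time divided by the audio duration; below 1.0 means the system transcribes faster than real time, which is what a streaming service needs. A quick sketch (`decode_fn` is a generic placeholder, not part of our stack):

```python
import time

def real_time_factor(decode_fn, audio, audio_seconds):
    # RTF = wall-clock decoding time / audio duration.
    # RTF < 1.0 means faster than real time.
    start = time.perf_counter()
    decode_fn(audio)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```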

“The AICS team always thinks of our users.”

After this competition, we launched our online service on the AICS API platform. The ASR service supports streaming decoding for real-time speech transcription. Applications such as voice commands and controls, speech-to-text for social chat voice messages, and human-machine conversations can be developed quickly and easily with this API. Please contact us if you can’t wait to give it a try.
