Robot language "genius" emerges

How does an intelligent robot communicate with people? Simple instruction input can no longer keep up with a fast-paced society; voice would be far more natural. However, most intelligent robots cannot fully recognize and respond to spoken language. Some can only recognize Putonghua, while others can only respond to one speaker at a time. In a multi-user conversation or against a noisy background, the intelligent robot becomes "dizzy" and loses its way.

Solving this problem is not easy. Yet at the fourth international multi-channel speech separation and recognition challenge, held in San Francisco, USA, a Chinese team completed the speech separation and English recognition tasks under six-microphone, dual-microphone, and single-microphone conditions and won the championship. The central problem addressed by this technology is speech recognition in noisy environments. The winning team came from iFLYTEK, which has already applied the technology in its human-computer interaction solution, AIUI.

Talk to the robot

Voice recognition technology is already woven into daily life, from simple phone commands to smart home control, and it makes life easier. But this is not the ultimate goal of speech recognition. If, in a noisy environment, several people give orders to the same intelligent robot, whose command should it heed, and how should it respond? The international multi-channel speech separation and recognition competition addresses exactly this type of problem.

To understand multi-channel speech recognition, one must first understand speech recognition itself. Speech recognition refers to the conversion of speech into text, meaning that the machine can understand what people are saying. This carries two layers of meaning. The first is transcribing the user's speech, word by word and sentence by sentence, into text; the second is correctly understanding the request contained in the speech and responding appropriately. Speech recognition is an interdisciplinary field spanning linguistics, signal processing, pattern recognition, probability and information theory, the mechanisms of phonation and hearing, and artificial intelligence, and it is a frontier technology in Chinese information processing. The core problem it solves is how to convert audible sound information into text information.
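The two layers described above can be separated in code. Assuming transcription has already produced text, the sketch below illustrates only the second layer with a deliberately tiny keyword-based intent matcher; the keywords and canned replies are invented for illustration, and real systems use statistical language-understanding models instead.

```python
# Toy second-layer sketch: map already-transcribed text to a response.
# Keywords and replies are hypothetical examples, not a real product's logic.
INTENTS = {
    "light": "Turning on the light.",
    "music": "Playing music.",
    "weather": "Today is sunny.",  # canned answer, purely for illustration
}

def respond(transcript: str) -> str:
    """Very small keyword-based intent matcher for demonstration only."""
    words = transcript.lower().split()
    for keyword, reply in INTENTS.items():
        if keyword in words:
            return reply
    return "Sorry, I did not understand."

print(respond("please turn on the light"))
```

The first layer (speech to text) is the hard signal-processing problem the rest of this article discusses; the point here is only that recognition and understanding are distinct steps.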

Intelligent robots process human speech quite differently from how humans understand it. They first decompose continuous sentences into units such as words and phonemes, then read off the meaning according to semantic rules. If the speaker's voice is slurred or heavily accented, the robot cannot recognize it unless the relevant rules have been set in advance. Even the same person speaking formally and speaking casually sounds different to an intelligent robot. Ambient noise picked up during sound acquisition interferes further, which in turn raises the error rate of speech recognition. Multi-channel speech recognition collects the sound source through multiple microphones and uses microphone-array techniques to suppress noise, making recognition more accurate.
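One classic microphone-array technique of the kind alluded to above is delay-and-sum beamforming: each channel is time-aligned toward the talker and the channels are averaged, so the voice adds coherently while uncorrelated noise averages out. The following is a minimal numpy sketch under simplifying assumptions (known per-microphone delays, a sine wave standing in for the voice, white noise standing in for the environment); it is not any particular product's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)  # stand-in for the target voice

# Simulate a 4-microphone array: same voice with a per-mic arrival delay,
# plus independent noise at each microphone.
delays = [0.0, 0.001, 0.002, 0.003]  # assumed known, in seconds
mics = np.stack([
    np.roll(clean, int(d * fs)) + 0.5 * rng.standard_normal(fs)
    for d in delays
])

def delay_and_sum(signals, delays, fs):
    """Shift each channel back by its delay and average: the aligned voice
    adds coherently, the uncorrelated noise does not."""
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, delays):
        out += np.roll(sig, -int(round(d * fs)))
    return out / len(signals)

enhanced = delay_and_sum(mics, delays, fs)

def snr(est):
    noise = est - clean
    return 10 * np.log10(np.mean(clean**2) / np.mean(noise**2))

print(f"single mic: {snr(mics[0]):.1f} dB, beamformed: {snr(enhanced):.1f} dB")
```

With four microphones and independent noise, averaging cuts the noise power roughly by a factor of four, about a 6 dB gain; real arrays must also estimate the delays, which is its own research problem.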

In an interview with a journalist from the China Science Journal, Zhao Yanjun, R&D director of AIUI at iFLYTEK, said that AIUI's echo cancellation, confidence judgment, continuous speech decoding, and other technologies allow the user to interrupt the machine at any time. In far-field recognition, AIUI supports a recognition distance of 3 to 5 meters with a recognition rate of 90 percent. "AIUI also supports dialect recognition, full-duplex interaction, and automatic error correction. At the same time, it can effectively reject invalid input when users communicate with the machine," said Zhao Yanjun.
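Echo cancellation of the kind mentioned above is classically built on an adaptive filter that learns the loudspeaker-to-microphone echo path and subtracts the predicted echo. Below is a minimal sketch of the standard normalized LMS (NLMS) algorithm in numpy, with an invented toy echo path; it illustrates the general technique only and is not AIUI's actual implementation.

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=64, mu=0.5, eps=1e-8):
    """Normalized LMS: adaptively model the far-end -> microphone echo path
    and subtract the estimated echo from the microphone signal."""
    w = np.zeros(taps)              # adaptive filter coefficients
    x = np.zeros(taps)              # sliding window of recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = far_end[n]
        e = mic[n] - w @ x          # residual = near-end speech + leftover echo
        out[n] = e
        w += mu * e * x / (x @ x + eps)  # normalized gradient update
    return out

# Toy demo: the microphone hears only an echo of the far-end signal.
rng = np.random.default_rng(1)
far_end = rng.standard_normal(4000)
echo_path = np.array([0.5, -0.3, 0.1])        # hypothetical room response
mic = np.convolve(far_end, echo_path)[:4000]  # pure echo, no near-end talker
residual = nlms_echo_cancel(far_end, mic)
```

After the filter adapts, the residual echo power falls far below the original echo power, which is what lets a user "barge in" and be heard over the machine's own voice.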

AIUI is currently one of several human-computer interaction solutions under development worldwide. Since voice interaction is expected to become a mainstream configuration of future intelligent robots, commercial voice interaction platforms are also a research and development focus for many IT giants. For example, Microsoft's Speech API is an application programming interface (API) from Microsoft that includes a speech recognition (SR) engine and a speech synthesis (SS) engine. Built on the Windows platform, it can read out English, Chinese, Japanese, and other languages. Another giant, IBM, was among the earliest institutions to research speech recognition: in 1984, IBM's speech recognition system achieved a 95 percent recognition rate on a 5,000-word vocabulary.
