Researchers propose a new, more efficient model for automatic speech recognition

The Phonetic-Semantic Pre-Training (PSP) framework uses “noise-aware curriculum” learning to effectively improve ASR performance in noisy environments, integrating warm-up, self-supervised learning and fine-tuning. Credit: CAAI Artificial Intelligence Research, Tsinghua University Press

Popular voice assistants like Siri and Amazon Alexa have introduced Automatic Speech Recognition (ASR) to the masses. Although decades in the making, ASR models struggle to be consistent and reliable, especially in noisy environments. Chinese researchers have developed a framework that effectively improves the performance of ASR for the chaos of everyday acoustic environments.

Researchers from Hong Kong University of Science and Technology and WeBank have proposed a new phonetic-semantic pre-training (PSP) framework and demonstrated the robustness of their new model on synthetic, highly noisy speech datasets.

Their study was published in CAAI Artificial Intelligence Research on August 28.

“Robustness has been a long-standing challenge for ASR,” said Xueyang Wu from the Department of Computer Science and Engineering at Hong Kong University of Science and Technology. “We want to increase the robustness of the Chinese ASR system at a lower cost.”

ASR uses machine learning and other artificial intelligence techniques to automatically translate speech into text for purposes such as voice-activated systems and transcription software. But new consumer-focused applications are increasingly demanding that speech recognition perform better: handle more languages and accents, and perform more reliably in real-world situations such as video conferences and live interviews.

Traditionally, training the acoustic and language models that make up ASR requires large amounts of noise-specific data, which can be time-consuming and expensive to collect.

The acoustic model (AM) turns spoken words into “phones”, which are sequences of basic sounds. The language model (LM) decodes phones into natural language sentences, usually with a two-step process: a fast but relatively weak LM generates a set of candidate sentences, and a powerful but computationally expensive LM selects the best sentence among the candidates.
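The two-step decoding described above can be sketched roughly as follows. This is only an illustration, not the researchers' implementation: the function names, the toy scores and the interpolation `weight` are all hypothetical, and real systems use far larger candidate lists and learned models.

```python
from typing import Callable, List, Tuple

# A candidate is a (sentence, first_pass_score) pair; scores are log-like,
# so higher (less negative) is better. All values below are made up.
Candidate = Tuple[str, float]


def two_pass_decode(
    phones: List[str],
    first_pass: Callable[[List[str]], List[Candidate]],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Pick the best sentence by rescoring first-pass candidates."""
    candidates = first_pass(phones)
    if not candidates:
        raise ValueError("first pass produced no candidates")

    def combined(candidate: Candidate) -> float:
        sentence, first_score = candidate
        # Interpolate the cheap first-pass score with the expensive one.
        return (1.0 - weight) * first_score + weight * second_pass_score(sentence)

    return max(candidates, key=combined)[0]


def toy_first_pass(phones: List[str]) -> List[Candidate]:
    # Hypothetical fast, weak LM: returns a short candidate list.
    return [("recognize speech", -1.2), ("wreck a nice beach", -1.0)]


def toy_second_pass(sentence: str) -> float:
    # Hypothetical strong LM: prefers the more plausible sentence.
    scores = {"recognize speech": -0.5, "wreck a nice beach": -4.0}
    return scores.get(sentence, -10.0)


toy_phones = ["r", "eh", "k", "ax", "g", "n", "ay", "z"]  # illustrative only
best = two_pass_decode(toy_phones, toy_first_pass, toy_second_pass)
print(best)
```

The second pass can only choose among the sentences it is given. If the first pass drops the correct sentence from the candidate list, no amount of rescoring brings it back.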

“Traditional learning models are not robust against noisy acoustic model outputs, especially for Chinese polyphonic words with identical pronunciation,” Wu said. “If the first pass of the learning model decoding is incorrect, it is extremely difficult for the second pass to catch up.”

The newly proposed PSP framework makes it easier to recover misrecognized words. By pre-training a model that translates AM outputs directly into sentences with full contextual information, researchers can help the LM recover efficiently from noisy AM outputs.

The PSP framework allows the model to improve through a pre-training regimen called a noise-aware curriculum, which gradually introduces new skills, starting with easy tasks and progressing to more complex ones.

“The most crucial part of our proposed method, noise-aware curriculum learning, simulates the mechanism of how humans recognize a sentence from noisy speech,” Wu said.

Warm-up is the first stage, in which researchers pre-train a phone-to-word transducer on clean phone sequences. These sequences are converted from unlabeled text data alone, which reduces annotation time. This stage “warms up” the model, initializing the basic parameters for mapping phone sequences to words.

In the second stage, self-supervised learning, the transducer learns from more complex data generated by self-supervised training techniques and functions. Finally, the resulting phone-to-word transducer is fine-tuned with real-world speech data.
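A rough sketch of how a curriculum might move from clean to noisier phone sequences is shown below. The article does not describe how the researchers generate noise, so everything here is an assumption made for illustration: random phone substitution as the noise model, a linear noise schedule, and the names `add_phone_noise` and `curriculum_noise_rates`.

```python
import random
from typing import List, Sequence


def add_phone_noise(
    phones: Sequence[str],
    vocab: Sequence[str],
    noise_rate: float,
    rng: random.Random,
) -> List[str]:
    """Replace each phone with a random vocabulary phone with probability noise_rate.

    Assumed noise model for illustration; a replacement may by chance
    equal the original phone.
    """
    noisy = []
    for phone in phones:
        if rng.random() < noise_rate:
            noisy.append(rng.choice(vocab))
        else:
            noisy.append(phone)
    return noisy


def curriculum_noise_rates(num_stages: int, max_rate: float) -> List[float]:
    """Linearly increasing noise rates, starting from clean data (rate 0.0)."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if num_stages == 1:
        return [0.0]
    return [max_rate * i / (num_stages - 1) for i in range(num_stages)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["n", "i", "h", "a", "o", "m", "a1", "sh"]  # toy phone inventory
clean = ["n", "i", "h", "a", "o"]

for rate in curriculum_noise_rates(num_stages=3, max_rate=0.3):
    # Early stages see clean input; later stages see progressively noisier input.
    print(rate, add_phone_noise(clean, vocab, rate, rng))
```

The first stage here corresponds loosely to warm-up on clean sequences, and the later stages to training on harder, corrupted input. Fine-tuning on real speech data would follow as a separate step.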

The researchers experimentally demonstrated the effectiveness of their framework on two real-life datasets collected from industrial scenarios and with synthetic noise. The results showed that the PSP framework effectively improves the traditional ASR pipeline, achieving relative character error rate reductions of 28.63% on the first dataset and 26.38% on the second.
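For readers unfamiliar with the metric, character error rate (CER) is commonly computed as the character-level edit distance between the recognized text and the reference, divided by the reference length. A relative reduction compares a new CER with a baseline CER. The sketch below uses that standard definition with made-up numbers; it does not reproduce the study's data or evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction by which the new CER is lower than the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical numbers, for illustration only.
print(character_error_rate("abcd", "abxd"))    # one substitution in four characters
print(relative_reduction(0.20, 0.15))          # roughly a 25% relative reduction
```

A relative reduction is not the same as an absolute one: in the made-up example, the CER drops by 5 percentage points, which is about 25% of the baseline.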

In the next steps, researchers will investigate more efficient PSP pre-training methods with larger unpaired data sets, seeking to maximize pre-training efficiency for the noise-robust LM.

More information:
Xueyang Wu et al, A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition, CAAI Artificial Intelligence Research (2022). DOI: 10.26599/AIR.2022.9150001

Provided by Tsinghua University Press

Citation: Researchers propose a new, more efficient model for automatic speech recognition (2022, September 2) Retrieved September 2, 2022 from html

This document is subject to copyright. Except for fair use for purposes of private study or research, no part may be reproduced without written permission. The content is provided for information only.
