AI listens by also seeing

Meta AI has released a self-supervised speech recognition model that also uses video, achieving 75% higher accuracy than current state-of-the-art models given the same amount of labeled data.

This new model, Audio-Visual Hidden Unit BERT (AV-HuBERT), uses audio-visual features to improve on models that rely on speech audio alone. The visual features are based on lip reading, much as humans do: watching a speaker's lips helps filter out background noise, a task that is extremely difficult using audio alone.

To generate the training data, a preprocessing step first extracts audio and video features from the input videos and clusters them using k-means. The audio-visual frames are the input to the AV-HuBERT model, and the cluster IDs are the prediction targets (a minimal sketch of this clustering step follows Figure 1).

Figure 1: Clustering of audio and video features
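The snippet below is a minimal sketch of this target-generation idea, assuming random stand-in features: per-frame features are clustered with k-means, and the resulting cluster IDs become the pseudo-labels the model is trained to predict. The feature dimension and the number of clusters are illustrative assumptions, not AV-HuBERT's exact configuration.

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for per-frame audio/visual features: 1000 frames, 39 dimensions (assumed).
rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 39))

# Cluster the frames; each frame's cluster ID becomes its prediction target.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
targets = kmeans.labels_  # shape (1000,), integer IDs in [0, 100)
print(targets[:10])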

The next step is similar to BERT, the self-supervised language model: masks are applied over spans of the audio and visual streams, so the model must predict the hidden content and thereby learn context. By merging the two streams into contextualized representations with a transformer, the loss can be computed on the frames where the audio or visual input is masked.
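Below is a hedged sketch of that masked-prediction objective, not the exact AV-HuBERT configuration: spans of fused audio-visual frame features are masked, a transformer encoder produces contextualized representations, and cross-entropy is computed only at the masked positions. The layer sizes, mask lengths, and fusion details are illustrative assumptions.

import torch
import torch.nn as nn

T, D, n_clusters = 200, 256, 100
features = torch.randn(1, T, D)                  # fused audio-visual frame features (assumed)
targets = torch.randint(0, n_clusters, (1, T))   # k-means cluster IDs per frame

# Mask a few random 10-frame spans.
mask = torch.zeros(1, T, dtype=torch.bool)
for start in torch.randint(0, T - 10, (5,)).tolist():
    mask[0, start:start + 10] = True
masked = features.clone()
masked[mask] = 0.0                               # hide the masked frames

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=4,
)
head = nn.Linear(D, n_clusters)

logits = head(encoder(masked))                   # per-frame cluster predictions
loss = nn.functional.cross_entropy(logits[mask], targets[mask])  # masked frames only

Computing the loss only at masked positions forces the model to infer the hidden cluster IDs from the surrounding audio-visual context, which is what makes the learned representations useful.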

Meta AI has published the code implementing this framework on GitHub.

To load a pre-trained model, the following Python snippet may be useful:

>>> import fairseq
>>> import hubert_pretraining, hubert  # registers the AV-HuBERT task and model with fairseq
>>> ckpt_path = "/path/to/the/checkpoint.pt"
>>> models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
>>> model = models[0]  # the loaded ensemble contains a single model
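Note that hubert_pretraining and hubert are modules from the AV-HuBERT repository; importing them registers the custom task and model classes with fairseq, so the snippet presumably needs to be run from within the repository's source directory.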

Figure 2: Representation of the AV-HuBERT model

This framework can be useful for detecting deepfakes and for generating more realistic avatars in AR: by synchronizing image and speech, the model can help produce speaking avatars whose speech is consistent with their facial movement. Multimodal learning of this kind is always a hot topic in the AI research community. In addition, the model can recognize speech more effectively in noisy environments. Another promising application is lip-syncing in many low-resource languages, since the approach requires less data to train.

