Google AI introduces CVSS, a Common Voice-based Speech-to-Speech translation corpus that can be used directly for training direct S2ST models without any additional processing
This research summary is based on the paper 'CVSS Corpus and Massively Multilingual Speech-to-Speech Translation'.
Speech-to-speech translation (S2ST) is the automatic translation of speech in one language into speech in another. S2ST models are widely regarded as a way to bridge communication gaps between people who speak different languages.
S2ST systems are traditionally built as a text-centric cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech synthesis (TTS) subsystems. Recent studies have introduced direct S2ST, which does not rely on an intermediate text representation. However, very few publicly available corpora are directly suitable for such research.
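To make the cascade structure concrete, here is a minimal sketch of how three subsystems compose into one speech-to-speech function. The stage signatures and the toy stand-ins (`toy_asr`, `toy_mt`, `toy_tts`) are assumptions for illustration only; real systems would plug trained models into each slot.

```python
from typing import Callable

# Hypothetical stage signatures; in practice each stage is a trained model.
ASR = Callable[[bytes], str]   # source audio -> source-language text
MT = Callable[[str], str]      # source-language text -> target-language text
TTS = Callable[[str], bytes]   # target-language text -> target audio


def make_cascade(asr: ASR, mt: MT, tts: TTS) -> Callable[[bytes], bytes]:
    """Compose three subsystems into a single speech-to-speech function."""
    def translate(source_audio: bytes) -> bytes:
        source_text = asr(source_audio)   # first intermediate text
        target_text = mt(source_text)     # second intermediate text
        return tts(target_text)
    return translate


# Toy stand-ins so the sketch runs end to end: "audio" is just UTF-8 bytes.
toy_asr: ASR = lambda audio: audio.decode("utf-8")
toy_mt: MT = lambda text: {"bonjour": "hello"}.get(text, text)
toy_tts: TTS = lambda text: text.encode("utf-8")

cascade = make_cascade(toy_asr, toy_mt, toy_tts)
result = cascade(b"bonjour")
print(result)
```

A direct S2ST model replaces the whole `translate` body with a single model call from source audio to target audio, so no intermediate text is produced.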
A new Google study has released CVSS, a speech-to-speech translation corpus based on Common Voice. CVSS provides sentence-level parallel speech-to-speech translation pairs from 21 languages, ranging from Arabic to Slovenian, into English. The source speeches in these 21 languages, 1,153 hours in total, are crowdsourced recordings from human volunteers in the Common Voice project.
The CVSS corpus is derived directly from the CoVoST 2 speech-to-text translation (ST) corpus, which is in turn derived from the Common Voice speech corpus.
- Common Voice is a multilingual transcribed speech corpus created for ASR. It was collected by crowdsourcing: volunteers were asked to read text from Wikipedia and other text corpora. The current version 7 contains 11,192 hours of validated speech in 76 languages.
- CoVoST 2 is a large-scale multilingual ST corpus based on Common Voice. It covers translation from 21 languages into English and from English into 15 languages. Professional translators produced the translations from the Common Voice scripts. In total, there are 1,154 hours of speech in the 21 X-En language pairs.
For every source speech, two versions of the English translation speech are provided, both synthesized using state-of-the-art TTS systems. Each version offers distinct value, as described below:
- CVSS-C: All of the translation speeches, totaling 719 hours, are in the voice of a single canonical speaker with a consistent speaking style. Despite being synthetic, these speeches have a high level of naturalness and cleanness. These properties simplify target speech modeling and enable trained models to produce high-quality translated speech suitable for user-facing applications.
- CVSS-T: The translation speeches, totaling 784 hours, are in voices transferred from the corresponding source speeches. Despite being in different languages, the two sides of each S2ST pair have similar voices. This makes the dataset suitable for building models that preserve speakers' voices when translating speech into another language.
Including the source speeches, the two S2ST datasets contain 1,872 and 1,937 hours of speech, respectively. CVSS also provides normalized translation text that matches the pronunciation in the translation speech, which can aid both model training and evaluation.
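The dataset totals appear to be the source hours plus each version's synthesized target hours. A quick check using only the figures quoted in this article (the variable names are illustrative):

```python
# Hour counts as quoted in this article.
SOURCE_HOURS = 1153                              # Common Voice source speech, 21 languages
TARGET_HOURS = {"CVSS-C": 719, "CVSS-T": 784}    # synthesized English translation speech

# Each S2ST dataset pairs the same source speech with one target version.
totals = {name: SOURCE_HOURS + hours for name, hours in TARGET_HOURS.items()}

for name, total in totals.items():
    print(f"{name}: {total} hours")
# CVSS-C: 1872 hours
# CVSS-T: 1937 hours
```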
The target speeches in CVSS are translations rather than interpretations. Translation is typically literal and accurate, while interpretation usually summarizes and often omits less relevant details. Interpretation also tends to have more linguistic variation and disfluency.
The team trained and compared a baseline cascade S2ST model and two baseline direct S2ST models on each CVSS version.
S2ST cascade: To build strong cascade S2ST baselines, the team trained an ST model on CoVoST 2. The ST model outperformed the previous state of the art by +5.8 average BLEU across the 21 language pairs (detailed in the paper) when trained on the corpus alone, without extra data. This ST model is then coupled with the same TTS models used to build CVSS, giving very strong cascade baselines (ST → TTS).
Direct S2ST: The team built two baseline direct S2ST models using Translatotron and Translatotron 2. When trained from scratch on CVSS, the translation quality of Translatotron 2 (8.7 BLEU) approaches that of the strong cascade S2ST baseline (10.6 BLEU). Moreover, when pre-training is applied to both, the gap on ASR-transcribed translation shrinks to only 0.7 BLEU.
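Because S2ST output is audio, translation quality is scored by transcribing the translated speech with ASR and comparing the transcripts to reference text. The sketch below illustrates that idea under stated assumptions: `corpus_bleu` is a simplified BLEU (single reference, whitespace tokens, no smoothing), not the scoring setup used in the paper, and `toy_asr` is a hypothetical stand-in for a real recognizer.

```python
import math
from collections import Counter
from typing import Callable, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[str], refs: List[str], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU on a 0..100 scale (one reference each)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: credit an n-gram at most as often as the reference has it.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


def asr_bleu(clips: List[bytes], refs: List[str], asr: Callable[[bytes], str]) -> float:
    """Transcribe each translated clip, then score transcripts against references."""
    return corpus_bleu([asr(clip) for clip in clips], refs)


# Hypothetical ASR stand-in: treats the "audio" bytes as UTF-8 text.
toy_asr = lambda clip: clip.decode("utf-8")

refs = ["the cat sat on the mat", "hello world how are you"]
perfect = asr_bleu([r.encode("utf-8") for r in refs], refs, toy_asr)
partial = asr_bleu([b"the cat sat on the rug"], ["the cat sat on the mat"], toy_asr)
print(perfect, partial)
```

One consequence of this setup is that ASR errors lower the score even when the spoken translation is correct, which is part of why normalized reference text that matches the pronunciation is useful.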