More public data is key to democratizing ML, says MLCommons • The Register
Unless you’re an English speaker with as neutral an American accent as possible, you’ve probably run into a digital assistant that can’t understand you. With luck, two open source datasets from MLCommons could help future systems understand your voice.
The two datasets, which were made generally available in December, are the People’s Speech Dataset (PSD), a database of 30,000 hours of spontaneous English speech; and the Multilingual Spoken Words Corpus (MSWC), a dataset of some 340,000 keywords in 50 languages.
By making both datasets publicly available under the CC-BY and CC-BY-SA licenses, MLCommons hopes to democratize machine learning – that is, make it available to everyone – and help push the industry towards data-centric AI.
David Kanter, executive director and founder of MLCommons, told Nvidia in a podcast this week that he sees data-centric AI as a conceptual pivot from “which model is more accurate” to “what can we do with the data to improve the accuracy of the model.” For that, Kanter said, the world needs a lot of data.
Increasing understanding with the people’s speech
Spontaneous speech recognition is still a challenge for AIs, and PSD could help learning machines better understand colloquial speech, speech disorders, and accents. Had a database like this existed earlier, said PSD project manager Daniel Galvez, “we’d probably be talking to our digital assistants in a much less robotic way.”
The 30,000 hours of speech in the People’s Speech Dataset were extracted from a total of 50,000 hours of publicly available speech taken from the Internet Archive digital library, and the dataset has two distinguishing qualities. First, it is entirely spontaneous speech, which means it contains all the tics and imperfections of the average conversation. Second, all of it came with transcripts.
Using some tricks from a CUDA-powered inference engine, the team behind PSD cut the time needed to label this massive dataset to just two days. The end result is a dataset that can help chatbots and other speech recognition programs better understand people whose voices differ from those of English-speaking white American men.
Galvez said that speech disorders, neurological issues, and accents are all poorly represented in existing datasets, and as a result, “[those types of speech] are not well understood by commercial products.”
Here again, Kanter said, such products fall short for lack of data that represents a diverse range of speakers.
A corpus to expand the reach of digital assistants
The Multilingual Spoken Words Corpus is a different animal from the PSD. Instead of full sentences, the corpus consists of 340,000 keywords in 50 languages. “To our knowledge, this is the only open source spoken word dataset for 46 of those 50 languages,” Kanter said.
Digital assistants, like chatbots, are prone to bias stemming from their training datasets, which has kept them from catching on as quickly as they could have. Kanter predicts that digital assistants will be available worldwide “by the middle of the decade,” and he sees the MSWC as a key foundation for getting there.
“When you look at equivalent databases, it’s Mandarin, English, Spanish, and then it drops pretty quickly,” Kanter said.
Kanter said the datasets have already been tested by some of MLCommons’ member companies. So far, he said, they’ve been used to denoise audio and video recordings of crowded rooms and conferences, and to improve speech recognition.
In the near future, Kanter said he hopes the datasets will be widely adopted and used alongside other public datasets that typically serve as sources for ML and AI researchers. ®