MLCommons Releases Multilingual Voice Dataset and Large, Diverse 30,000-Hour English Dataset to Support the Democratization of Machine Learning


The MLCommons Association, an open engineering community dedicated to making machine learning more accessible to everyone, has released free datasets and technologies to help democratize machine learning. The two new datasets are the People’s Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). Organizations can use these open-licensed datasets to build improved artificial intelligence models.

About the MLCommons Association:

MLCommons’ goal is to level the playing field for the development of AI. Small businesses are at a clear disadvantage when developing speech recognition models, as the most comprehensive datasets often come with high license fees. Additionally, tech giants such as Google LLC and Apple Inc. can amass large amounts of free training data through devices like cell phones.

The MLCommons Association is focused on collaborative engineering work that builds tools for the entire machine learning industry, including benchmarks and performance metrics, public datasets, and best practices. MLCommons works with its more than 50 founding members and partners: global technology providers, academics, and researchers.

The People’s Speech Dataset:

The People’s Speech Dataset is a supervised conversational dataset with 30,000 hours of data. It is one of the most comprehensive English speech datasets in the world, and it is free to use for both academic and commercial purposes. The dataset aims to make voice technologies, such as voice assistants and transcription, more accessible to everyone, while enabling the machine learning community to innovate. Researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA contributed to the dataset.
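A corpus at this scale is typically distributed with a machine-readable manifest listing each clip's audio file, transcript, duration, and license. The actual manifest format of the People's Speech release is not shown in this article, so the JSON-lines layout, the field names (`audio_path`, `duration_s`, `license`), and the `summarize` helper below are illustrative assumptions — a minimal sketch of the kind of bookkeeping a 30,000-hour dataset requires:

```python
import json

# Hypothetical JSON-lines manifest, one entry per audio clip.
# (The real People's Speech manifest format may differ.)
MANIFEST = """\
{"audio_path": "clip_0001.flac", "duration_s": 12.4, "text": "hello world", "license": "CC-BY"}
{"audio_path": "clip_0002.flac", "duration_s": 7.1, "text": "open the door", "license": "CC-BY-SA"}
{"audio_path": "clip_0003.flac", "duration_s": 31.0, "text": "good morning", "license": "CC-BY"}
"""

def summarize(manifest_text, license_filter=None):
    """Count clips and sum total hours, optionally keeping one license only."""
    total_s, clips = 0.0, 0
    for line in manifest_text.splitlines():
        entry = json.loads(line)
        if license_filter and entry["license"] != license_filter:
            continue
        total_s += entry["duration_s"]
        clips += 1
    return clips, total_s / 3600.0

clips, hours = summarize(MANIFEST, license_filter="CC-BY")
print(clips, round(hours, 6))
```

Filtering by license matters in practice: a dataset released for "both academic and commercial purposes" can still mix licenses (e.g. CC-BY vs. CC-BY-SA) that downstream users may need to separate.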

Multilingual Spoken Words Corpus (MSWC):

The MSWC is a large spoken-audio dataset with over 340,000 keywords across 50 languages and 23.4 million spoken instances. Previous datasets were often limited to a single language because they relied on manual effort to collect and validate thousands of utterances for each keyword. According to MLCommons, the corpus can be used to train machine learning models for applications such as call centers and smart devices.
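As a rough illustration of how a keyword corpus of this shape might be consumed for training, the sketch below groups clip records by (language, keyword) and draws a class-balanced batch. The record tuples and file paths are hypothetical, not the actual MSWC directory layout:

```python
import random
from collections import defaultdict

# Hypothetical records: (language, keyword, clip_path).
# The real MSWC layout may organize its clips differently.
RECORDS = [
    ("en", "yes", "en/yes/000.opus"),
    ("en", "yes", "en/yes/001.opus"),
    ("en", "no",  "en/no/000.opus"),
    ("es", "sí",  "es/si/000.opus"),
    ("es", "no",  "es/no/000.opus"),
]

def index_by_keyword(records):
    """Group clip paths by (language, keyword) pair."""
    index = defaultdict(list)
    for lang, word, path in records:
        index[(lang, word)].append(path)
    return index

def balanced_batch(index, rng):
    """Draw one clip per (language, keyword) class — a crude balanced batch."""
    return [rng.choice(clips) for clips in index.values()]

idx = index_by_keyword(RECORDS)
batch = balanced_batch(idx, random.Random(0))
print(len(idx), len(batch))  # 4 keyword classes, 4 clips
```

Balancing per keyword class is a common precaution for keyword-spotting models, since frequent words would otherwise dominate training.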

The MLCommons Association invites people to participate in the new DataPerf benchmark suite, which measures and promotes data-driven AI research.

What is this DataPerf?

It promotes data-centric AI innovation by assessing the quality of datasets for common machine learning tasks and the impact of improving those datasets. Understanding and improving datasets receives far less attention than building and refining models. DataPerf encourages and tracks progress in these critical areas. Traditionally, AI research has focused on improving model architectures and making them available to the public, while the engineering and maintenance of datasets has lagged behind and is often laborious and ad hoc.

The MLCommons Association is a strong supporter of data-centric AI (DCAI), a discipline that focuses on the methodical engineering of data for AI systems, creating software tools and effective engineering techniques to make the development and maintenance of datasets more efficient. Open datasets and technologies such as DataPerf help drive innovation in machine learning and support the DCAI movement.

