Mozilla Releases the Common Voice 2.0 Dataset

Originally published at: Mozilla Releases the Common Voice 2.0 Dataset - Mycroft

On February 28, Mozilla officially released the Common Voice 2.0 Dataset. Thanks to the efforts of the community at voice.mozilla.org, anyone now has access to the largest transcribed, public domain voice dataset in the world. From Mozilla:

Common Voice 2.0 includes 18 different languages, adding up to almost 1,400 hours of recorded voice data from more than 42,000 contributors! The Common Voice 2.0 data set includes audio and transcripts for Breton, Catalan, Mandarin Chinese (Traditional), Chuvash, Dutch, English, Estonian, French, German, Hakha Chin, Irish, Italian, Kabyle, Kyrgyz, Slovenian, Tatar, Turkish, and Welsh.

 

Mozilla has made this data truly public domain by releasing it under a CC0 “No Rights Reserved” license. This, according to Creative Commons, provides the best opportunity for others to “freely build upon, enhance and reuse the works for any purposes.”

Congratulations to Mozilla for this achievement! If you’re a Mycroft Community member who’s unfamiliar with the Common Voice project, head to voice.mozilla.org to contribute to the Common Voice dataset. You can donate your voice or validate recordings for accuracy.

Image: Mozilla's Common Voice Dataset provides the largest public domain dataset of tagged speech in the world (credit: Mozilla)

Why do I care?

Large, established companies have a big advantage in machine learning because of their access to data. Open datasets allow researchers, projects, companies, and even hobbyists to drive innovation in voice technology forward. That’s certainly something we love here at Mycroft.

The diversity of contributions in this dataset is also an important step toward bringing voice technology to more populations around the world. Common Voice 2.0 represents a global community of voice contributors. In keeping with the open ethos, contributors can opt in to provide metadata like their age, sex, and accent, so that their voice clips are tagged with information useful in training speech recognition engines.

For example, Mozilla recently joined forces with the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) and co-hosted an ideation hackathon in Kigali to create a speech corpus for Kinyarwanda, laying the foundation for local technologists in Rwanda to develop open source voice technologies in their own language.

Aside from speech technology, you can use Common Voice to build an art project, or research linguistic speech patterns, without restriction.

What does this mean for Mycroft?

We’re excited to see Mozilla’s progress on the Common Voice project and the improvements it can bring to the DeepSpeech project. Understanding a broader diversity of speakers – with different accents, pitch, and tone – will eventually make Mycroft more accessible and private for all. Mycroft continues to collect utterances from our Opted-In users to build a Mycroft-specific dataset that will ensure the best usability for Mark I, Picroft, and future Mycroft platforms. You can help by opting in yourself, at the bottom of the Basic Settings in your Mycroft Account.
