Fix CommonVoice dataset for Speech Recognition problem #1852

RegaliaXYZ · 2020-09-17T13:51:05Z

Fixed the Common Voice data generator by adding a language-code flag to the datagen.py file (--language="en"; if not specified, it defaults to English) and dynamically downloading the dataset for the selected language.

Also had to rework the data unpacking, since Mozilla changed the folder structure of the dataset.

Also removed the Common Voice sub-problems (Noisy, Clean, FullTestClean), since all the previous .tsv files were merged (there is no more other-train, other-test, etc.).
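
For readers who want a concrete picture of the language flag described above, here is a minimal sketch. It is not the code from this PR: it uses argparse rather than whatever flag mechanism datagen.py actually uses, and the URL template, build_parser, parse_language, and dataset_url are hypothetical names invented for illustration. Only the flag name (--language) and its default ("en") come from the description.

```python
import argparse

# Hypothetical placeholder template. The real Common Voice download
# location is not stated in this PR, so example.com is used here.
_URL_TEMPLATE = "https://example.com/common-voice/{lang}.tar.gz"


def build_parser():
    """Build a parser exposing a --language flag that defaults to English."""
    parser = argparse.ArgumentParser(description="Common Voice datagen sketch")
    parser.add_argument(
        "--language",
        default="en",
        help="Language code of the Common Voice dataset to download.",
    )
    return parser


def parse_language(argv=None):
    """Return the language code from argv (or sys.argv when argv is None)."""
    return build_parser().parse_args(argv).language


def dataset_url(language):
    """Fill the (assumed) URL template with the requested language code."""
    return _URL_TEMPLATE.format(lang=language)
```

Calling parse_language([]) falls back to the default code, while parse_language(["--language=fr"]) selects French; the result can then be passed to dataset_url to pick the archive to fetch.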

googlebot · 2020-09-17T13:51:13Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

RegaliaXYZ · 2020-09-17T13:51:55Z

@googlebot I signed it!

googlebot · 2020-09-17T13:52:00Z

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

RegaliaXYZ added 3 commits September 17, 2020 15:20

Update to Common Voice data generator

af4414d

Small fix to data unpacking

b6035e2

Small fix to loop

e5b3831

googlebot added the "cla: no" (PR author has not signed CLA) label Sep 17, 2020

googlebot added the "cla: yes" (PR author has signed CLA) label and removed the "cla: no" (PR author has not signed CLA) label Sep 17, 2020

Fixed data generation for all formats of .tsv in CommonVoice Dataset …

cec54db

…by using DictReader. Fixed the relative check not working during extraction. Removed the unnecessary collect_data function, since it is no longer needed.
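
The DictReader change mentioned in this commit can be illustrated with a short sketch. This is an assumption about the general approach, not the code in cec54db: read_clips is a hypothetical helper, and the column names "path" and "sentence" are assumed for illustration. The point is that reading rows by column name keeps working when a .tsv file orders or adds columns differently.

```python
import csv
import io


def read_clips(tsv_file, path_col="path", text_col="sentence"):
    """Return (audio path, transcript) pairs from a tab-separated file.

    Rows are read by column name, so the position of each column and the
    presence of extra columns do not matter. The default column names are
    assumptions made for this sketch.
    """
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[path_col], row[text_col]) for row in reader]


# Two in-memory samples with different column layouts.
layout_a = io.StringIO("client_id\tpath\tsentence\nabc\tclip1.mp3\thello world\n")
layout_b = io.StringIO("sentence\tup_votes\tpath\nhello world\t2\tclip1.mp3\n")

clips_a = read_clips(layout_a)
clips_b = read_clips(layout_b)
```

Both layouts yield the same pair for the clip, which is the property an index-based reader would not give.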
