spectral feature script: parallelize, test at scale. #51

Open · 3 of 6 tasks
msonderegger opened this issue Aug 9, 2019 · 5 comments

msonderegger (Member) commented Aug 9, 2019

While you (@MichaelGoodale) and Michael M are still working, I would like to optimize the spectral features R script referred to here:
https://iscan.readthedocs.io/en/latest/tutorials_iscan.html#tutorial-4-custom-scripts

and test that tabular import/export works at scale -- on one of our large corpora -- in reasonable time.

The barrier to doing this testing before was that the script is slow: it loops through every row of the CSV to calculate spectral features, one row at a time. You should:

  1. Modify the script so it can run in parallel, with a user-specified number of cores. Just add a binary flag and nCores as arguments the user has to fill in at the top of the script, and make the default no parallelization (so the demo you wrote for the ISCAN RTD works out of the box). doParallel/foreach is one way to do this, I think (see the sketch after the checklist below).

  2. Run the script for all sibilants from two large corpora which also have different phonesets -- let's say Buckeye and SOTC. So you'd need to do a tabular export of info about all sibilants, then run the script using the parallel option on roquefort.

  2a. Optional: if this takes too long on roquefort, figure out how to do it on the Compute Canada cluster. (I have a working example somewhere of R in parallel on the CC cluster if needed.) The main issue here might be having enough space to store datasets on the CC servers; let me know if that's an issue.

  3. Tabular import for the two corpora.

  4. For the two corpora, do the same export as in sibilants.py (all word-initial stressed-syllable sibilants etc., one column gives speech rate, another the word label...), but exporting all the new measures (calculated with the R script) rather than the Praat-script-calculated measures we've used before.

While you are doing this, please keep a record of how long steps 2, 3, and 4 each take, to assess the feasibility of getting these measures across many corpora.

  • modify script to allow running in parallel
  • tabular export for Buckeye, SOTC
  • run R script for Buckeye
  • run R script for SOTC
  • import results back into ISCAN-accessible databases on roquefort
  • do export of these measures as in sibilants.py
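
A minimal sketch of how the parallel option in step 1 could be wired up with doParallel/foreach; the names used here (run_parallel, n_cores, get_spectral_features) are placeholders, not the script's actual variables:

```r
library(foreach)
library(doParallel)

# User-set options at the top of the script.
run_parallel <- FALSE   # default: no parallelization, so the ISCAN RTD demo works out of the box
n_cores <- 1            # number of cores to use when run_parallel is TRUE

sibilants <- read.csv("sibilants.csv")

if (run_parallel) {
  cl <- makeCluster(n_cores)
  registerDoParallel(cl)
} else {
  registerDoSEQ()       # foreach runs sequentially if no parallel backend is registered
}

# get_spectral_features() stands in for the per-token measurement code.
results <- foreach(i = seq_len(nrow(sibilants)), .combine = rbind) %dopar% {
  get_spectral_features(sibilants[i, ])
}

if (run_parallel) stopCluster(cl)
```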
msonderegger (Member Author) commented
@MichaelGoodale Something to bear in mind for parallelization here: you probably want to parallelize in batches, given how quickly each row of the CSV is processed (a second or two) and the latency involved in starting and stopping a job on a single core. Instead of sending rows one at a time, it is probably better to send off a batch of, say, 50-100 rows to each core to be processed (see the sketch below). There must be parallelization libraries in R that make this easy (or do it by default).
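
A rough sketch of the batching idea, using the same placeholder names as the sketch above: split the row indices into chunks of roughly 50-100 and hand one chunk to each foreach task, so the per-task startup cost is amortized.

```r
batch_size <- 100
batches <- split(seq_len(nrow(sibilants)),
                 ceiling(seq_len(nrow(sibilants)) / batch_size))

# Each task processes a whole batch of rows rather than a single row.
results <- foreach(idx = batches, .combine = rbind) %dopar% {
  do.call(rbind, lapply(idx, function(i) get_spectral_features(sibilants[i, ])))
}
```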

MichaelGoodale (Contributor) commented
So the parallelisation makes the script quite fast: over the 36,667 sibilants from SOTC, it takes only about a minute.

msonderegger (Member Author) commented
Whoa! And this is just on roquefort, with how many cores?

MichaelGoodale (Contributor) commented
20 cores on roquefort! The only issue with the script now is that different corpora have different directory structures, so you need to edit the script for each corpus.

(I.e. whether the audio files are all in one directory, or in per-speaker directories.)
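
One hypothetical way to keep the corpus-specific layout choice in a single commented block at the top of the script (corpus_root, flat_layout, and sound_file_path are illustrative names, not the script's actual variables):

```r
corpus_root <- "/path/to/corpus"   # edit per corpus
flat_layout <- TRUE                # TRUE: all audio files in one directory; FALSE: one subdirectory per speaker

# Build the path to a token's sound file under either layout.
sound_file_path <- function(speaker, file_name) {
  if (flat_layout) {
    file.path(corpus_root, file_name)
  } else {
    file.path(corpus_root, speaker, file_name)
  }
}
```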

msonderegger (Member Author) commented
Excellent! Could you check the new script in to the spade repo?

That's OK; can you just leave some comments in the script about how it has to be changed? I assume there are just 2-3 possibilities.
