spectral feature script: parallelize, test at scale. #51
Comments
@MichaelGoodale Something to bear in mind for parallelization here: you probably want to parallelize in batches, given how quickly each row of the CSV is processed (a second or two) and the latency involved in just starting and stopping a job on a single core. Instead, it is probably better to send a batch of 50-100 rows to each core to be processed. There must be parallelization libraries in R that make this easy (or do it by default).
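The batching idea above could be sketched roughly as follows, using only the base `parallel` package. This is a minimal sketch under stated assumptions, not the script discussed in this issue: `process_row`, `process_batch`, the toy `rows` data frame, and the batch size of 50 are all hypothetical placeholders.

```r
library(parallel)

# Hypothetical stand-in for the per-row spectral computation.
process_row <- function(row) {
  data.frame(id = row$id, value = row$x^2)
}

# Process one batch (a data frame of rows) on a worker and
# return a single data frame of results for that batch.
process_batch <- function(batch, row_fun) {
  per_row <- lapply(seq_len(nrow(batch)), function(i) {
    row_fun(batch[i, , drop = FALSE])
  })
  do.call(rbind, per_row)
}

# Toy input standing in for the exported CSV (assumption).
rows <- data.frame(id = 1:230, x = as.numeric(1:230))

# Assign consecutive rows to batches of at most batch_size rows.
batch_size <- 50
batch_index <- ceiling(seq_len(nrow(rows)) / batch_size)
batches <- split(rows, batch_index)

# Each task sent to a worker is now a whole batch, not a single row.
cl <- makeCluster(2)
results <- parLapply(cl, batches, process_batch, row_fun = process_row)
stopCluster(cl)

features <- do.call(rbind, results)
```

With whole batches as the unit of work, the per-task startup cost is paid once per 50 rows rather than once per row.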
So the parallelisation makes the script quite fast: over 36667 sibilants from SOTC, it takes only about a minute.
Whoa! And this is just on roquefort with #? cores?
20 cores on roquefort! The only issue with the script now is that different corpora have different directory structures, so you need to edit the script for each corpus (i.e., depending on whether the audio files are all in one directory or in speaker directories).
Excellent! Could you check the new script into the spade repo? That's OK; can you just leave some comments in the script about how it has to be changed? I assume there are just 2-3 possibilities.
While you (@MichaelGoodale) and Michael M are still working, I would like to optimize the spectral features R script referred to here:
https://iscan.readthedocs.io/en/latest/tutorials_iscan.html#tutorial-4-custom-scripts
and test that tabular import/export works at scale -- on one of our large corpora -- in reasonable time.
The barrier to doing this testing before was that the script is slow: it loops through every row of the CSV to calculate spectral features, one row at a time. You should:
1. Modify the script so it can run in parallel, with a user-specified number of cores. Just add a binary flag and nCores as arguments the user has to fill in at the top of the script, and make the default no parallelization (so the demo you wrote for ISCAN RTD works out of the box). doParallel/foreach is one way to do this, I think.
2. Run the script for all sibilants from two large corpora which also have different phonesets -- let's say Buckeye and SOTC. So you'd need to tabular export info about all sibilants, then run using the parallel option on roquefort.
2a. Optional: if this takes too long on roquefort, figure out how to do it on the Compute Canada cluster. (I have a working example somewhere of R in parallel on the CC cluster if needed.) The main issue here might be having enough space to store datasets on CC servers; let me know if it's an issue.
3. Tabular import for the two corpora.
4. For the two corpora, do the same export as in sibilants.py (all word-initial stressed-syllable sibilants etc., one column gives speech rate, another the word label...), but exporting all the new measures (calculated with the R script) rather than the Praat-script-calculated measures we've used before.
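For step 1, a binary flag plus nCores with a serial default might look something like the sketch below, using foreach and doParallel. Everything except those two packages is an assumption for illustration: `useParallel`, `spectral_features`, `run_all`, and the toy `rows` data frame are hypothetical names, and the real per-row computation is replaced by a placeholder.

```r
library(foreach)
library(doParallel)

# User-set options at the top of the script (illustrative names).
useParallel <- FALSE  # default: no parallelization
nCores <- 2           # only used when useParallel is TRUE

# Hypothetical stand-in for the per-row spectral computation.
spectral_features <- function(row) {
  data.frame(id = row$id, value = row$x * 2)
}

# Apply row_fun to every row, serially or on a local cluster.
run_all <- function(rows, row_fun, useParallel = FALSE, nCores = 1) {
  if (useParallel) {
    cl <- parallel::makeCluster(nCores)
    # Clean up the cluster and restore the sequential backend on exit.
    on.exit({
      parallel::stopCluster(cl)
      registerDoSEQ()
    }, add = TRUE)
    registerDoParallel(cl)
    foreach(i = seq_len(nrow(rows)), .combine = rbind) %dopar%
      row_fun(rows[i, , drop = FALSE])
  } else {
    foreach(i = seq_len(nrow(rows)), .combine = rbind) %do%
      row_fun(rows[i, , drop = FALSE])
  }
}

# Toy input standing in for the exported CSV (assumption).
rows <- data.frame(id = 1:20, x = as.numeric(1:20))
features <- run_all(rows, spectral_features, useParallel, nCores)
```

Passing the row function in as an argument keeps it local to `run_all`, so foreach can export it to the workers without extra configuration; a function that depends on other packages or global objects would need `.packages` or `.export` as well.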
While you are doing this, please keep a record of how long steps 2, 3, and 4 (each) take -- to assess feasibility of getting these measures across many corpora.
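One lightweight way to keep that record is to wrap each step in `system.time` and collect the elapsed seconds. The sketch below is only illustrative: `time_step` is a hypothetical helper, and the `Sys.sleep` calls are placeholders for the actual run, import, and export steps.

```r
# Evaluate an expression and return its elapsed wall-clock time in seconds.
time_step <- function(expr) {
  unname(system.time(expr)["elapsed"])
}

# Placeholders: replace each Sys.sleep() with the real step.
timings <- c(
  step2_run_script = time_step(Sys.sleep(0.1)),
  step3_import     = time_step(Sys.sleep(0.1)),
  step4_export     = time_step(Sys.sleep(0.1))
)

# One row per step, ready to paste into the issue or write to a CSV.
timing_table <- data.frame(step = names(timings), seconds = timings,
                           row.names = NULL)
print(timing_table)
```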