spectral feature script: parallelize, test at scale. #51

Open · 3 of 6 tasks
msonderegger opened this issue Aug 9, 2019 · 5 comments

msonderegger (Member) commented Aug 9, 2019

While you (@MichaelGoodale) and Michael M are still working, I would like to optimize the spectral features R script referred to here:
https://iscan.readthedocs.io/en/latest/tutorials_iscan.html#tutorial-4-custom-scripts

and test that tabular import/export works at scale -- on one of our large corpora -- in reasonable time.

The barrier to doing this testing before was that the script is slow: it loops through every row of the CSV to calculate spectral features, one row at a time. You should:

  1. Modify the script so it can run in parallel, with a user-specified number of cores. Just add a binary flag and nCores as arguments the user has to fill in at the top of the script, and make the default no parallelization (so the demo you wrote for the ISCAN RTD works out of the box). doParallel/foreach is one way to do this, I think (see the sketch after the checklist below).

  2. Run the script for all sibilants from two large corpora which also have different phonesets -- let's say Buckeye and SOTC. So you'd need to do a tabular export of info about all sibilants, then run the script using the parallel option on roquefort.

  2a. Optional: if this takes too long on roquefort, figure out how to do it on the Compute Canada cluster. (I have a working example somewhere of R in parallel on the CC cluster if needed.) The main issue here might be having enough space to store datasets on the CC servers; let me know if that's an issue.

  3. Tabular import for the two corpora.

  4. For the two corpora, do the same export as in sibilants.py (all word-initial stressed-syllable sibilants etc., one column gives speech rate, another the word label...), but exporting all the new measures (calculated with the R script) rather than the Praat-script-calculated measures we've used before.

While you are doing this, please keep a record of how long steps 2, 3, and 4 each take, to assess the feasibility of getting these measures across many corpora.

  • modify script to allow running in parallel
  • tabular export for Buckeye, SOTC
  • run R script for Buckeye
  • run R script for SOTC
  • import results back into ISCAN-accessible databases on roquefort
  • do export of these measures as in sibilants.py
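
A minimal sketch of how the parallel option in step 1 could be wired up with doParallel/foreach; the names used here (run_parallel, n_cores, get_spectral_features) are placeholders, not the script's actual variables:

```r
library(foreach)
library(doParallel)

# User-set options at the top of the script.
run_parallel <- FALSE   # default: no parallelization, so the ISCAN RTD demo works out of the box
n_cores <- 1            # number of cores to use when run_parallel is TRUE

sibilants <- read.csv("sibilants.csv")

if (run_parallel) {
  cl <- makeCluster(n_cores)
  registerDoParallel(cl)
} else {
  registerDoSEQ()       # foreach runs sequentially if no parallel backend is registered
}

# get_spectral_features() stands in for the per-token measurement code.
results <- foreach(i = seq_len(nrow(sibilants)), .combine = rbind) %dopar% {
  get_spectral_features(sibilants[i, ])
}

if (run_parallel) stopCluster(cl)
```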
msonderegger (Member Author) commented
@MichaelGoodale Something to bear in mind for parallelization here: you probably want to parallelize in batches, given how quickly each row of the CSV is processed (a second or two) and the latency involved in starting and stopping a job on a single core. Instead of sending rows one at a time, it is probably better to send off a batch of, say, 50-100 rows to each core to be processed (see the sketch below). There must be parallelization libraries in R that make this easy (or do it by default).
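
A rough sketch of the batching idea, using the same placeholder names as the sketch above: split the row indices into chunks of roughly 50-100 and hand one chunk to each foreach task, so the per-task startup cost is amortized.

```r
batch_size <- 100
batches <- split(seq_len(nrow(sibilants)),
                 ceiling(seq_len(nrow(sibilants)) / batch_size))

# Each task processes a whole batch of rows rather than a single row.
results <- foreach(idx = batches, .combine = rbind) %dopar% {
  do.call(rbind, lapply(idx, function(i) get_spectral_features(sibilants[i, ])))
}
```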

MichaelGoodale (Contributor) commented
So the parallelisation makes the script quite fast: over the 36,667 sibilants from SOTC, it takes only about a minute.

msonderegger (Member Author) commented
Whoa! And this is just on roquefort, with how many cores?

MichaelGoodale (Contributor) commented
20 cores on roquefort! The only issue with the script now is that different corpora have different directory structures, so you need to edit the script for each corpus.

(I.e. whether the audio files are all in one directory, or in per-speaker directories.)
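
One hypothetical way to keep the corpus-specific layout choice in a single commented block at the top of the script (corpus_root, flat_layout, and sound_file_path are illustrative names, not the script's actual variables):

```r
corpus_root <- "/path/to/corpus"   # edit per corpus
flat_layout <- TRUE                # TRUE: all audio files in one directory; FALSE: one subdirectory per speaker

# Build the path to a token's sound file under either layout.
sound_file_path <- function(speaker, file_name) {
  if (flat_layout) {
    file.path(corpus_root, file_name)
  } else {
    file.path(corpus_root, speaker, file_name)
  }
}
```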

msonderegger (Member Author) commented
Excellent! Could you check the new script in to the spade repo?

That's OK; can you just leave some comments in the script about how it has to be changed? I assume there are just 2-3 possibilities.
