auDeep is a Python toolkit for unsupervised feature learning with deep neural networks (DNNs). Currently, the main focus of this project is feature extraction from audio data with deep recurrent autoencoders. However, the core feature learning algorithms are not limited to audio data. Furthermore, we plan on implementing additional DNN-based feature learning approaches.
(c) 2017-2018 Michael Freitag, Shahin Amiriparian, Sergey Pugachevskiy, Nicholas Cummins, Björn Schuller: Universität Passau. Published under GPLv3, see the LICENSE.md file for details.
Please direct any questions or requests to Shahin Amiriparian (shahin.amiriparian at tum.de) or Michael Freitag (freitagm at fim.uni-passau.de).
If you use auDeep or any code from auDeep in your research work, you are kindly asked to acknowledge the use of auDeep in your publications.
M. Freitag, S. Amiriparian, S. Pugachevskiy, N. Cummins, and B. Schuller. auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks, arXiv preprint arXiv:1712.04382, 2017, 5 pages
S. Amiriparian, M. Freitag, N. Cummins, and B. Schuller. Sequence to sequence autoencoders for unsupervised representation learning from audio, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, pp. 17-21, 2017
This project ships with a setup.py file which allows for installation and automatic dependency resolution through pip. We strongly recommend installing the project in its own Python virtualenv, since the TensorFlow source code needs to be patched in order to fix a bug that has not yet been resolved in the official repositories (see below).
The minimal requirements to install auDeep are listed below.
- Python 3.5
- TkInter (python3-tk package on Ubuntu, selectable during the Python installation on Windows) if the interactive Matplotlib backend is used. Check that TkInter is installed and configured correctly with python3 -c "import _tkinter; import tkinter; tkinter._test()". A small window with two buttons should appear.
- virtualenv (pip3 install virtualenv)
The setup.py script automatically checks whether a compatible CUDA version is available on the system. If GPU support is desired, make sure that the following dependencies are installed before installing auDeep.
- CUDA Toolkit 8.0
- cuDNN 5.1 (optional)
By default, the CUDA libraries are required to be available on the system path (which should be the case after a standard installation). If, for some reason, the setup.py script fails to detect the correct CUDA version, consider manually installing tensorflow-gpu 1.1.0 prior to installing auDeep.
These Python packages are installed automatically during setup by pip, and are listed here only for completeness.
- cliff
- liac-arff
- matplotlib
- netCDF4
- pandas
- scipy
- sklearn
- tensorflow or tensorflow-gpu 1.1.0
- xarray
The recommended type of installation is through pip in a separate virtualenv. This guide outlines the standard installation procedure, assuming the following initial directory structure.
. Working directory
└─── auDeep Project root directory
├─── audeep Python source directory
| └─── ...
├─── samples Scripts to reproduce experiments reported in our JMLR submission
| └─── ...
├─── patches
| └─── fix_import_bug.patch Patch to fix TensorFlow bug
├─── .gitignore
├─── LICENSE.md License information
├─── README.md This readme
└─── setup.py Setup script
Start by creating a Python virtualenv for the installation
> virtualenv -p python3 audeep_virtualenv
This will create a folder named audeep_virtualenv in the current working directory, containing a minimal Python environment. The name and location of the virtualenv can be chosen arbitrarily but, for the purpose of this guide, we assume that it is named audeep_virtualenv and stored alongside the project root directory. Subsequently, activate the virtualenv with
Linux:
> source audeep_virtualenv/bin/activate
Windows:
> .\audeep_virtualenv\Scripts\activate.bat
If everything went well, you should see (audeep_virtualenv) prepended to the command prompt, indicating that the virtualenv is active. Continue by installing auDeep with
> pip3 install ./auDeep
where auDeep is the name of the project root directory containing the setup.py file. This will fetch all required dependencies and install them in the virtualenv.
Once installation is complete, the TensorFlow installation within the virtualenv needs to be patched in order to fix a bug that has not yet been resolved in the official repositories. The required changes are provided as a GNU patch file at patches/fix_import_bug.patch below the project root directory. To apply the patch, the GNU patch utility is required, which should be installed by default on Linux systems. On Windows, it can be obtained, for example, through cygwin. Please note that you have to manually select the patch package during the installation of cygwin, as it is not installed by default.
Linux:
> patch audeep_virtualenv/lib/python3.5/site-packages/tensorflow/python/framework/meta_graph.py auDeep/patches/fix_import_bug.patch
Windows:
> patch .\audeep_virtualenv\Lib\site-packages\tensorflow\python\framework\meta_graph.py .\auDeep\patches\fix_import_bug.patch
This completes the installation, and the toolkit CLI can be accessed through the audeep binary.
> audeep --version
The virtualenv can be deactivated at any time using the deactivate command.
> deactivate
In this section, we provide a step-by-step tutorial on how to perform representation learning with auDeep. A complete documentation of the auDeep command line interface can be found in the next section.
We assume in this guide that auDeep has been installed by following the instructions above, and that the audeep executable is accessible from the command line. This can be checked, for instance, by typing audeep --version, which should print some copyright information and the application version.
Representation learning with auDeep is performed in several distinct stages.
- Extraction of spectrograms and data set metadata from raw audio files (audeep preprocess)
- Training of a DNN on the extracted spectrograms (audeep ... train)
- Feature generation using a trained DNN (audeep ... generate)
- Evaluation of generated features (audeep ... evaluate)
- Exporting generated features to CSV/ARFF (audeep export)
We use the ESC-10 data set for environmental sound classification in this guide, which contains 400 instances from 10 classes. In the command line, navigate to a directory of your choice. In the following, we will assume that any commands are executed from the directory you choose in this step. Retrieve the ESC-10 data set from GitHub with
> git clone https://github.com/karoldvl/ESC-10.git
> pushd ESC-10
> git checkout 553c8f1743b9dba6b282e1323c3ca8fa76923448
> popd
This will store the data set in a subfolder called ESC-10. As the original ESC-10 repository has been merged into the ESC-50 repository, we have to manually check out the correct commit.
First of all, we need to extract spectrograms and some metadata from the raw audio files we downloaded during the previous step. In order to get a general overview of the audio files contained in a data set, we can use the following command.
> audeep inspect raw --basedir ESC-10
This will print some logging messages, and a table containing information about the data set.
+-----------------------------------+
| data set information |
+------------------------+----------+
| number of audio files | 400 |
| number of labels | 10 |
| cross validation folds | 5 |
| minimum sample length | 3.64 s |
| maximum sample length | 7.24 s |
| sample rate | 44100 Hz |
| channels | 1 |
+------------------------+----------+
As we can see, the ESC-10 data set contains audio files that are between 3.64 seconds and 7.24 seconds long, contain one channel, and are sampled at 44.1 kHz.
Next, we are going to determine suitable parameters for spectrogram extraction. Our results have shown that, in general, auDeep requires somewhat larger FFT windows during spectrogram extraction than one would usually use to extract, for example, MFCCs. Furthermore, auDeep works well on mel spectrograms with a relatively large number of frequency bands. As a reasonable starting point for the ESC-10 data set, we recommend 80 ms wide FFT windows with 40 ms overlap, and 128 mel frequency bands.
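To make the relationship between these parameters and the underlying FFT concrete, the following sketch shows how the recommended values translate to samples at the 44.1 kHz sample rate of ESC-10. Note that this uses librosa purely for illustration; auDeep performs spectrogram extraction itself, and the file path is a hypothetical example.

```python
# Illustration only: converts the recommended window parameters to samples and
# computes a comparable mel spectrogram with librosa (not a dependency of auDeep).
import librosa
import numpy as np

sr = 44100                                        # ESC-10 sample rate
window_width, window_overlap = 0.08, 0.04         # seconds, as recommended above

n_fft = int(window_width * sr)                          # 3528 samples per FFT window
hop_length = int((window_width - window_overlap) * sr)  # 1764 samples between windows

# Hypothetical path to one ESC-10 audio file.
y, _ = librosa.load("ESC-10/001 - Dog bark/some_file.ogg", sr=sr)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                     hop_length=hop_length, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)     # highest amplitude becomes 0 dB
```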
In our personal opinion, visual feedback is a great aid in selecting parameters for spectrogram extraction. Use the following command to quickly plot a spectrogram with the parameters recommended above.
> audeep preprocess --basedir ESC-10 --window-width 0.08 --window-overlap 0.04 --mel-spectrum 128 --fixed-length 5 --pretend 10
Here, the --window-width 0.08 and --window-overlap 0.04 options specify the FFT window width and overlap in seconds, respectively. With the --mel-spectrum 128 option, we indicate that 128 mel frequency bands should be extracted, and the --fixed-length 5 option indicates that we want to extract spectrograms from 5 seconds of audio. If samples are shorter than 5 seconds, they are padded with silence, and if they are longer, they are cut to length. Finally, the --pretend 10 option tells audeep to extract and plot a single spectrogram from the instance at index 10 in the data set. The ordering of instances is somewhat arbitrary, but deterministic between successive calls to audeep.
The command will open a window with a plot of the spectrogram, and an amplitude histogram showing the distribution of amplitudes on the dB scale (auDeep normalizes spectrograms to 0 dB). As you can see from these plots, the audio files in the ESC-10 data set contain quite a bit of background noise. Since this can confuse the representation learning algorithms, we recommend filtering some of this background noise by clipping amplitudes below a certain threshold. We recommend a threshold around -45 dB to -60 dB as a starting point, which can be specified using the --clip-below option.
> audeep preprocess --basedir ESC-10 --window-width 0.08 --window-overlap 0.04 --mel-spectrum 128 --fixed-length 5 --clip-below -60 --pretend 10
Once you are content with your parameter choices, initiate spectrogram extraction for the full data set by replacing the --pretend option with the --output option.
> audeep preprocess --basedir ESC-10 --window-width 0.08 --window-overlap 0.04 --mel-spectrum 128 --fixed-length 5 --clip-below -60 --output spectrograms/esc-10-0.08-0.04-128-60.nc
After the command finishes, the extracted spectrograms have been stored in netCDF 4 format in a file called spectrograms/esc-10-0.08-0.04-128-60.nc. Furthermore, since auDeep recognizes the ESC-10 data set, instance labels and the predefined cross-validation setup are stored alongside the spectrograms.
Next, we are going to train a recurrent sequence to sequence autoencoder on the spectrograms extracted in the previous step. There are numerous parameters that can be used to fine-tune autoencoder training. For the purposes of this guide, we use parameter choices that we have found to work well during our preliminary experiments. However, we encourage the user to experiment with different parameter values.
We are going to train an autoencoder with 2 recurrent layers (--num-layers 2) containing 256 GRU cells (--num-units 256; GRU is used by default) in the encoder and decoder. The encoder RNN is going to be unidirectional (the default setting), and the decoder RNN is going to be bidirectional (--bidirectional-decoder). Training is going to be performed for 64 epochs (--num-epochs 64) with learning rate 0.001 (--learning-rate 0.001) and 20% dropout (--keep-prob 0.8). The batch size during training has to be chosen depending on the amount of memory available, but 64 should be a good starting point (--batch-size 64).
> audeep t-rae train --input spectrograms/esc-10-0.08-0.04-128-60.nc --run-name output/esc-10-0.08-0.04-128-60/t-2x256-x-b --num-epochs 64 --batch-size 64 --learning-rate 0.001 --keep-prob 0.8 --num-layers 2 --num-units 256 --bidirectional-decoder
The --input option specifies the spectrogram file which contains the training data, which in our case is the spectrogram file generated during the previous step. The --run-name option specifies a directory for the training run, in which models and logging information are stored.
If desired, training progress can be tracked in a web browser using TensorBoard, the visualization tool shipped with TensorFlow. Open a new console in the same working directory you used up to now, and execute
> tensorboard --reload_interval 2 --logdir output/esc-10-0.08-0.04-128-60/t-2x256-x-b/logs
Once TensorBoard has started, navigate to localhost:6006 in your web browser.
After training has finished, the trained autoencoder can be used to generate features from spectrograms. Assuming that you used the file and directory names suggested above, execute the following command.
> audeep t-rae generate --model-dir output/esc-10-0.08-0.04-128-60/t-2x256-x-b/logs --input spectrograms/esc-10-0.08-0.04-128-60.nc --output output/esc-10-0.08-0.04-128-60/representations.nc
The --model-dir option specifies the directory containing TensorFlow checkpoints for the trained autoencoder, which is usually the logs subdirectory of the directory passed to the --run-name option during autoencoder training. The --input option specifies the spectrogram file for which we wish to generate features, and the --output option specifies a file in which to store the generated features.
The command will extract the learned hidden representation of each spectrogram as its feature vector, and store these features in the output file. Additionally, as the instance labels and cross-validation setup of the ESC-10 data set have been saved in the spectrogram file, they will be stored together with the generated features as well.
Since instance labels and a cross-validation setup have been stored, we can now use them to evaluate a simple classifier on the learned representations. We are going to use the built-in multilayer perceptron (MLP) for classification, with 2 hidden layers (--num-layers 2) and 150 hidden units per layer (--num-units 150). Training is going to be performed for 400 epochs (--num-epochs 400) with learning rate 0.001 (--learning-rate 0.001) and 40% dropout (--keep-prob 0.6). No batching is used during MLP training.
> audeep mlp evaluate --input output/esc-10-0.08-0.04-128-60/representations.nc --cross-validate --shuffle --num-epochs 400 --learning-rate 0.001 --keep-prob 0.6 --num-layers 2 --num-units 150
The --input option points to a file containing generated features, and the --cross-validate option tells the command to perform cross-validated evaluation using the setup stored in that file. The --shuffle option specifies that the training data should be shuffled between training epochs, which can improve generalization of the network.
The command will print the classification accuracy on each cross-validation fold, as well as the average classification accuracy and a confusion matrix. If you followed the parameter choices suggested in this guide exactly, accuracy should be around 80%, with some variation due to random effects.
Optionally, the learned representations can be exported to CSV or ARFF for further processing, such as classification with an alternate algorithm.
> audeep export --input output/esc-10-0.08-0.04-128-60/representations.nc --output output/esc-10-0.08-0.04-128-60/csv --format csv
The --input option points to a file containing generated features, and the --output option points to a directory in which the exported features should be stored. The --format option specifies that the features should be exported as CSV files. The command will create one directory for each cross-validation fold below the output directory, and store the features of the instances in a fold below the respective fold directory.
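For example, the exported per-fold CSV files could be loaded with pandas and classified with a scikit-learn SVM roughly as follows. This is a sketch only: the metadata column names are assumed to match the defaults used by the audeep import command (filename, chunk_nr, label_nominal, label_numeric), so check the exported files and adjust if necessary.

```python
# Sketch: leave-one-fold-out classification of the exported features with an
# external classifier. Metadata column names are assumptions (see above).
import glob
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

meta_cols = ["filename", "chunk_nr", "label_nominal", "label_numeric"]
fold_files = sorted(glob.glob("output/esc-10-0.08-0.04-128-60/csv/fold_*/*.csv"))
folds = [pd.read_csv(f) for f in fold_files]

test, train = folds[0], pd.concat(folds[1:])      # hold out the first fold
scaler = StandardScaler().fit(train.drop(columns=meta_cols))
clf = LinearSVC(C=1.0)
clf.fit(scaler.transform(train.drop(columns=meta_cols)), train["label_nominal"])
accuracy = clf.score(scaler.transform(test.drop(columns=meta_cols)),
                     test["label_nominal"])
print("accuracy on fold 1: %.3f" % accuracy)
```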
This project exposes a single command line interface, with several subcommands covering different use cases. Running the audeep executable without any command line arguments enters interactive mode, in which an arbitrary number of subcommands can be executed. Alternatively, subcommands and their arguments can be passed to the audeep executable as command line arguments. To see an overview of available options and commands, use audeep --help or audeep help <command>.
The following options can be used with all commands listed below.
Option | Default | Description |
---|---|---|
-v, --verbose | - | Increase the verbosity of output, i.e. print all log messages with the DEBUG level or higher. By default, log messages with the INFO level or higher are printed. |
-q, --quiet | - | Decrease the verbosity of output, i.e. print only log messages with the WARNING level or higher. By default, log messages with the INFO level or higher are printed. |
--log-file LOG_FILE | - | Log command output to the specified file. Disabled by default. |
-h, --help | - | Show the help message of the command and exit. |
--version | - | Show the version of the application and exit. |
The following command performs spectrogram extraction from raw audio files (step 1 above).
The audeep preprocess command extracts metadata, such as labels, cross-validation information, or partition information, from a data set on disk, and extracts spectrograms from the raw audio files. The extracted spectrograms, together with their metadata, are stored in netCDF 4 format. For a description of our data model, see Data Model.
Option | Default | Description |
---|---|---|
--basedir BASEDIR | required | The data set base directory |
--parser PARSER | audeep.backend.parsers.MetaParser | A parser class for the data set file structure. Parsers are used to detect audio files in a data set, and associate metadata such as labels with the audio files. For an overview of available parsers, and how to implement your own parser, see Parsers. For an overview of our data model, see Data Model. |
--output FILE | ./spectrograms.nc | The output filename. Extracted spectrograms together with their metadata are written in netCDF 4 format to FILE. No output is generated when the --pretend option is used. |
--window-width WIDTH | 0.04 | The width in seconds of the FFT windows used to extract spectrograms. |
--window-overlap WIDTH | 0.02 | The overlap in seconds between FFT windows used to extract spectrograms. |
--mel-spectrum FILTERS | - | Extract log-scale mel spectrograms using the specified number of filters. By default, log-scale power spectrograms are extracted. |
--chunk-length LENGTH | - | Split audio files into chunks of the specified length in seconds. Must be used together with the --chunk-count option. The audeep fuse chunks command can be used to combine chunks into a single instance, both before and after feature learning. |
--clip-above dB | - | Clip amplitudes above the specified decibel value. Spectrograms are normalized in such a way that the highest amplitude is 0 dB. |
--clip-below dB | - | Clip amplitudes below the specified decibel value. Spectrograms are normalized in such a way that the highest amplitude is 0 dB. |
--chunk-count COUNT | - | Split audio files into the specified number of chunks. Must be used together with the --chunk-length option. If there are more chunks than specified, the excess chunks are discarded. If there are fewer chunks than specified, an error is raised. |
--channels CHANNELS | MEAN | How to handle stereo audio files. Valid options are MEAN, which extracts spectrograms from the mean of the two channels, LEFT and RIGHT, which extract spectrograms from the left and right channel, respectively, and DIFF, which extracts spectrograms from the difference of the two channels. |
--fixed-length LENGTH | - | Ensure that all samples have exactly the specified length in seconds, by cutting or padding audio appropriately. For samples that are longer than the specified length, only the first LENGTH seconds of audio are used. For samples that are shorter than the specified length, silence is appended at the end. The --center-fixed option can be used to modify this behavior. If chunking is used, it can be useful to set the fixed length slightly longer than the product of chunk length and chunk count, to avoid errors due to inconsistent chunk lengths (see the example below this table). |
--center-fixed | - | Only takes effect when --fixed-length is set. By default, samples are cut from the end, or padded with silence at the end. If this option is set, samples are cut equally at the beginning and end, or padded with silence equally at the beginning and end. |
--pretend INDEX | - | Do not process the entire data set. Instead, extract and plot a single spectrogram from the audio file at the specified index (as determined by the data set parser). |
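As an example of the interaction between chunking and --fixed-length, the following (purely illustrative) invocation splits each ESC-10 sample into four chunks of 1.25 seconds, and sets the fixed length slightly above 4 × 1.25 = 5 seconds to avoid errors due to inconsistent chunk lengths. The parameter values are not a tuned recommendation.

> audeep preprocess --basedir ESC-10 --window-width 0.08 --window-overlap 0.04 --mel-spectrum 128 --chunk-length 1.25 --chunk-count 4 --fixed-length 5.1 --clip-below -60 --output spectrograms/esc-10-chunked.nc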
The following commands are used to train a feature learning DNN on spectrograms (step 2 above).
The following options apply to all audeep ... train commands. Training progress can be monitored using TensorBoard. Assuming a training run with name NAME (see the --run-name option) has been started in the current working directory, TensorBoard can be started as follows.
> tensorboard --reload_interval 2 --logdir NAME/logs
The --reload_interval option for TensorBoard is not required, but results in more frequent updates of the displayed data.
Option | Default | Description |
---|---|---|
--batch-size BATCH_SIZE | 64 | The minibatch size for training |
--num-epochs NUM_EPOCHS | 10 | The number of training epochs |
--learning-rate RATE | 1e-3 | The learning rate for the Adam optimiser |
--run-name NAME | test-run | A directory for the training run. After each epoch, the autoencoder model is written to the logs subdirectory of this folder. The model consists of several files, which all start with model-<STEP>, where STEP is the training iteration, i.e. epoch * number of batches. Furthermore, diagnostic information about the training run is written to the logs subdirectory, which can be viewed using TensorBoard. |
--input INPUT_FILES... | required | One or more netCDF 4 files containing spectrograms and metadata, structured according to our data model. If more than one data file is specified, all spectrograms must have the same shape. Each data file is converted to TFRecords format and written to a temporary directory internally (see the --tempdir option). |
--tempdir DIR | System temp directory | Directory where temporary files can be written. If specified, this directory must not exist, i.e. it is created by the application, and is deleted after training. Disk space requirements are roughly the same as the input data files. |
--continue | - | If set, training is continued from the latest checkpoint stored under the directory specified by the --run-name option. If no checkpoint is found, the command will fail. If not set, a new training run is started, and the command will fail if there are previous checkpoints. |
Train a time-recurrent autoencoder (audeep t-rae train) on spectrograms, or other two-dimensional data. A time-recurrent autoencoder processes spectrograms sequentially along the time axis, processing the instantaneous frequency vectors at each time step. The time-recurrent autoencoder learns to represent entire spectrograms of possibly varying length by a hidden representation vector with fixed dimensionality.
Option | Default | Description |
---|---|---|
--num-layers LAYERS | 1 | Number of recurrent layers in the encoder and decoder |
--num-units UNITS | 16 | Number of RNN cells in each recurrent layer |
--bidirectional-encoder | - | Use a bidirectional encoder |
--bidirectional-decoder | - | Use a bidirectional decoder |
--keep-prob P | 0.8 | Keep neural network activations with the specified probability, i.e. apply dropout with probability 1-P |
--encoder-noise P | 0.0 | Corrupt encoder inputs with probability P. If P is greater than zero, each time step is set to zero with probability P |
--feed-previous-prob P | 0.0 | By default, the expected output of the decoder at the previous time step is fed as the input at the current time step. If P is greater than zero, the actual decoder output is fed instead of the expected output with probability P. |
--cell | GRU | The type of RNN cell. Valid choices are GRU and LSTM |
Train a frequency-recurrent autoencoder (audeep f-rae train) on spectrograms, or other two-dimensional data. A frequency-recurrent autoencoder processes spectrograms sequentially along the frequency axis. It does not process multiple steps along the time axis simultaneously, i.e. each instantaneous frequency vector is treated as an individual example. The frequency-recurrent autoencoder learns to represent the instantaneous frequency vectors by a hidden representation vector. Thus, it can be used as an additional preprocessing step before training a time-recurrent autoencoder.
Option | Default | Description |
---|---|---|
--freq-window-width WIDTH | 32 | Split the input frequency vectors into windows with width WIDTH and overlap specified by the --freq-window-overlap option |
--freq-window-overlap OVERLAP | 24 | Split the input frequency vectors into windows with width specified by the --freq-window-width option and overlap OVERLAP |
--num-layers LAYERS | 1 | Number of recurrent layers in the encoder and decoder |
--num-units UNITS | 16 | Number of RNN cells in each recurrent layer |
--bidirectional-encoder | - | Use a bidirectional encoder |
--bidirectional-decoder | - | Use a bidirectional decoder |
--keep-prob P | 0.8 | Keep neural network activations with the specified probability, i.e. apply dropout with probability 1-P |
--encoder-noise P | 0.0 | Corrupt encoder inputs with probability P. If P is greater than zero, each time step is set to zero with probability P |
--feed-previous-prob P | 0.0 | By default, the expected output of the decoder at the previous time step is fed as the input at the current time step. If P is greater than zero, the actual decoder output is fed instead of the expected output with probability P. |
--cell | GRU | The type of RNN cell. Valid choices are GRU and LSTM |
Train a frequency-time-recurrent autoencoder (audeep ft-rae train) on spectrograms, or other two-dimensional data. A frequency-time-recurrent autoencoder first passes spectrograms through a frequency-recurrent encoder. The hidden representation of the frequency-recurrent encoder is then used as the input of a time-recurrent encoder. The hidden representation of the time-recurrent encoder is used as the initial state of a time-recurrent decoder. Finally, the output of the time-recurrent decoder at each time step is used as the input of a frequency-recurrent decoder, which reconstructs the original input spectrogram.
The main difference to training a frequency-recurrent autoencoder with the audeep f-rae train command, followed by training a time-recurrent autoencoder on the learned features with the audeep t-rae train command, is that in the audeep ft-rae train command, a joint loss function is used.
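For reference, the sequential alternative corresponds roughly to the following three steps. This is a sketch only: run names and file names are placeholders, and we assume that feature generation for the frequency-recurrent autoencoder is invoked as audeep f-rae generate, by analogy with audeep t-rae generate.

> audeep f-rae train --input spectrograms/data.nc --run-name output/f-rae-run
> audeep f-rae generate --model-dir output/f-rae-run/logs --input spectrograms/data.nc --output output/f-rae-features.nc
> audeep t-rae train --input output/f-rae-features.nc --run-name output/t-rae-run

With audeep ft-rae train, the frequency-recurrent and time-recurrent parts are instead optimised together under the joint loss.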
Option | Default | Description |
---|---|---|
--freq-window-width WIDTH | 32 | Split the input frequency vectors into windows with width WIDTH and overlap specified by the --freq-window-overlap option |
--freq-window-overlap OVERLAP | 24 | Split the input frequency vectors into windows with width specified by the --freq-window-width option and overlap OVERLAP |
--num-f-layers LAYERS | 1 | Number of recurrent layers in the frequency-recurrent encoder and decoder |
--num-f-units UNITS | 64 | Number of RNN cells in each recurrent layer of the frequency-recurrent RNNs |
--num-t-layers LAYERS | 2 | Number of recurrent layers in the time-recurrent encoder and decoder |
--num-t-units UNITS | 128 | Number of RNN cells in each recurrent layer of the time-recurrent RNNs |
--bidirectional-f-encoder | - | Use a bidirectional frequency encoder |
--bidirectional-f-decoder | - | Use a bidirectional frequency decoder |
--bidirectional-t-encoder | - | Use a bidirectional time encoder |
--bidirectional-t-decoder | - | Use a bidirectional time decoder |
--keep-prob P | 0.8 | Keep neural network activations with the specified probability, i.e. apply dropout with probability 1-P |
--cell | GRU | The type of RNN cell. Valid choices are GRU and LSTM |
--f-encoder-noise P | 0.0 | Corrupt frequency encoder inputs with probability P. If P is greater than zero, each time step is set to zero with probability P |
--t-encoder-noise P | 0.0 | Corrupt time encoder inputs with probability P. If P is greater than zero, each time step is set to zero with probability P |
--f-feed-previous-prob P | 0.0 | By default, the expected output of the frequency decoder at the previous time step is fed as the input at the current time step. If P is greater than zero, the actual frequency decoder output is fed instead of the expected output with probability P. |
--t-feed-previous-prob P | 0.0 | By default, the expected output of the time decoder at the previous time step is fed as the input at the current time step. If P is greater than zero, the actual time decoder output is fed instead of the expected output with probability P. |
The following commands are used to generate features using a trained DNN (step 3 above).
The following options apply to all feature generation commands. A trained recurrent autoencoder is used to generate features from spectrograms, or other two-dimensional data.
Option | Default | Description |
---|---|---|
--batch-size BATCH_SIZE | 500 | The minibatch size. Typically, the minibatch size can be much larger for feature generation than for training. |
--model-dir MODEL_DIR | required | The directory containing trained models, i.e. the logs subdirectory of a training run. |
--steps STEPS... | latest training step | Use models at the specified training steps, and generate one set of features with each model. By default, one set of features using the most recent model is generated. |
--input INPUT_FILE | required | A netCDF 4 file containing spectrograms and metadata, structured according to our data model. |
--output OUTPUTS... | required | The output file names. One file name is required for each training step specified using the --steps option. |
Generate features using a trained time-recurrent autoencoder (audeep t-rae generate). The hidden representation learned by the autoencoder for each spectrogram is extracted as a one-dimensional feature vector for the respective instance. The resulting features for each instance, as well as the instance metadata, are once again stored in netCDF 4 format.
Generate features using a trained frequency-recurrent autoencoder. The hidden representation learned by the autoencoder for each instantaneous frequency vector is extracted as a one-dimensional feature vector for the respective spectrogram time step. As opposed to the other feature generation commands, this command generates two-dimensional output for each instance, i.e. it preserves the time axis. Since much more data is generated per instance, the default minibatch size is set to 64 for this command. The generated features for each instance, as well as the instance metadata, are once again stored in netCDF 4 format.
Generate features using a trained frequency-time-recurrent autoencoder. The hidden representation learned by the autoencoder for each spectrogram is extracted as a one-dimensional feature vector for the respective instance. The resulting features for each instance, as well as the instance metadata, are once again stored in netCDF 4 format.
While generated features can easily be exported into CSV/ARFF format for external processing (see below), the application provides the option to directly evaluate a set of generated features using a linear SVM or an MLP. We support evaluation using cross-validation based on predetermined folds, or evaluation using predetermined training, development, and test partitions. Prior to training, instances are shuffled and features are standardised using coefficients computed on the training data. Furthermore, predictions can be generated on unlabelled data and saved in CSV format.
The following options apply to both the audeep svm evaluate and the audeep mlp evaluate commands. Cross-validated evaluation and partitioned evaluation are mutually exclusive, i.e. the --cross-validate option must not be set if the --train-partitions and --eval-partitions options are set.
Option | Default | Description |
---|---|---|
--input INPUT_FILE | required | A netCDF 4 file containing one-dimensional feature vectors and associated metadata, structured according to our data model. |
--cross-validate | - | Perform cross-validated evaluation. Requires the input data to contain cross-validation information. |
--train-partitions PARTITIONS... | - | Perform partitioned evaluation, and train models on the specified partitions. Requires the input data to contain partition information, and requires the --eval-partitions option to be set. Valid partition identifiers are TRAIN, DEVEL, and TEST. |
--eval-partitions PARTITIONS... | - | Perform partitioned evaluation, and evaluate models on the specified partitions. Requires the input data to contain partition information, and requires the --train-partitions option to be set. Valid partition identifiers are TRAIN, DEVEL, and TEST. |
--repeat N | 1 | Repeat evaluation N times, and report mean results. |
--upsample | - | Upsample instances in the training partitions or splits, so that training occurs with balanced classes. |
--majority-vote | - | Use majority voting over individual chunks to determine the predictions for audio files. If each audio file has only one chunk, this option has no effect. |
The following options apply to both the audeep svm predict and the audeep mlp predict commands.
Option | Default | Description |
---|---|---|
--train-input TRAIN_FILE | required | A netCDF 4 file containing one-dimensional feature vectors and associated metadata to use as training data, structured according to our data model. Must be fully labelled. |
--train-partitions PARTITIONS... | - | Use only the specified partitions from the training data. If this option is set, the data set may contain unlabelled instances, but the specified partitions must be fully labelled. |
--eval-input EVAL_FILE | required | A netCDF 4 file containing one-dimensional feature vectors and associated metadata for which to generate predictions, structured according to our data model. Even if label information is present in this data set, it is not used. |
--eval-partitions PARTITIONS... | - | Use only the specified partitions from the evaluation data. |
--upsample | - | Upsample instances in the training data, so that training occurs with balanced classes. |
--majority-vote | - | Use majority voting over individual chunks to determine the predictions for audio files. If each audio file has only one chunk, this option has no effect. |
--output FILE | required | Print predictions in CSV format to the specified file. Each line of this file will contain the filename followed by a tab character, followed by the nominal predicted label for the filename. |
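The prediction file written by --output is easy to post-process; for instance, a minimal pandas sketch (assuming the file has no header row) could look like this.

```python
# Sketch: load the tab-separated predictions written by the predict commands.
import pandas as pd

predictions = pd.read_csv("predictions.csv", sep="\t", header=None,
                          names=["filename", "predicted_label"])
print(predictions.head())
```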
Evaluate a set of generated features, or predict labels on some data, using a linear SVM (audeep svm evaluate / audeep svm predict). Internally, this command uses the LibLINEAR backend of sklearn for training SVMs. In addition to the options listed above, this command accepts the following options.
Option | Default | Description |
---|---|---|
--complexity COMPLEXITY | required | The SVM complexity parameter |
Evaluate a set of generated features, or predict labels on some data, using an MLP with softmax output (audeep mlp evaluate / audeep mlp predict). Currently, the entire data set is copied to GPU memory, and no batching is performed. In addition to the options listed above, this command accepts the following options.
Option | Default | Description |
---|---|---|
--num-epochs NUM_EPOCHS | 400 | The number of training epochs |
--learning-rate RATE | 1e-3 | The learning rate for the Adam optimiser |
--num-layers NUM_LAYERS | 2 | The number of hidden layers |
--num-units NUM_UNITS | 150 | The number of units per hidden layer |
--keep-prob P | 0.6 | Keep neural network activations with the specified probability, i.e. apply dropout with probability 1-P |
--shuffle | - | Shuffle instances between each training epoch |
Export a data set into CSV or ARFF format (audeep export). Cross-validation and partition information are represented through the output directory structure, whereas filename, chunk number, and labels are stored within the CSV or ARFF files (for a description of these attributes, see Data Model).
Option | Default | Description |
---|---|---|
--input INPUT_FILE | required | A netCDF 4 file containing instances and associated metadata, structured according to our data model. |
--format FORMAT | required | The output format, one of CSV or ARFF |
--labels-last | - | By default, metadata is stored in the first four attributes / columns. If this option is set, filename and chunk number will be stored in the first two attributes / columns, and nominal and numeric labels will be stored in the last two attributes / columns. |
--name NAME | - | Name of the generated CSV / ARFF files. By default, the name of the input file is used. |
--output OUTPUT_DIR | required | The output base directory. |
If neither partition nor cross validation information is present in the data set, this command simply writes all instances to a single file with the same name as the input file to the output base directory. Please note that partial cross validation or partition information will be discarded.
If only partition information is present, the command writes the instances of each partition to subdirectories train, devel, and test below the output base directory. For each partition, a single file with the same name as the input file is written.
If only cross validation information is present, the command writes the instances of each fold to subdirectories fold_<INDEX> below the output base directory, where <INDEX> ranges from 1 to the number of folds. For each cross validation fold, a single file with the same name as the input file is written. Please note that in the case of overlapping folds, some instances will be duplicated.
If both partition and cross validation information is present, the command first creates subdirectories train, devel, and test below the output base directory, and then creates subdirectories fold_<INDEX> below the partition directories, where <INDEX> ranges from 1 to the number of folds. That is, cross validation folds are assumed to be specified per partition. For each partition and cross validation fold, a single file with the same name as the input file is written to the corresponding directory. Please note that in the case of overlapping folds, some instances will be duplicated.
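As an illustration, exporting a file named representations.nc that contains train and devel partitions with two cross validation folds each would result in roughly the following directory structure (file extension according to --format).

. Output base directory
├─── train
| ├─── fold_1
| | └─── representations.csv
| └─── fold_2
|   └─── representations.csv
└─── devel
  ├─── fold_1
  | └─── representations.csv
  └─── fold_2
    └─── representations.csv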
Import a data set from CSV or ARFF files into netCDF 4. A data set without partition or cross-validation information can be imported if the import base directory contains no directories, and a single CSV or ARFF file with the name passed to the command. Partitioned data can be imported if the base directory contains one directory for each partition of the data set, each of which in turn contains a single CSV or ARFF file with the name passed to the command. Finally, data with cross-validation information can be imported if the base directory contains one directory for each fold, named fold_N, where N indicates the fold number. Each of the fold directories must once again contain a single CSV or ARFF file with the name passed to the command.
CSV or ARFF files may contain metadata columns/attributes specifying filename, chunk number, nominal label, and numeric label of instances. At least a nominal label is required currently. Any columns/attributes that are not recognized as metadata are treated as containing numeric features.
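As an illustration, a CSV file using the default attribute names (see the table below) might begin like this; the file and label names are hypothetical, and all remaining columns are treated as numeric features.

filename,chunk_nr,label_nominal,label_numeric,feature_0,feature_1,feature_2
dog_bark_01.wav,0,dog_bark,0,0.12,-1.30,0.88
dog_bark_01.wav,1,dog_bark,0,0.07,-1.11,0.92
rain_01.wav,0,rain,1,-0.53,0.24,1.05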
Option | Default | Description |
---|---|---|
--basedir BASEDIR | required | The base directory from which to import data. |
--name NAME | required | The name of the data set files to import. |
--filename-attribute NAME | filename | Name of the filename metadata column/attribute. |
--chunk-nr-attribute NAME | chunk_nr | Name of the chunk number metadata column/attribute. |
--label-nominal-attribute NAME | label_nominal | Name of the nominal label metadata column/attribute. |
--label-numeric-attribute NAME | label_numeric | Name of the numeric label metadata column/attribute. |
Balance classes in some partitions of a data set. Classes are balanced by repeating instances of classes which are underrepresented in comparison to others. Upsampling is performed in such a way that all classes have approximately the same number of instances within the specified partitions.
Option | Default | Description |
---|---|---|
--input INPUT_FILE | required | A netCDF 4 file containing instances and associated metadata, structured according to our data model. |
--partitions PARTITIONS... | - | One or more partitions in which classes should be balanced. Any partitions not specified here are left unchanged. If not set, the entire data set is upsampled. |
--output OUTPUT_FILE | required | The output filename. |
Modify data set metadata in various ways.
Option | Default | Description |
---|---|---|
--input INPUT_FILE | required | A netCDF 4 file containing instances and associated metadata, structured according to our data model. |
--output OUTPUT_FILE | required | The output filename. |
--add-cv-setup NUM_FOLDS | - | Randomly generate a cross-validation setup for the data set with NUM_FOLDS evenly-sized non-overlapping folds. |
--remove-cv-setup | - | Remove cross-validation information |
--add-partitioning PARTITIONS... | - | Randomly generate a partitioning setup for the data set with the specified evenly-sized partitions. |
--remove-partitioning | - | Remove partition information |
Combine several data sets by concatenation along a feature dimension. This command ensures data integrity, by refusing to fuse data sets that have different metadata for one or more instances.
Option | Default | Description |
---|---|---|
--input INPUT_FILES... | required | Two or more netCDF 4 files containing instances and associated metadata, structured according to our data model. |
--dimension DIMENSION | generated | The dimension to concatenate features along. Defaults to generated, which is the dimension name used when generating features. In order to view the feature dimensions of a data set, the audeep inspect command can be used. |
--output OUTPUT_FILE | required | The output filename. |
Combine audio files which have been split into chunks during spectrogram extraction into single instances (audeep fuse chunks), by concatenation along a feature dimension.
Option | Default | Description |
---|---|---|
--input INPUT_FILE | required | A netCDF 4 file containing instances and associated metadata, structured according to our data model. |
--dimension DIMENSION | generated | The dimension to concatenate features along. Defaults to generated, which is the dimension name used when generating features. In order to view the feature dimensions of a data set, the audeep inspect command can be used. |
--output OUTPUT_FILE | required | The output filename. |
Display information about the audio files in a data set that has not yet been converted to netCDF 4 format (audeep inspect raw). This command displays information such as the minimum and maximum length of audio files in the data set, or the different sample rates of audio files in the data set.
Option | Default | Description |
---|---|---|
--basedir BASEDIR | required | The data set base directory |
--parser PARSER | audeep.backend.parsers.MetaParser | A parser class for the data set file structure. Parsers are used to detect audio files in a data set, and associate metadata such as labels with the audio files. For an overview of available parsers, and how to implement your own parser, see Parsers. For an overview of our data model, see Data Model. |
Display data set metadata stored in a netCDF 4 file structured according to our data model (audeep inspect).
Option | Default | Description |
---|---|---|
--input INPUT_FILE | required | A netCDF 4 file containing instances and associated metadata, structured according to our data model. |
--instance INDEX | - | Display information about the instance at the specified index, in addition to generic data set metadata |
--detailed-folds | - | Display detailed information about cross-validation folds, if present |
Validate data set integrity constraints. This functionality is provided by a separate command, since integrity constraint checking can be rather time-consuming. For a list of integrity constraints, see Data Model.
Option | Default | Description |
---|---|---|
--input INPUT_FILE | required | A netCDF 4 file containing instances and associated metadata, structured according to our data model. |
--detailed | - | Display detailed information about instances violating constraints |
Acoustic data sets are often structured in very different ways. For example, some data sets include metadata CSV files, while others may rely on a certain directory structure to specify metadata information, or even a combination of both. In order to be able to provide a unified feature learning and evaluation framework, we have developed a data model for data sets containing instances with arbitrary feature dimensions and metadata associated with individual instances. Furthermore, we provide a set of parsers for common data set structures, as well as the option to implement custom parsers (see Parsers).
Currently, our data model only supports classification tasks, but we plan on adding support for regression tasks in the near future.
Data sets store instances, which can correspond to either an entire audio file, or a chunk of an audio file. For each instance, the following attributes are stored.
Attribute (Variable Name) | Value Required | Dimensionality | Description |
---|---|---|---|
Filename (FILENAME) | yes | - | The name of the audio file from which the instance was extracted |
Chunk Number (CHUNK_NR) | yes | - | The index of the chunk which the instance represents. The filename and the chunk number attributes together uniquely identify instances. |
Nominal Label (LABEL_NOMINAL) | no | - | Nominal label of the instance. If specified, the numeric label must be specified as well. |
Numeric Label (LABEL_NUMERIC) | no | - | Numeric label of the instance. If specified, the nominal label must be specified as well. |
Cross validation folds (CV_FOLDS) | yes | number of folds | Specifies cross validation information. For each cross validation fold, this attribute stores whether the instance belongs to the training split (0) or the validation split (1). We have chosen to represent cross validation information in this way, since we have encountered data sets with overlapping cross validation folds, which cannot be represented by simply storing the fold number for each instance. Please note that, while this attribute is required to have a value, this value is allowed to have dimension zero, corresponding to no cross validation information. |
Partition (PARTITION) | no | - | The partition to which the instance belongs (0: training, 1: development, 2: test) |
Features (FEATURES) | yes | arbitrary | The feature matrix of the instance |
Furthermore, we optionally store a label map, which specifies a mapping of nominal label values to numeric label values. If a label map is given, labels are restricted to values in the map.
Internally, we rely on xarray to represent data sets.
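A quick way to look at a data set file outside of auDeep is to open it directly with xarray. The sketch below only assumes that the file is valid netCDF 4; the exact variable names can be read off from the printed summary.

```python
# Sketch: inspect an auDeep data set file with xarray.
import xarray as xr

data = xr.open_dataset("spectrograms/esc-10-0.08-0.04-128-60.nc")
print(data)                  # dimensions, coordinates and data variables
print(list(data.data_vars))  # should include the filename, label, cross
                             # validation, partition and feature variables
```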
The following integrity constraints must be satisfied by valid data sets.
- Instances with the same filename must have the same nominal labels
- Instances with the same filename must have the same numeric labels
- Instances with the same filename must have the same cross validation information
- Instances with the same filename must have the same partition information
- If a label map is given, all nominal labels must be keys in the map, and all numeric labels must be the associated values
- For each filename, there must be the same number of chunks
- For each filename, chunk numbers must be exactly [0, ..., num_chunks - 1], i.e. each chunk number must be present exactly once.
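The validation command described above performs these checks automatically. For illustration, the chunk-related constraints could be checked by hand roughly as follows, assuming the per-instance metadata has been collected into a pandas DataFrame with (illustrative) columns filename and chunk_nr.

```python
# Sketch: manual check of the chunk-number constraints; `meta` is an
# illustrative pandas DataFrame with one row per instance.
import pandas as pd

def chunk_numbers_valid(meta: pd.DataFrame) -> bool:
    chunk_counts = meta.groupby("filename")["chunk_nr"].count()
    if chunk_counts.nunique() != 1:     # same number of chunks per filename
        return False
    expected = set(range(int(chunk_counts.iloc[0])))
    # chunk numbers per filename must be exactly [0, ..., num_chunks - 1]
    return all(set(chunks) == expected
               for _, chunks in meta.groupby("filename")["chunk_nr"])
```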
Parsers read information about an acoustic data set, and generate a list of absolute paths to audio files and their metadata. We provide some parsers for common use cases, and custom parsers can easily be implemented by subclassing audeep.backend.parsers.base.Parser. A rough sketch of such a subclass is shown below.
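The method name, constructor behaviour, and return format used here are hypothetical placeholders; please consult the abstract methods actually declared in audeep.backend.parsers.base.Parser before implementing a parser.

```python
# Hypothetical sketch only: the real Parser base class defines its own abstract
# interface, which should be looked up in audeep.backend.parsers.base.
from pathlib import Path

from audeep.backend.parsers.base import Parser


class LabelDirectoryParser(Parser):
    """Treats each subdirectory of the base directory as one class label."""

    def parse(self):  # hypothetical method name and return format
        instances = []
        for wav_file in sorted(Path(self._basedir).rglob("*.wav")):  # _basedir assumed
            instances.append({
                "filename": str(wav_file),
                "label_nominal": wav_file.parent.name,  # directory name as label
            })
        return instances
```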
Simply reads all WAV files found below the data set base directory, without parsing any metadata beyond the filename. Please note that the built-in evaluation tools can not be used with data sets generated by this parser.
A meta-parser which intelligently decides which of the parsers listed below should be used to parse a data set. This parser is the default parser used by the audeep preprocess command. Please note that this parser explicitly does not include the NoMetadataParser, since that parser could parse any of the data sets the other parsers can process. If you want to use the NoMetadataParser, specify it explicitly using the --parser option of the audeep preprocess command.
Parses the development data set of the 2017 DCASE Acoustic Scene Classification challenge. This parser requires the following directory structure below the data set base directory.
. Data set base directory
├─── audio Directory containing audio files
| └─── ...
├─── evaluation_setup Directory containing cross validation metadata
| ├─── fold1_train.txt Training partition instances for fold 1
| ├─── fold2_train.txt Training partition instances for fold 2
| ├─── fold3_train.txt Training partition instances for fold 3
| ├─── fold4_train.txt Training partition instances for fold 4
| └─── ...
├─── meta.txt Global metadata file
└─── ...
Parses the ESC-10 and ESC-50 data sets. This parser requires that audio files are sorted into separate directories according to their class. Each class directory name must adhere to the regex ^\d{3} - .+, i.e. the name must start with three digits, followed by the string " - ", followed by the class name. Furthermore, audio file names must start with a digit, which is interpreted as the cross validation fold number.
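For instance, a directory layout matching this pattern might look as follows (class and file names are illustrative; the leading digit of each file name is interpreted as the fold number).

. Data set base directory
├─── 001 - Dog bark
| ├─── 1-dog-bark-example.ogg
| └─── ...
├─── 002 - Rain
| ├─── 2-rain-example.ogg
| └─── ...
└─── ...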
Parses the UrbanSound8K data set. This parser requires the following directory structure below the data set base directory.
. Data set base directory
├─── audio Directory containing audio files
| └─── ...
├─── metadata Directory containing metadata
| └─── UrbanSound8K.csv Metadata file
└─── ...
A general purpose parser for data sets which have training, development, and/or test partitions. This parser requires that there are one or more partition directories with the names train, devel, or test below the data set base directory, containing another layer of directories indicating the labels of the audio files. The latter directories may only contain WAV files. A simple example of a valid directory structure would be as follows.
. Data set base directory
├─── train Directory containing audio files from the training partition
| ├─── classA Directory containing WAV files with label 'classA'
| | └─── ...
| └─── classB Directory containing WAV files with label 'classB'
| └─── ...
└─── devel Directory containing audio files from the development partition
├─── classA Directory containing WAV files with label 'classA'
| └─── ...
└─── classB Directory containing WAV files with label 'classB'
└─── ...
A general purpose parser for data sets which have cross validation information. This parser requires that there are two or more directories with names fold_N, where N indicates the fold number, below the data set base directory. Each of these must contain another layer of directories indicating the labels of the audio files. The latter directories may only contain WAV files. A simple example of a valid directory structure would be as follows.
. Data set base directory
├─── fold_1 Directory containing audio files from fold 1
| ├─── classA Directory containing WAV files with label 'classA'
| | └─── ...
| └─── classB Directory containing WAV files with label 'classB'
| └─── ...
└─── fold_2 Directory containing audio files from fold 2
├─── classA Directory containing WAV files with label 'classA'
| └─── ...
└─── classB Directory containing WAV files with label 'classB'
└─── ...