Audio classification of sounds “in the wild”, i.e., in the auditory environments in which they typically occur, remains a challenging yet relevant task. In this thesis, we propose a dual-stream CNN architecture followed by a Label Embeddings Projection (LEP) for audio classification. With these components, our network is able to approximate audio data while also harnessing semantic information from textual class label embeddings. The contributions of this thesis are twofold. First, we improve upon the state of the art in audio classification presented by Kazakos et al. (2021) by adding the Label Embeddings Projection. Second, to make the model more explainable, we propose a gradient-based method for reconstructing the audio that the network finds most salient.
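The idea of projecting audio features into the space of textual label embeddings can be illustrated with a toy example. This is only a sketch under stated assumptions and not the thesis implementation: the function names `project`, `cosine` and `classify`, the linear projection, the cosine-similarity scoring, and all numbers below are hypothetical choices made for illustration.

```python
import math


def project(features, weights):
    """Linearly project an audio feature vector (assumed form of the projection).

    `weights` holds one row per dimension of the label-embedding space.
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(features, weights, label_embeddings):
    """Score every class by similarity between projected audio and its label embedding."""
    projected = project(features, weights)
    scores = {label: cosine(projected, emb) for label, emb in label_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: 3-dimensional audio features, 2-dimensional label embeddings.
toy_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_labels = {"dog": [1.0, 0.0], "rain": [0.0, 1.0]}
prediction, similarity = classify([0.9, 0.1, 0.5], toy_weights, toy_labels)
```

In a trained network the projection weights would be learned and the label embeddings would come from a text model; here both are fixed by hand so that the scoring step is easy to follow.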
- Requirements:
- Add this repository to $PYTHONPATH:
export PYTHONPATH=/path/to/auditory-slow-fast/slowfast:$PYTHONPATH
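To confirm that the export took effect, a small check along the following lines can help. It is a minimal sketch: the helper name `on_python_path` is hypothetical and not part of this repository, and it only inspects the PYTHONPATH variable rather than importing anything.

```python
import os


def on_python_path(repo_dir, env=None):
    """Return True if repo_dir is listed in the PYTHONPATH value of env.

    `env` defaults to the current process environment; passing a dict
    makes the helper easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    entries = env.get("PYTHONPATH", "").split(os.pathsep)
    target = os.path.normpath(repo_dir)
    # Ignore empty entries and compare normalised paths (e.g. trailing slashes).
    return any(os.path.normpath(entry) == target for entry in entries if entry)
```

For example, `on_python_path("/path/to/auditory-slow-fast/slowfast")` should return True in a shell where the export above has been run.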
- VGG-Sound:
To train the model, run:
python tools/ --cfg configs/VGG-Sound/SLOWFAST_R50.yaml NUM_GPUS num_gpus \
  OUTPUT_DIR /path/to/output_dir VGGSOUND.AUDIO_DATA_DIR /path/to/dataset \
  VGGSOUND.ANNOTATIONS_DIR /path/to/annotations
To validate the model, run:
python tools/ --cfg configs/VGG-Sound/SLOWFAST_R50.yaml NUM_GPUS num_gpus \
  OUTPUT_DIR /path/to/experiment_dir VGGSOUND.AUDIO_DATA_DIR /path/to/dataset \
  TEST.CHECKPOINT_FILE_PATH /path/to/experiment_dir/checkpoints/checkpoint_best.pyth
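The trailing KEY VALUE pairs in the commands above appear to override entries of the YAML file passed to --cfg, with dots separating nested sections. The sketch below illustrates that convention on a plain dictionary. It is an assumption about how the overrides are interpreted, not this repository's code: `apply_overrides` is a hypothetical helper, and unlike a real configuration library it leaves every value as a string.

```python
def apply_overrides(cfg, opts):
    """Merge alternating KEY VALUE tokens into a nested dict, in place.

    Dotted keys such as "VGGSOUND.AUDIO_DATA_DIR" address nested sections.
    Values are stored as given (strings); no type casting is attempted.
    """
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, value in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = value
    return cfg


# Toy defaults standing in for the values loaded from the YAML file.
config = {"NUM_GPUS": "1", "VGGSOUND": {"AUDIO_DATA_DIR": ""}}
apply_overrides(
    config,
    ["NUM_GPUS", "2", "VGGSOUND.AUDIO_DATA_DIR", "/path/to/dataset"],
)
```

After the call, `config["VGGSOUND"]["AUDIO_DATA_DIR"]` holds the path given on the command line, while keys that were not overridden keep their defaults.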
A base framework for the dual-stream CNN portion of the model architecture, as well as some of the documentation above, was created by:
Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen, "Slow-Fast Auditory Streams for Audio Recognition", ICASSP, 2021.
The code is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).