# Explainable Audio Representation Learning

## Abstract

Audio classification of sounds “in the wild”, i.e., in auditory environments in which they would typically occur, remains a challenging yet relevant task. In this thesis, we propose a dual-stream CNN architecture followed by a Label Embeddings Projection (LEP) for audio classification. With these components, our network learns representations of the audio data while also harnessing semantic information from textual class label embeddings. The contributions of this thesis are twofold: First, we improve upon the state of the art in audio classification presented in Kazakos et al. (2021) with our addition of the Label Embeddings Projection. Second, to introduce explainability for our model, we also propose a gradient-based method for reconstructing the audio that the network finds to be most salient.
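
This README does not spell out the Label Embeddings Projection itself, so the following is only a minimal sketch of one plausible formulation: pooled audio features are linearly projected into the space of fixed text label embeddings, and each class is scored by a dot product with its embedding. All names, dimensions, and the scoring rule below are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn


class LabelEmbeddingProjection(nn.Module):
    """Toy sketch of a label-embeddings projection head (illustrative only).

    Pooled audio features are linearly projected into the space of fixed
    text label embeddings, and each class is scored by a dot product with
    its embedding. Dimensions and the scoring rule are assumptions.
    """

    def __init__(self, feature_dim: int, label_embeddings: torch.Tensor):
        super().__init__()
        # label_embeddings: (num_classes, embed_dim), e.g. word vectors of the class names
        self.register_buffer("label_embeddings", label_embeddings)
        self.proj = nn.Linear(feature_dim, label_embeddings.shape[1])

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, feature_dim), pooled output of the dual-stream CNN
        projected = self.proj(audio_features)           # (batch, embed_dim)
        return projected @ self.label_embeddings.t()    # (batch, num_classes) class logits


# Made-up sizes: 2304-d fused audio features, 310 classes, 300-d label vectors
lep = LabelEmbeddingProjection(feature_dim=2304,
                               label_embeddings=torch.randn(310, 300))
logits = lep(torch.randn(8, 2304))                      # -> shape (8, 310)
```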

## Preparation

- Requirements:
  - PyTorch 1.7.1
  - librosa: `conda install -c conda-forge librosa`
  - h5py: `conda install h5py`
  - wandb: `pip install wandb`
  - fvcore: `pip install 'git+https://github.com/facebookresearch/fvcore'`
  - simplejson: `pip install simplejson`
  - psutil: `pip install psutil`
  - tensorboard: `pip install tensorboard`
- Add this repository to `$PYTHONPATH`:

  ```bash
  export PYTHONPATH=/path/to/auditory-slow-fast/slowfast:$PYTHONPATH
  ```
- VGG-Sound:
  1. Download the audio. For instructions see here.
  2. Download train.pkl (link) and test.pkl (link). I converted the original train.csv and test.csv (found here) to pickle files with column names for easier use; a minimal conversion sketch is shown after this list.
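
If you want to regenerate the pickle files instead of downloading them, a conversion along the following lines should work. The column names are assumptions made for illustration; the provided train.pkl/test.pkl define the authoritative ones.

```python
import pandas as pd

# Column names are assumed for illustration -- the original csv files ship
# without a header row, so check against the provided train.pkl/test.pkl.
columns = ["video_id", "start_seconds", "label", "split"]

for split in ("train", "test"):
    df = pd.read_csv(f"{split}.csv", header=None, names=columns)
    df.to_pickle(f"{split}.pkl")
```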

## Training/validation on VGG-Sound

To train the model, run:

```bash
python tools/run_net.py --cfg configs/VGG-Sound/SLOWFAST_R50.yaml NUM_GPUS num_gpus \
  OUTPUT_DIR /path/to/output_dir VGGSOUND.AUDIO_DATA_DIR /path/to/dataset \
  VGGSOUND.ANNOTATIONS_DIR /path/to/annotations
```

To validate the model, run:

```bash
python tools/run_net.py --cfg configs/VGG-Sound/SLOWFAST_R50.yaml NUM_GPUS num_gpus \
  OUTPUT_DIR /path/to/experiment_dir VGGSOUND.AUDIO_DATA_DIR /path/to/dataset \
  VGGSOUND.ANNOTATIONS_DIR /path/to/annotations TRAIN.ENABLE False TEST.ENABLE True \
  TEST.CHECKPOINT_FILE_PATH /path/to/experiment_dir/checkpoints/checkpoint_best.pyth
```

## Citation and special thanks

A base framework for the dual-stream CNN portion of the model architecture, as well as some of the documentation above, was created by:

Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen. *Slow-Fast Auditory Streams for Audio Recognition*. ICASSP, 2021.

Project's webpage

arXiv paper

## License

The code is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, found here.