Explainable Audio Representation Learning

Abstract

Audio classification of sounds “in the wild”, i.e., in the auditory environments in which they would typically occur, remains a challenging yet relevant task. In this thesis, we propose a dual-stream CNN architecture followed by a Label Embeddings Projection (LEP) for audio classification. With these components, our network is able to model the audio data while also harnessing semantic information from textual class label embeddings. The contributions of this thesis are twofold: First, we improve upon the state of the art in audio classification presented in Kazakos et al. (2021) with our addition of the Label Embeddings Projection. Second, to introduce explainability for our model, we also propose a gradient-based method for reconstructing the audio that the network finds to be most salient.
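
The exact design of the LEP is described in the thesis itself; the following is only a rough, hypothetical PyTorch sketch of the general idea, i.e., projecting pooled audio features into the space of textual class-label embeddings and scoring each class by similarity. All names and dimensions below are illustrative assumptions, not this repository's API.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelEmbeddingsProjection(nn.Module):
    # Hypothetical sketch: map pooled audio features into the label-embedding
    # space and score each class by cosine similarity with its text embedding.
    def __init__(self, feat_dim, label_emb):
        super().__init__()
        self.proj = nn.Linear(feat_dim, label_emb.shape[1])
        # Pre-computed label embeddings (e.g. word vectors for the class names).
        self.register_buffer("label_emb", label_emb)

    def forward(self, audio_feats):
        z = F.normalize(self.proj(audio_feats), dim=-1)   # (batch, emb_dim)
        labels = F.normalize(self.label_emb, dim=-1)      # (num_classes, emb_dim)
        return z @ labels.t()                             # (batch, num_classes) logits

# Toy usage with random tensors; sizes are illustrative only.
label_emb = torch.randn(310, 300)                         # one embedding per class
lep = LabelEmbeddingsProjection(feat_dim=2304, label_emb=label_emb)
logits = lep(torch.randn(4, 2304))                        # (4, 310)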

Preparation

  • Requirements:
    • PyTorch 1.7.1
    • librosa: conda install -c conda-forge librosa
    • h5py: conda install h5py
    • wandb: pip install wandb
    • fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
    • simplejson: pip install simplejson
    • psutil: pip install psutil
    • tensorboard: pip install tensorboard
  • Add this repository to $PYTHONPATH.
export PYTHONPATH=/path/to/auditory-slow-fast/slowfast:$PYTHONPATH
  • VGG-Sound:
    1. Download the audio. For instructions, see here.
    2. Download train.pkl (link) and test.pkl (link). I converted the original train.csv and test.csv (found here) to pickle files with column names for easier use; a sketch of such a conversion is shown below.
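
The exact contents of train.pkl and test.pkl are not documented here; a rough sketch of how such a CSV-to-pickle conversion could look (using pandas, with assumed column names) is:

import pandas as pd

# Hypothetical conversion of the original VGG-Sound train/test CSVs into
# pickled DataFrames. The column names are assumptions and may differ from
# the columns actually used by train.pkl/test.pkl.
columns = ["video_id", "start_seconds", "label"]

for split in ("train", "test"):
    df = pd.read_csv(f"{split}.csv", header=None, names=columns)
    df.to_pickle(f"{split}.pkl")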

Training/validation on VGG-Sound

To train the model run:

python tools/run_net.py --cfg configs/VGG-Sound/SLOWFAST_R50.yaml NUM_GPUS num_gpus \
OUTPUT_DIR /path/to/output_dir VGGSOUND.AUDIO_DATA_DIR /path/to/dataset \
VGGSOUND.ANNOTATIONS_DIR /path/to/annotations

To validate the model run:

python tools/run_net.py --cfg configs/VGG-Sound/SLOWFAST_R50.yaml NUM_GPUS num_gpus \
OUTPUT_DIR /path/to/experiment_dir VGGSOUND.AUDIO_DATA_DIR /path/to/dataset \
VGGSOUND.ANNOTATIONS_DIR /path/to/annotations TRAIN.ENABLE False TEST.ENABLE True \
TEST.CHECKPOINT_FILE_PATH /path/to/experiment_dir/checkpoints/checkpoint_best.pyth
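
This README gives no command for the gradient-based saliency reconstruction mentioned in the abstract. As a minimal, generic sketch of input-gradient saliency on a spectrogram (an assumed illustration of the idea, not the thesis's actual reconstruction method or a script in this repository):

import torch

def input_saliency(model, spectrogram, target_class=None):
    # Generic input-gradient saliency sketch. `spectrogram` is assumed to be a
    # (1, C, F, T) tensor accepted by the trained model; the dual-stream model
    # in this repository may instead expect separate slow/fast inputs.
    model.eval()
    x = spectrogram.clone().requires_grad_(True)
    logits = model(x)
    if target_class is None:
        target_class = int(logits.argmax(dim=-1))
    logits[0, target_class].backward()
    # Large absolute gradients mark the time-frequency regions the network
    # finds most salient for the chosen class.
    return x.grad.abs()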

Citation and special thanks

A base framework for the dual-stream CNN portion of the model architecture, as well as some of the documentation above, was created by:

Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen, Slow-Fast Auditory Streams for Audio Recognition, ICASSP, 2021

Project's webpage

arXiv paper

License

The code is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, found here.
