This repository contains a series of Jupyter notebooks that guide you through data preparation for building a humpback whale vocalization model. The notebooks cover setting up the development environment, data acquisition, data revision, and data preprocessing.
This repository uses the labeled data of humpback whale vocalizations from Orcasound's AWS open data repository. The dataset was prepared by Emily Vierling. It includes ~9,000 labels and is based on ~YY hours of audio data from 3 days during October 03-28, 2021.
Humpback whales are known for their complex vocalizations. Understanding these vocalizations can provide valuable insights into their behavior, social structure, and even their emotional states. This project aims to facilitate the building of a machine learning model to predict and retrieve humpback whale vocalizations from raw audio files.
- Python 3.x
- IDE: Jupyter Notebook, JupyterLab, Visual Studio Code, a web IDE (e.g. Google Colaboratory), or any other environment that can run Jupyter notebooks
If you are using a local development environment, follow the steps below:
- Clone this repository:
  git clone https://github.com/LianaN/local_humpback_vocalization.git
- Navigate to the project directory:
  cd local_humpback_vocalization
- Create and activate a new Python virtual environment:
  python -m venv venv
  source venv/bin/activate
- Install the required packages:
  pip install -r requirements.txt
- Launch your preferred IDE to access the notebooks.
- Or, if you are using Linux, simply run:
  jupyter notebook notebooks/
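Before installing packages, it can help to confirm that the virtual environment from the steps above is actually active. The following is a minimal sketch using only the standard library. The helper names `in_virtual_env` and `python_version_ok` are illustrative and not part of this repository.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def python_version_ok(minimum=(3, 0)) -> bool:
    """Return True when the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= tuple(minimum)


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Inside a virtual environment:", in_virtual_env())
```

If the second line prints `False` after running the activation command, the packages from requirements.txt would be installed into the base Python installation instead of the project environment.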
If you are using Google Colaboratory as your web IDE, follow the instructions in notebooks/0_dev_environment_setup.ipynb to get started.
Note: Execute this notebook only if you are using Google Colaboratory as your development environment.
This notebook walks through setting up the development environment on Google Colaboratory. It includes instructions for installing the necessary packages and setting up Google Drive for data storage.
This notebook covers the steps required to acquire humpback whale vocalization data. It includes code for downloading the datasets (annotation and raw audio files).
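As a rough illustration of the download step, the sketch below fetches a file over a URL and skips the transfer when the file is already on disk. It is not the notebook's actual code: the notebook may use a different client (for example the AWS CLI), and the function name `download_if_missing` is hypothetical. No real dataset URL is assumed here; pass in the one given in the notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest_dir: str) -> Path:
    """Download `url` into `dest_dir` unless a file with that name already exists."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    target = dest / url.rstrip("/").rsplit("/", 1)[-1]
    if target.exists():
        return target  # already downloaded; skip the transfer
    # Write to a temporary name first so that an interrupted download
    # is not mistaken for a complete file on the next run.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target
```

Skipping existing files matters for raw audio, which tends to be large, because re-running the notebook then does not repeat downloads that already finished.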
This notebook provides starter code for the revision of the acquired data: visualizing audio waveforms, listening to audio samples, and identifying potential issues in the dataset.
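One simple way to spot potential issues is to check basic file properties before plotting or listening. The sketch below assumes the audio is stored as uncompressed PCM WAV, which may not hold for this dataset; other formats would need a different reader. The function names and the `expected_rate_hz` parameter are illustrative only.

```python
import wave


def summarize_wav(path: str) -> dict:
    """Return basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


def find_issues(summary: dict, expected_rate_hz: int) -> list:
    """List simple problems worth a closer look (assumed checks, not exhaustive)."""
    issues = []
    if summary["duration_s"] == 0:
        issues.append("empty file")
    if summary["sample_rate_hz"] != expected_rate_hz:
        issues.append("unexpected sample rate")
    return issues
```

Running `summarize_wav` over every downloaded file and comparing the results is a cheap first pass before the more detailed waveform inspection.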
This notebook focuses on extracting the humpback whale vocalizations from the raw audio data to prepare the data for machine learning.
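The extraction idea can be sketched as slicing the audio by annotated time intervals. The example below assumes each annotation gives a start time and a duration in seconds, and that the audio is already loaded as a one-dimensional sequence of samples. Both are assumptions about the data layout, not a description of the notebook's actual code, and `extract_segments` is a hypothetical name.

```python
def extract_segments(samples, sample_rate, annotations):
    """Cut labeled intervals out of a 1-D sequence of audio samples.

    `annotations` is an iterable of (start_s, duration_s) pairs in seconds.
    Intervals are clipped to the audio length, and intervals that fall
    entirely outside the audio are skipped.
    """
    segments = []
    total = len(samples)
    for start_s, duration_s in annotations:
        # Convert times in seconds to sample indices.
        start = max(0, int(round(start_s * sample_rate)))
        end = min(total, int(round((start_s + duration_s) * sample_rate)))
        if end > start:
            segments.append(samples[start:end])
    return segments


if __name__ == "__main__":
    audio = list(range(100))  # stand-in for 10 seconds of audio at 10 Hz
    clips = extract_segments(audio, 10, [(1.0, 2.0)])
    print(len(clips), len(clips[0]))
```

Clipping to the audio length keeps a label that runs past the end of a file from raising an error; whether such labels should be kept or dropped is a decision for the data revision step.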
Contributions are welcome! Please read CONTRIBUTING.md for details on how to contribute to this project.
For any questions or concerns, please open an issue or submit a pull request. Happy modeling!