From eab40ff5ba0cface0e0dc72de232e3a3bb784cb4 Mon Sep 17 00:00:00 2001 From: ShahVishrut <53956360+ShahVishrut@users.noreply.github.com> Date: Fri, 14 Jun 2024 01:18:47 -0700 Subject: [PATCH] Created using Colab --- docs/tutorials/DAVIS_Direction.ipynb | 724 +++++++++++++++++++++++++++ 1 file changed, 724 insertions(+) create mode 100644 docs/tutorials/DAVIS_Direction.ipynb diff --git a/docs/tutorials/DAVIS_Direction.ipynb b/docs/tutorials/DAVIS_Direction.ipynb new file mode 100644 index 00000000..74e73e18 --- /dev/null +++ b/docs/tutorials/DAVIS_Direction.ipynb @@ -0,0 +1,724 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Training an SNN Model Using DAVISDATA\n", + "#### By Vishrut Shah (vsshah@ucsc.edu)" + ], + "metadata": { + "id": "eQQM61WutIG2" + } + }, + { + "cell_type": "markdown", + "source": [ + "This tutorial shows how to work with the DAVIS camera dataset using Tonic as well as set up and train a spiking neural network model (snn) using snnTorch to output basic odometry data based on event camera data.\n", + "Runtime: 45-60 minutes (on GPU)" + ], + "metadata": { + "id": "2EAuNA5-N2BB" + } + }, + { + "cell_type": "markdown", + "source": [ + "# 1. Background" + ], + "metadata": { + "id": "dnktWEZCJ4se" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fBN6SwtC_K4Y" + }, + "source": [ + "https://tonic.readthedocs.io/en/latest/generated/tonic.datasets.DAVISDATA.html#\n", + "\n", + "http://rpg.ifi.uzh.ch/davis_data.html" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## 1.1 Event Cameras\n", + "An event camera consists of a set of pixels that record changes in brightness independently and asynchronously. 
In other words, each pixel stays inactive unless a change in brightness occurs at the location of the pixel, in which case that pixel logs its location, timestamp, and whether the brightness increased or decreased at that point. Because specific pixels activate at specific times, this data is well-suited for a Spiking Neural Network, which relies on neurons \"spiking\" or outputting at specific times.\n", + "\n", + "### 1.1.1 Frames\n", + "Throughout this tutorial, I will be using the word \"frames\" a lot. In the context of an event-based camera, a frame groups the events that fall within a fixed time interval. Each frame is essentially a 2D array, representing the grid of camera pixels. The number at each pixel in the 2D array represents the number of events (changes in brightness) at that pixel in that time interval. Oftentimes, such as in this tutorial, there are two 2D arrays (channels) for each frame, one for events registering an increase in brightness, and one for events registering a decrease." + ], + "metadata": { + "id": "qCg-OSqFIkbG" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 1.2 DAVISDATA\n", + "The DAVIS (Dynamic and Active-pixel Vision Sensor) event camera dataset consists of several recordings of different scenes. Each recording provides the following data across several timestamps:\n", + "* event camera logs\n", + "* grayscale images\n", + "* inertial measurements\n", + "* position and orientation of camera (ground truth)" + ], + "metadata": { + "id": "qxaTrkyJQk1E" + } + }, + { + "cell_type": "markdown", + "source": [ + "# 2. Installation and Preliminary Setup" + ], + "metadata": { + "id": "bIE6RkQyiKi5" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 2.1 Python Libraries\n", + "Install and import the necessary libraries." 
+ ], + "metadata": { + "id": "j9cDVRfuja1Z" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VY48601qjjXi" + }, + "outputs": [], + "source": [ + "!pip install snntorch --quiet\n", + "!pip install tonic --quiet" + ] + }, + { + "cell_type": "code", + "source": [ + "import snntorch as snn\n", + "from snntorch import surrogate\n", + "from snntorch import functional as SF\n", + "from snntorch import spikeplot as splt\n", + "from snntorch import utils\n", + "\n", + "import tonic\n", + "from tonic.datasets import DAVISDATA\n", + "\n", + "import torch\n", + "import torchvision\n", + "import torch.nn as nn\n", + "from torch.utils.data import DataLoader, random_split\n", + "\n", + "import numpy\n", + "from IPython.display import HTML\n", + "import statistics\n", + "import matplotlib.pyplot as plt" + ], + "metadata": { + "id": "dLU6pIRgsJor" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 2.2 Dataset Download\n", + "Next, we download the desired recordings using Tonic. The Tonic library greatly simplifies downloading and working with event-based datasets. For this task, we download the main recordings involving translational motion of the camera. This may take several minutes." + ], + "metadata": { + "id": "LTkuvNRGUQtI" + } + }, + { + "cell_type": "code", + "source": [ + "dataset = DAVISDATA(save_to='./data', recording=['shapes_translation', 'poster_translation', 'boxes_translation', 'dynamic_translation'])" + ], + "metadata": { + "id": "HTvyebTTO2eE" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 2.3 Visualization\n", + "Interested in seeing what you just downloaded? The code below allows for visualization of the recordings. Increase the time_window value if you would like to consume less memory. Running this snippet may cause a lot of warnings. Don't worry, this is expected." 
+ ], + "metadata": { + "id": "MNUAsckOUPEv" + } + }, + { + "cell_type": "code", + "source": [ + "which_recording = 0 # Replace with index of recording you want to visualize\n", + "\n", + "transform = tonic.transforms.ToFrame(\n", + " sensor_size = dataset.sensor_size,\n", + " time_window = 50000,\n", + ")\n", + "\n", + "frames = transform(dataset[which_recording][0][0])\n", + "\n", + "animation = tonic.utils.plot_animation(frames);\n", + "HTML(animation.to_jshtml())" + ], + "metadata": { + "id": "napNb-Hc7w72", + "collapsed": true + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here's an image of a single frame of the shapes_translation recording.\n", + "\n", + "\n", + "![download (1).png]()" + ], + "metadata": { + "id": "0MNrwGr7zFwS" + } + }, + { + "cell_type": "markdown", + "source": [ + "# 3. Formatting the Data" + ], + "metadata": { + "id": "KXuoyuSKycwQ" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 3.1 The Issue\n", + "The DAVIS dataset contains a LOT of data. Using each raw recording as a single data point would get highly complex and computationally expensive quite fast, both in terms of the high amount of data in the input, and also the requirement for the model to produce a complex output. For example, consider the target/label of the \"shapes_translation\" recording. We might expect a single number, or a vector of numbers. Let's try printing it out." + ], + "metadata": { + "id": "oPdQQRfAUh7N" + } + }, + { + "cell_type": "code", + "source": [ + "print(dataset[0][1])" + ], + "metadata": { + "id": "56P8RaO3lwxh" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "As we can see, we get a whole dictionary containing the positions and orientations of the camera across thousands of timestamps. Trying to get the model to output all this information from a recording would take a lot of computational resources and time." 
+ ], + "metadata": { + "id": "dOsY0TVuSwvb" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 3.2 Proposed Solution\n", + "\n", + "Instead, let's train our model on something more tractable. To do this, we rearrange the dataset in a custom format that preserves the relevant information and gives us inputs and desired outputs of a much smaller dimension. In this case, the proposed setup is as follows:\n", + "\n", + "From each recording, create several datapoints of the form (frames, direction).\n", + "\n", + "* frames = set of n consecutive frames taken from the recording\n", + "* direction = whether the camera was moving left or right while recording those n frames. Note: other directions are excluded because they are not mutually exclusive with left-right motion\n", + "\n", + "In other words, we have simplified this to a binary classification task.\n", + "\n", + "The input frames come from the event recordings. Where does the actual value for the direction come from (to compare to the model's output and perform gradient descent)? This will come from the dictionary we printed out earlier. We can find the x position of the camera at the timestamp of the first frame and compare that to the x position of the camera at the timestamp of the last frame. If the x position increased, the camera moved right; otherwise, it moved left." + ], + "metadata": { + "id": "t7MYQPhCZSC-" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 3.3 Implementation of Solution\n", + "The first function below is optional; use it if you would like to compress your frames to save memory." 
+ ], + "metadata": { + "id": "kycuFcFFGOS5" + } + }, + { + "cell_type": "code", + "source": [ + "# This function will only get called if the downsample flag is set to true\n", + "def downsample(frames, factor):\n", + "    num_frames, channels, height, width = frames.shape\n", + "\n", + "    # Calculate new dimensions\n", + "    new_height = height // factor\n", + "    new_width = width // factor\n", + "\n", + "    # Initialize the downsampled array\n", + "    downsampled_frames = numpy.zeros((num_frames, channels, new_height, new_width))\n", + "\n", + "    for frame_idx in range(num_frames): # for each frame / channel, divide the array into blocks and sum the values in each block, creating a new array reducing each block into a single value corresponding to the sum\n", + "        for channel in range(channels):\n", + "            for i in range(new_height):\n", + "                for j in range(new_width):\n", + "                    block = frames[frame_idx, channel, i*factor:(i+1)*factor, j*factor:(j+1)*factor]\n", + "                    block_sum = numpy.sum(block)\n", + "                    downsampled_frames[frame_idx, channel, i, j] = block_sum\n", + "\n", + "    return downsampled_frames.astype(numpy.int16)" + ], + "metadata": { + "collapsed": true, + "id": "MzmJ7XOax508" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "The code below iterates through each recording, creates frames, and computes the target values for each set of n frames." 
+ ], + "metadata": { + "id": "5oNhTOc0dX8F" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vr0EZpDSj8SU" + }, + "outputs": [], + "source": [ + "use_downsample = False # Set to True to compress the frames to a smaller dimension (named use_downsample so it does not shadow the downsample() function)\n", + "downsample_factor = 7 # How much to compress each dimension by\n", + "frames = list()\n", + "velocities = list()\n", + "\n", + "for i in range(len(dataset)): # Iterate through each recording\n", + "    groundtruth = dataset[i][1]\n", + "\n", + "    mean_diff = numpy.diff(groundtruth[\"ts\"]).mean() # Calculates the average time between consecutive measurements of the groundtruth\n", + "    time_window = 10 * mean_diff # Modify this based on how much time you want each frame to bin events into\n", + "    intervalLength = 25 # Modify this based on how many frames you want in each data point\n", + "\n", + "    transform = tonic.transforms.ToFrame(\n", + "        sensor_size = dataset.sensor_size,\n", + "        time_window = time_window,\n", + "    )\n", + "\n", + "    if use_downsample:\n", + "        frames.append(downsample(transform(dataset[i][0][0]), downsample_factor))\n", + "    else:\n", + "        frames.append(transform(dataset[i][0][0]))\n", + "\n", + "    velocities.append(list())\n", + "\n", + "    offset = time_window * intervalLength / mean_diff # How many motion capture measurements apart the first and last frames are\n", + "    start = 0\n", + "    increment = time_window / mean_diff\n", + "\n", + "\n", + "    while int(offset + start * increment) < len(groundtruth[\"point\"]):\n", + "        dx = groundtruth[\"point\"][int(offset + start * increment)][0] - groundtruth[\"point\"][int(start * increment)][0] # How much the camera moved across the batch of n frames\n", + "        velocities[i].append(0 if dx > 0 else 1) # If the change in x was positive, the camera moved right, otherwise it moved left\n", + "        start += 1" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## 3.4 Custom Dataset\n", + "The 
drawback of the ToFrame() function used above is that it puts all the frames in one list. We do not yet have sets of n consecutive frames. We only have their directions. We need to create several datapoints taking n consecutive frames from this list. One approach we could take is to create a new list that contains all these intervals. The problem with this is that copying this array takes a lot of memory, and we have limited RAM. Instead, we create a CustomDataset class. This formats and creates each data point only when it is needed, significantly saving memory." + ], + "metadata": { + "id": "_PI_g4hcI0-9" + } + }, + { + "cell_type": "code", + "source": [ + "import torch\n", + "from torch.utils.data import Dataset, DataLoader\n", + "\n", + "class CustomDataset(Dataset):\n", + "    def __init__(self, frames, velocities, interval_length):\n", + "        self.frames = frames\n", + "        self.velocities = velocities\n", + "        self.interval_length = interval_length\n", + "\n", + "    def __len__(self):\n", + "        length = 0\n", + "        for i in range(len(self.velocities)):\n", + "            length += len(self.velocities[i])\n", + "        return length\n", + "\n", + "    def __getitem__(self, idx): # Loads and formats each datapoint when necessary\n", + "        recording_num = 0\n", + "        while idx >= len(self.velocities[recording_num]):\n", + "            idx -= len(self.velocities[recording_num])\n", + "            recording_num += 1\n", + "        start_idx = idx\n", + "        end_idx = start_idx + self.interval_length\n", + "        frame_batch = self.frames[recording_num][start_idx:end_idx] # creates an interval of n frames\n", + "        velocity = self.velocities[recording_num][idx] # pairs the frames with the corresponding direction label for those frames\n", + "        return frame_batch, velocity\n", + "\n", + "# Assuming 'frames' and 'velocities' are your data and targets respectively\n", + "dataset = CustomDataset(frames, velocities, interval_length=intervalLength)" + ], + "metadata": { + "id": "nzQwGi8LQsBC" + }, + "execution_count": null, + "outputs": [] + }, + 
{ + "cell_type": "markdown", + "source": [ + "## 3.5 Dataloaders\n", + "Below, we randomly split the dataset into datapoints that will be used to train the model, and datapoints that will be used to test its accuracy." + ], + "metadata": { + "id": "awCLMc_Xfkch" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Se7vJByLUvS9" + }, + "outputs": [], + "source": [ + "# DataLoader parameters\n", + "batch_size = 32\n", + "shuffle = True\n", + "\n", + "train_size = int(0.7 * len(dataset))\n", + "test_size = len(dataset) - train_size\n", + "\n", + "\n", + "train_dataset, test_dataset = random_split(dataset, [train_size, test_size]) # splits the dataset into data used to train the model and data used to test its accuracy\n", + "\n", + "# Create DataLoader\n", + "trainloader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=tonic.collation.PadTensors(batch_first=False), shuffle=shuffle)\n", + "testloader = DataLoader(test_dataset, batch_size=batch_size, collate_fn=tonic.collation.PadTensors(batch_first=False), shuffle=shuffle)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# 4. Training the Model\n", + "The following code initializes the model and the functions that will be used to train it. This structure is adapted from the snnTorch Tutorials. Some things to note: The linear layer of the network has a high input count, due to the large dimensions of the DAVIS Camera frame. This value will have to be modified if you choose to compress the frames. The 2 outputs are for left and right." 
+ ], + "metadata": { + "id": "Q1q-39VjssOv" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 4.1 Model Structure" + ], + "metadata": { + "id": "NpuQRh16f-r4" + } + }, + { + "cell_type": "code", + "source": [ + "device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"mps\") if torch.backends.mps.is_available() else torch.device(\"cpu\")\n", + "\n", + "# neuron and simulation parameters\n", + "spike_grad = surrogate.atan()\n", + "beta = 0.95\n", + "\n", + "# Initialize Network\n", + "net = nn.Sequential(nn.Conv2d(2, 12, 5),\n", + "                    nn.MaxPool2d(2),\n", + "                    snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True),\n", + "                    nn.Conv2d(12, 32, 5),\n", + "                    nn.MaxPool2d(2),\n", + "                    snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True),\n", + "                    nn.Flatten(),\n", + "                    nn.Linear(76608, 2),\n", + "                    snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True, output=True)\n", + "                    ).to(device)" + ], + "metadata": { + "id": "sxEd_Z2TeLGz" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "def forward_pass(net, data):\n", + "    spk_rec = []\n", + "    utils.reset(net)\n", + "\n", + "    for step in range(data.size(0)):\n", + "        spk_out, mem_out = net(data[step])\n", + "        spk_rec.append(spk_out)\n", + "\n", + "    return torch.stack(spk_rec)" + ], + "metadata": { + "id": "4DBMfs8w9v_i" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "optimizer = torch.optim.Adam(net.parameters(), lr=1e-3, betas=(0.9, 0.999))\n", + "loss_fn = SF.mse_count_loss(correct_rate=0.7, incorrect_rate=0.3)" + ], + "metadata": { + "id": "WqBsUYyV90m1" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 4.2 Model Train\n", + "Buckle up, this is going to take a long time. Took about 30 minutes for me." 
+ ], + "metadata": { + "id": "bwAlFzMitcB6" + } + }, + { + "cell_type": "code", + "source": [ + "num_epochs = 25\n", + "\n", + "loss_hist = []\n", + "acc_hist = []\n", + "\n", + "for epoch in range(num_epochs):\n", + "    for i, (data, targets) in enumerate(iter(trainloader)):\n", + "        data = data.to(device)\n", + "        targets = targets.to(device)\n", + "\n", + "        net.train()\n", + "        spk_rec = forward_pass(net, data)\n", + "        loss_val = loss_fn(spk_rec, targets)\n", + "\n", + "        optimizer.zero_grad()\n", + "        loss_val.backward()\n", + "        optimizer.step()\n", + "\n", + "        loss_hist.append(loss_val.item())\n", + "\n", + "        print(f\"Epoch {epoch}, Iteration {i} \\nTrain Loss: {loss_val.item():.2f}\")\n", + "\n", + "        acc = SF.accuracy_rate(spk_rec, targets)\n", + "        acc_hist.append(acc)\n", + "        print(f\"Accuracy: {acc * 100:.2f}%\\n\")" + ], + "metadata": { + "id": "zVYrw5nF94Js", + "collapsed": true + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Graph the results of the training by running the code below." + ], + "metadata": { + "id": "acsjICA3gFsC" + } + }, + { + "cell_type": "code", + "source": [ + "fig = plt.figure(facecolor=\"w\")\n", + "plt.plot(acc_hist)\n", + "plt.title(\"Train Set Accuracy\")\n", + "plt.xlabel(\"Iteration\")\n", + "plt.ylabel(\"Accuracy\")\n", + "plt.show()" + ], + "metadata": { + "id": "le5xnYnRGeYu" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Hopefully, your graph is increasing over time and looks something like this. It may be better or worse than this depending on how the batches are randomly chosen. 
If it doesn't look like the accuracy got much better over the iterations, you can try re-randomizing the batches by re-running the code cell with the batch size and DataLoaders.\n", + "\n", + "![download.png]()" + ], + "metadata": { + "id": "xHAphjNYtqre" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 4.3 Model Test\n", + "Next, let's feed the neural network some data points that we didn't use for training." + ], + "metadata": { + "id": "ejlR1j0laCLl" + } + }, + { + "cell_type": "code", + "source": [ + "net.eval()\n", + "\n", + "batch_accuracy = []\n", + "\n", + "with torch.no_grad():\n", + "    for data, targets in testloader:\n", + "\n", + "        data = data.to(device)\n", + "        targets = targets.to(device)\n", + "\n", + "        spk_rec = forward_pass(net, data)\n", + "\n", + "        acc = SF.accuracy_rate(spk_rec, targets)\n", + "        batch_accuracy.append(acc)\n", + "\n", + "        print(f\"Accuracy: {acc * 100:.2f}%\\n\")\n", + "\n", + "print(\"The average accuracy across the testloader is:\", 100 * statistics.mean(batch_accuracy), \"%\")" + ], + "metadata": { + "id": "M0gdp_O7ihtu" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Graph the results of the testing by running the code below." + ], + "metadata": { + "id": "o72Gf_wrgQsS" + } + }, + { + "cell_type": "code", + "source": [ + "fig = plt.figure(facecolor=\"w\")\n", + "plt.plot(batch_accuracy)\n", + "plt.title(\"Test Set Accuracy\")\n", + "plt.xlabel(\"Iteration\")\n", + "plt.ylabel(\"Accuracy\")\n", + "plt.show()\n", + "print(\"The average accuracy across the testloader is:\", 100 * statistics.mean(batch_accuracy), \"%\")" + ], + "metadata": { + "id": "CG3SNH-tvzk_" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "![Screen Shot 2024-06-12 at 6.42.02 PM.png]()" + ], + "metadata": { + "id": "bWsn8aG9vHLF" + } + }, + { + "cell_type": "markdown", + "source": [ + "# 5. Results\n", + "That wraps up the tutorial! 
This was an example of how to use the DAVIS dataset to perform a binary classification task. Some final remarks..." + ], + "metadata": { + "id": "0RByuTn1zhs8" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 5.1 Improving on This Setup\n", + "\n", + "\n", + "\n", + "1. Currently, each set of frames is mapped to a single value of left or right. However, it is completely possible that there is an even split of left and right movement across some sets of frames. It's also possible that the camera moved up and down and barely moved left or right at all in certain intervals. One possible solution is to remove these datapoints.\n", + "2. For the purposes of this tutorial, the model is trained on 4 recordings. It is unclear whether the model's accuracy generalizes to other recordings, or whether the model has overfitted its parameters to these specific recordings.\n", + "3. The datapoints overlap: consecutive intervals of n frames share all but one frame, so the random train/test split can place near-duplicate samples in both sets, which may inflate the test accuracy.\n", + "\n", + "\n" + ], + "metadata": { + "id": "Bplaw540h3mG" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 5.2 Implications\n", + "\n", + "Hooray! We successfully trained a Spiking Neural Network to output direction using event camera data! This type of task has applications in Simultaneous Localization and Mapping (SLAM) algorithms, which track an agent's position and motion within an environment. Because traditional image data is large and computationally expensive to process, using event cameras and spiking neural networks could be more energy-efficient at scale." + ], + "metadata": { + "id": "Lp-RwJnYmuCY" + } + } + ], + "metadata": { + "colab": { + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file