CS180: Final Project
Neural Radiance Fields
Part 1: Fit a Neural Field to a 2D Image
We use Part 1 as a stepping stone to Part 2. The goal in Part 1 is to create a neural field that can represent a 2D image. The neural field (a Multilayer Perceptron (MLP) network with Sinusoidal Positional Encoding (PE)) takes in 2D pixel coordinates and outputs the RGB color at each coordinate. To train the model, I modified the network's hyperparameters: I increased the hidden layer size to 1024 and the highest frequency level of the sinusoidal positional encoding to 20. I trained the model for 3000 iterations with a learning rate of 0.001.
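For reference, a minimal sketch of this kind of coordinate MLP (the class and function names here are illustrative, not my exact code):

```python
import torch
import torch.nn as nn

def positional_encoding(x, L):
    """Sinusoidal PE: [x, sin(2^0*pi*x), cos(2^0*pi*x), ..., sin(2^(L-1)*pi*x), cos(2^(L-1)*pi*x)]."""
    out = [x]
    for i in range(L):
        out.append(torch.sin(2.0 ** i * torch.pi * x))
        out.append(torch.cos(2.0 ** i * torch.pi * x))
    return torch.cat(out, dim=-1)

class NeuralField2D(nn.Module):
    """MLP that maps PE(2D pixel coords in [0, 1]) -> RGB in [0, 1]."""
    def __init__(self, hidden=1024, L=20):
        super().__init__()
        self.L = L
        in_dim = 2 + 2 * 2 * L  # 2 raw coords + sin/cos for each frequency and coordinate
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, coords):
        return self.net(positional_encoding(coords, self.L))
```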
For each image, I've included a plot of training PSNR across iterations and a visualization of the training process. I also include additional experiments showing that increasing the model size improved results, while increasing the highest positional-encoding frequency level did not change results much.
From top to bottom: [Hidden Layer Size = 128, L = 10], [Hidden Layer Size = 1024, L = 10], [Hidden Layer Size = 128, L = 20]
fox.jpg
dog.jpg
Part 2: Fit a Neural Radiance Field from Multi-View Images
Building on Part 1, we can now use a neural radiance field to represent a 3D scene by inverse rendering calibrated multi-view images. Much of this part follows the techniques from the NeRF paper.
Part 2.1: Create Rays from Cameras
I implemented three functions in this part, all supporting batched coordinates for future use.
The first function (Camera to World Coordinate Conversion) transforms a point from camera space to world space by appending a fourth homogeneous coordinate of 1 to the camera coordinates and multiplying by the camera-to-world transformation matrix, as follows:
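In homogeneous coordinates, with $\mathbf{X}_c$ the point in camera space and $\mathbf{X}_w$ the same point in world space:

$$
\begin{bmatrix} \mathbf{X}_w \\ 1 \end{bmatrix} = \texttt{c2w} \begin{bmatrix} \mathbf{X}_c \\ 1 \end{bmatrix}
$$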
The second function (Pixel to Camera Coordinate Conversion) converts 2D pixel coordinates to 3D points in camera space by constructing the intrinsic matrix $\mathbf{K}$ from the focal lengths $(f_x, f_y)$ and the principal point $(o_x, o_y)$ and calculating the following:
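With $(u, v)$ the pixel coordinates and $s = z_c$ the depth of the point along the camera's optical axis:

$$
\mathbf{K} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix},
\qquad
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K}\,\mathbf{X}_c
\;\;\Longrightarrow\;\;
\mathbf{X}_c = s\,\mathbf{K}^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
$$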
The third function (Pixel to Ray) generates the ray origin and ray direction for 2D pixel coordinates by computing the world-to-camera matrix and using the inverse of its upper-left 3x3 rotation block $\mathbf{R}$, together with its translation column $\mathbf{t}$, to calculate the following:
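Letting $\mathbf{X}_w$ be the world-space point obtained by applying the previous two conversions to the pixel at an arbitrary depth (e.g. $s = 1$), the ray is:

$$
\mathbf{r}_o = -\mathbf{R}_{3\times 3}^{-1}\,\mathbf{t},
\qquad
\mathbf{r}_d = \frac{\mathbf{X}_w - \mathbf{r}_o}{\lVert \mathbf{X}_w - \mathbf{r}_o \rVert_2}
$$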
Part 2.2: Sampling
I implemented two functions in this part and reduced runtime by vectorizing the second one.
The first function (Sampling Rays from Images) is a part of Part 2.3’s dataloader class and converts the pixel coordinates into ray origins and directions. I sample pixels globally from all images, account for the offset from image coordinate to pixel center by adding 0.5, and convert them to rays using Part 2.1’s Pixel to Ray function.
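A rough sketch of this global sampling step (the function name and the exact signature of the Part 2.1 pixel_to_ray helper are illustrative):

```python
import torch

def sample_rays_global(images, c2ws, K, num_rays):
    """Sample `num_rays` pixels uniformly across all training images and return
    their rays plus ground-truth colors. images: (N, H, W, 3), c2ws: (N, 4, 4).
    pixel_to_ray is the batched Part 2.1 function (assumed signature)."""
    N, H, W, _ = images.shape
    # Global flat indices over every pixel of every image.
    idx = torch.randint(0, N * H * W, (num_rays,))
    img_idx = idx // (H * W)
    v = (idx % (H * W)) // W          # row
    u = idx % W                       # column
    # Offset from image coordinates to pixel centers before generating rays.
    uv = torch.stack([u + 0.5, v + 0.5], dim=-1)
    rays_o, rays_d = pixel_to_ray(K, c2ws[img_idx], uv)
    colors = images[img_idx, v, u]    # supervision targets
    return rays_o, rays_d, colors
```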
The second function (Sampling Points along Rays) discretizes each ray into sample points in 3D space: I add the ray direction, scaled by a set of distances, to the ray origin. When perturbation=True, I perturb the sample distances within their intervals so that training touches every location along the ray.
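A minimal sketch of this sampling, assuming uniform bins between illustrative near/far bounds:

```python
import torch

def sample_along_rays(rays_o, rays_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Discretize each ray r(t) = o + t*d into n_samples 3D points.
    rays_o, rays_d: (B, 3). Returns points (B, n_samples, 3) and depths t (B, n_samples)."""
    B = rays_o.shape[0]
    t = torch.linspace(near, far, n_samples).expand(B, n_samples)
    if perturb:
        # Jitter each sample within its bin so training sees every depth over time.
        bin_width = (far - near) / n_samples
        t = t + torch.rand_like(t) * bin_width
    points = rays_o[:, None, :] + t[..., None] * rays_d[:, None, :]   # (B, n_samples, 3)
    return points, t
```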
Part 2.3: Putting the Dataloading All Together
With my modified dataloader and using viser, here are some of my verification results (the right image shows 100 rays sampled globally across all images; the left image shows 100 rays sampled from a single image):
Part 2.4: Neural Radiance Field
I built the neural radiance field (NeRF) using the network structure below; given 3D coordinates (points sampled along each ray) and ray directions, it outputs densities and RGB values. The main additions to Part 1's structure are the intermediate re-injection of the encoded input (a skip connection) and the split of the model into two heads, one for density and one for RGB.
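A sketch of this kind of architecture (layer widths, depths, and names here are illustrative, following the NeRF paper's design rather than my exact code; with L=10 for positions and L=4 for directions, the encoded input sizes are 63 and 27):

```python
import torch
import torch.nn as nn

class NeRF(nn.Module):
    """MLP mapping (PE(x), PE(d)) -> (density, RGB), with a skip connection
    that re-injects PE(x) partway through the trunk."""
    def __init__(self, x_dim=63, d_dim=27, hidden=256):
        super().__init__()
        self.trunk1 = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Skip connection: concatenate PE(x) back in.
        self.trunk2 = nn.Sequential(
            nn.Linear(hidden + x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())  # sigma >= 0
        self.feature = nn.Linear(hidden, hidden)
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + d_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),                        # RGB in [0, 1]
        )

    def forward(self, x_pe, d_pe):
        h = self.trunk1(x_pe)
        h = self.trunk2(torch.cat([h, x_pe], dim=-1))
        sigma = self.density_head(h)
        feat = self.feature(h)
        rgb = self.rgb_head(torch.cat([feat, d_pe], dim=-1))
        return sigma, rgb
```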
Part 2.5: Volume Rendering
I implemented the volume rendering function volrend, which takes the NeRF's densities and RGB values along each ray and composites them into a final pixel color; the training loss then compares the rendered colors with the original pixel values. For the cumulative summation in the transmittance term, I used torch.cumsum and padded the densities with a 0 in front.
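This computes the discrete rendering equation from the NeRF paper, $\hat{C}(\mathbf{r}) = \sum_i T_i\,(1 - e^{-\sigma_i \delta_i})\,\mathbf{c}_i$ with $T_i = \exp\!\big(-\sum_{j<i} \sigma_j \delta_j\big)$. A sketch of the computation, assuming a uniform step size between samples and a (rays, samples, channels) layout:

```python
import torch

def volrend(sigmas, rgbs, step_size):
    """sigmas: (B, n_samples, 1), rgbs: (B, n_samples, 3), step_size: spacing delta."""
    alphas = 1.0 - torch.exp(-sigmas * step_size)                    # per-sample opacity
    # Transmittance uses densities of *previous* samples, hence the zero-padding in front.
    accum = torch.cumsum(sigmas * step_size, dim=1)
    accum = torch.cat([torch.zeros_like(accum[:, :1]), accum[:, :-1]], dim=1)
    T = torch.exp(-accum)                                            # transmittance T_i
    weights = T * alphas
    rendered = (weights * rgbs).sum(dim=1)                           # (B, 3) pixel colors
    return rendered
```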
I trained my model with the following hyperparameters: Adam with a learning rate of 1e-3, 3600 gradient-descent steps, sampling 1024 rays per step with 64 samples along each ray. Here are some intermediate training images as well as the validation set's MSE and PSNR. As shown, I was able to achieve a PSNR above 23!
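Tying the pieces above together, the core of such a training step looks roughly like this (encode_points and encode_dirs are hypothetical helpers that apply the positional encoding to the sample points and broadcast ray directions; images, c2ws, K, and the near/far bounds are the training data and assumed values):

```python
import torch

model = NeRF()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3600):
    rays_o, rays_d, target = sample_rays_global(images, c2ws, K, num_rays=1024)
    points, t = sample_along_rays(rays_o, rays_d, n_samples=64, perturb=True)
    # encode_points / encode_dirs: hypothetical positional-encoding helpers.
    sigmas, rgbs = model(encode_points(points), encode_dirs(rays_d, points.shape[1]))
    rendered = volrend(sigmas, rgbs, step_size=(6.0 - 2.0) / 64)
    loss = torch.nn.functional.mse_loss(rendered, target)
    psnr = -10.0 * torch.log10(loss)          # PSNR for colors in [0, 1]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```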
Bells and Whistles: Background Color
To render the video with a background color other than black, I modified my volrend function to multiply the transmittance remaining after the last sample by the background color passed into the function (in my case, red) and add it to the rendered pixel color.
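Concretely, the change amounts to one extra term at the end of the volrend sketch above:

```python
import torch

def volrend_with_bg(sigmas, rgbs, step_size, bg_color):
    """Same as volrend, but composites leftover transmittance onto a background color.
    bg_color: tensor of shape (3,), e.g. torch.tensor([1.0, 0.0, 0.0]) for red."""
    alphas = 1.0 - torch.exp(-sigmas * step_size)
    accum = torch.cumsum(sigmas * step_size, dim=1)
    accum = torch.cat([torch.zeros_like(accum[:, :1]), accum[:, :-1]], dim=1)
    T = torch.exp(-accum)
    weights = T * alphas
    rendered = (weights * rgbs).sum(dim=1)
    # Transmittance remaining after the last sample "sees through" to the background.
    T_final = T[:, -1] * (1.0 - alphas[:, -1])            # (B, 1)
    return rendered + T_final * bg_color
```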
Reflection
This project was definitely time-consuming and interesting. I learned a lot and the end results were great, but there is room to improve: I wasn't able to vectorize everything I wanted, so training took much longer than expected.