A basic understanding of NeRF’s workings through visual representations
Who should read this article?
This article aims to provide a basic, beginner-level understanding of how NeRF works through visual representations. While various blogs offer detailed explanations of NeRF, they are often geared toward readers with a strong technical background in volume rendering and 3D graphics. In contrast, this article explains NeRF with minimal prerequisite knowledge and includes an optional technical snippet at the end for curious readers. For those interested in the mathematical details behind NeRF, a list of further readings is provided at the end.
What is NeRF and How Does It Work?
NeRF, short for Neural Radiance Fields, was introduced in a 2020 paper as a novel method for rendering 2D images of 3D scenes. Traditional approaches rely on physics-based, computationally intensive techniques such as ray casting and ray tracing. These involve tracing a ray from each pixel of the 2D image back into the scene to find the scene particles that determine the pixel color. While these methods offer high accuracy (the rendered image can closely approximate what a camera or the human eye would see from the same angle), they are often slow and require significant computational resources, such as GPUs, for parallel processing. As a result, running these methods on edge devices with limited computing capabilities is often impractical.
NeRF addresses this issue by functioning as a scene compression method. It uses an overfitted multi-layer perceptron (MLP) to encode scene information, which can then be queried from any viewing direction to generate a 2D-rendered image. When properly trained, NeRF significantly reduces storage requirements; for example, a simple 3D scene can typically be compressed into about 5 MB of data.
At its core, NeRF answers the following question using an MLP:
What will I see if I view the scene from this direction?
This question is answered by giving the MLP a viewing direction (as two angles (θ, φ) or as a unit vector), together with the 3D positions sampled along that direction, as input. The MLP returns an RGB value (the color emitted in that direction) and a volume density, which are then combined through volumetric rendering to produce the final RGB value that the pixel sees. To create an image of a certain resolution (say H×W), this process is repeated for the viewing direction of each of the H×W pixels. Since the release of the first NeRF paper, numerous updates have been made to enhance rendering quality and speed. However, this blog will focus on the original NeRF paper.
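To make the volumetric rendering step a little more concrete, here is a minimal NumPy sketch of how colors and densities along one pixel's ray could be blended into a single pixel color. It assumes the MLP has already been queried at several points along the ray; the function name `composite_ray` and its arguments are illustrative and do not come from any particular NeRF codebase.

```python
import numpy as np

def composite_ray(rgb, sigma, t_vals):
    """Blend per-sample colors along one ray into a single pixel color.

    rgb:    (N, 3) colors predicted at N sample points along the ray
    sigma:  (N,)   volume densities predicted at the same points
    t_vals: (N,)   distances of the sample points from the camera center
    """
    # Distance between neighboring samples; the last interval is treated as very long.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity contributed by each sample: dense regions block more light.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of light that reaches each sample unblocked.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))
    # Each sample's contribution to the final pixel color.
    weights = trans * alpha
    pixel_rgb = (weights[:, None] * rgb).sum(axis=0)
    return pixel_rgb, weights
```

In this sketch, a sample with high density that sits close to the camera receives most of the weight, so anything behind it contributes little to the pixel. Empty space (zero density) contributes nothing.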
Step 1: Multi-view input images
NeRF needs multiple images taken from different viewing angles to compress a scene. The MLP learns to interpolate between these images to render unseen viewing directions (novel views). The viewing direction of each image is described by the camera’s intrinsic and extrinsic matrices. The more images there are, and the wider the range of viewing directions they span, the better the NeRF reconstruction of the scene. In short, the basic NeRF takes camera images and their associated intrinsic and extrinsic matrices as input. (You can learn more about the camera matrices in the blog below.)
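As a rough illustration of what the two camera matrices do, the sketch below builds a simple pinhole intrinsic matrix and uses it with an extrinsic rotation and translation to project a 3D point onto the image. The helper names, the square-pixel and centered-principal-point assumptions, and the convention that the camera looks along +z are choices made for this example only; real datasets and NeRF implementations may use different conventions.

```python
import numpy as np

def intrinsic_matrix(focal_px, width, height):
    """Pinhole intrinsics, assuming square pixels and a centered principal point."""
    return np.array([[focal_px, 0.0, width / 2.0],
                     [0.0, focal_px, height / 2.0],
                     [0.0, 0.0, 1.0]])

def project(point_world, K, R, t):
    """Map a 3D world point to pixel coordinates (u, v).

    R (3x3) and t (3,) form the extrinsic (world-to-camera) transform.
    """
    point_cam = R @ point_world + t   # world frame -> camera frame
    uvw = K @ point_cam               # camera frame -> image plane
    return uvw[:2] / uvw[2]           # perspective divide

K = intrinsic_matrix(focal_px=500.0, width=640, height=480)
R, t = np.eye(3), np.zeros(3)         # camera placed at the world origin
pixel = project(np.array([0.0, 0.0, 2.0]), K, R, t)
```

Here a point straight ahead of the camera lands at the image center. The intrinsic matrix describes the camera itself (focal length, image size), while the extrinsic part describes where the camera is and which way it faces.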
Steps 2 to 4: Sampling, Pixel iteration, and Ray casting
Each input image is processed independently (for the sake of simplicity). First, an image and its associated camera matrices are sampled from the input. For each pixel of that image, a ray is traced from the camera center through the pixel and extended outward. If the camera center is o and the viewing direction is the unit vector d, then the ray can be written as r(t) = o + td, where t is the distance of the point r(t) from the camera center.
Ray casting is done to identify the parts of the scene that contribute to the color of the pixel.
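The ray definition r(t) = o + td can be sketched in a few lines of NumPy. The example below generates one ray per pixel from a focal length and a 4x4 camera-to-world pose, then evaluates points along a ray. The function names, the centered principal point, and the +z viewing convention are assumptions for illustration and are not taken from the original NeRF code.

```python
import numpy as np

def get_rays(height, width, focal_px, cam_to_world):
    """Return ray origins o and unit directions d for every pixel, each (H, W, 3)."""
    # Pixel grid: i indexes columns (x), j indexes rows (y).
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    # Ray directions in the camera frame (assumed to look along +z).
    dirs_cam = np.stack([(i - width / 2.0) / focal_px,
                         (j - height / 2.0) / focal_px,
                         np.ones_like(i, dtype=float)], axis=-1)
    # Rotate the directions into the world frame and normalize to unit length.
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Every ray starts at the camera center (the translation part of the pose).
    origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape).copy()
    return origins, dirs_world

def points_along_ray(o, d, t_vals):
    """Evaluate r(t) = o + t * d for each distance t in t_vals."""
    return o + t_vals[:, None] * d

origins, dirs = get_rays(height=4, width=6, focal_px=100.0, cam_to_world=np.eye(4))
samples = points_along_ray(origins[2, 3], dirs[2, 3], np.array([1.0, 2.0]))
```

Because each direction d is normalized, t is the actual distance from the camera center, which matches the definition in the text. The points returned by `points_along_ray` are the candidate scene locations that may contribute to that pixel's color.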