Overview & Pipeline

This project recreates the full NeRF stack from data capture to novel view synthesis. I calibrated my phone with ArUco targets, solved for poses, and undistorted 39 shots of a LEGO minifigure and my Yoda figurine. A pair of PyTorch models then learn continuous radiance fields: first in 2D with an MLP on positionally encoded pixel coordinates, then in 3D with ray marching, volumetric rendering, and view-dependent colors.

Part 0: Calibrating the Camera & Building a Dataset

The calibration pipeline detects 4×4 ArUco targets, keeps all correspondences batched, and resizes the captures to 200×200 RGB crops before undistorting and packaging them with intrinsics/extrinsics. The flow mirrors the spec exactly: detect tags, accumulate object/image correspondences, calibrate the camera intrinsics, and only then solve for per-image poses.

0.1–0.2: Calibration and Capture

The calibration process loops through 30+ phone images, uses OpenCV's ArUco detector to find tags, stores only detections whose IDs are in the measured rig, and skips frames where markers fail to appear (a common issue noted in the spec). Each tag's corners are expressed in meters so the resulting intrinsics are ready for downstream pose estimation.
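To make that loop concrete, here is a minimal sketch of the detection-and-calibration pass. It assumes OpenCV's newer ArUco API (the `ArucoDetector` class, OpenCV ≥ 4.7), a `DICT_4X4_50` dictionary, and hypothetical `IMAGE_PATHS` / `TAG_CORNERS_WORLD` containers standing in for the project's actual capture list and measured rig layout.

```python
# Sketch of the detection + calibration loop (assumed names: IMAGE_PATHS,
# TAG_CORNERS_WORLD mapping marker id -> (4, 3) corner coordinates in meters).
import cv2
import numpy as np

aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(aruco_dict, cv2.aruco.DetectorParameters())

obj_points, img_points = [], []           # one (N, 3) / (N, 2) array per usable frame
image_size = None

for path in IMAGE_PATHS:                  # hypothetical list of capture filenames
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    image_size = gray.shape[::-1]         # (width, height) expected by calibrateCamera
    corners, ids, _ = detector.detectMarkers(gray)
    if ids is None:
        continue                          # skip frames where no marker is found
    obj, img = [], []
    for tag_corners, tag_id in zip(corners, ids.flatten()):
        if tag_id not in TAG_CORNERS_WORLD:
            continue                      # ignore tags that are not on the measured rig
        obj.append(TAG_CORNERS_WORLD[tag_id])   # (4, 3) corners in meters
        img.append(tag_corners.reshape(4, 2))   # (4, 2) detected pixel corners
    if obj:
        obj_points.append(np.concatenate(obj).astype(np.float32))
        img_points.append(np.concatenate(img).astype(np.float32))

# Intrinsics K and distortion coefficients from all batched correspondences.
rms, K, dist, _, _ = cv2.calibrateCamera(obj_points, img_points, image_size, None, None)
```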

0.3: Pose Estimation & Visualization

After calibrating I iterate over every capture, solve for the camera pose using Perspective-n-Point (PnP), invert the resulting extrinsic to a camera-to-world matrix, and (optionally) stream frustums to Viser for inspection.

PnP → extrinsics (OpenCV)

\[ \text{solvePnP} \Rightarrow (\mathbf{rvec}, \mathbf{t}) \Rightarrow \mathbf{R}=\text{Rodrigues}(\mathbf{rvec}),\; \mathbf{w2c}= \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix},\; \mathbf{c2w}=\mathbf{w2c}^{-1} \tag{E1} \]

Source: Project 4 Spec Part 0.3 (Pose estimation)
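A compact sketch of E1, assuming `K` and `dist` from the calibration step and one frame's `obj_pts` / `img_pts` correspondences:

```python
# E1: solvePnP -> (rvec, t) -> R = Rodrigues(rvec) -> w2c -> c2w.
import cv2
import numpy as np

ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, dist)
R, _ = cv2.Rodrigues(rvec)                 # 3x3 rotation from the Rodrigues vector

w2c = np.eye(4)
w2c[:3, :3] = R
w2c[:3, 3] = tvec.ravel()

c2w = np.linalg.inv(w2c)                   # camera-to-world pose consumed downstream
```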

0.4: Undistortion + Dataset Packaging

Distortion coefficients from calibration drive an undistortion pass before each frame is saved: the pipeline crops a center square, undistorts it (optionally with an optimal new camera matrix to handle the black boundaries undistortion introduces), resizes it, and writes the updated principal point into the dataset file. The paper with the ArUco tags was not taped down, so its corners can lift slightly, but the undistortion handles this without visible artifacts. The final dataset stores randomized train/val/test splits, camera poses, RGB images, and the focal length for later stages.
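A sketch of that crop → undistort → resize flow, assuming `img`, `K`, and `dist` from Parts 0.1–0.2; the crop-then-undistort order and the 200×200 output follow the description above, while the variable names are illustrative:

```python
# Center-crop, undistort, resize, and carry the intrinsics along.
import cv2
import numpy as np

h, w = img.shape[:2]
side = min(h, w)                                   # center square crop
y0, x0 = (h - side) // 2, (w - side) // 2
crop = img[y0:y0 + side, x0:x0 + side]

K_crop = K.copy()
K_crop[0, 2] -= x0                                 # shift the principal point into crop coords
K_crop[1, 2] -= y0

# alpha=0 asks for a new matrix that crops away the black borders undistortion creates.
K_new, _ = cv2.getOptimalNewCameraMatrix(K_crop, dist, (side, side), alpha=0)
undist = cv2.undistort(crop, K_crop, dist, None, K_new)

out = cv2.resize(undist, (200, 200), interpolation=cv2.INTER_AREA)
scale = 200.0 / side
K_out = K_new.copy()
K_out[:2] *= scale                                 # rescale fx, fy, cx, cy for the 200x200 crop
```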

Part 1: 2D Neural Field Warm-Up

I implemented positional encoding and an image-fitting MLP. Each batch samples 10k pixels, encodes normalized (x, y) coordinates with sinusoidal frequencies, and regresses RGB values via a 4-layer ReLU network. Adam with a 1e-2 learning rate minimizes MSE while PSNR tracks reconstruction quality.

Sinusoidal positional encoding (2D)

\[ \gamma(\mathbf{x})= \Big[ \mathbf{x},\; \{\sin(2^k\pi x),\cos(2^k\pi x)\}_{k=0}^{L-1},\; \{\sin(2^k\pi y),\cos(2^k\pi y)\}_{k=0}^{L-1} \Big] \tag{E2} \]

Source: Project 4 Spec Part 1 (Sinusoidal PE)
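A minimal PyTorch version of E2 for normalized 2D coordinates; with L = 10 it produces the 42-dimensional vector referenced below:

```python
# Sinusoidal positional encoding for (x, y) pairs; output dim = 2 + 4L = 42 when L = 10.
import torch

def positional_encoding(x: torch.Tensor, L: int = 10) -> torch.Tensor:
    """x: (B, 2) normalized (x, y) in [0, 1] -> (B, 2 + 4L) encoding."""
    feats = [x]                                     # keep the raw coordinates
    for k in range(L):
        feats.append(torch.sin(2.0 ** k * torch.pi * x))
        feats.append(torch.cos(2.0 ** k * torch.pi * x))
    return torch.cat(feats, dim=-1)
```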

The implementation mirrors the spec: random coordinate sampling keeps memory bounded; PE lifts each 2D pixel coordinate into a 42-D vector (L=10), which enables the network to learn high-frequency details like fur texture and fine-grained patterns; a Sigmoid clamps colors; and a lightweight dataloader interleaves the provided fox target with my own sand photo.
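A compact sketch of the image-fitting network itself, under the hyperparameters stated above (4 linear layers with ReLU, Sigmoid output, Adam at 1e-2); the `width=256` default and the class name are illustrative rather than the exact project code:

```python
# 2D neural field: 42-D positional encoding in, RGB in [0, 1] out.
import torch
import torch.nn as nn

class Field2D(nn.Module):
    def __init__(self, in_dim: int = 42, width: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 3), nn.Sigmoid(),      # Sigmoid clamps colors to [0, 1]
        )

    def forward(self, encoded_xy: torch.Tensor) -> torch.Tensor:
        return self.net(encoded_xy)

model = Field2D()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
# Each step: sample 10k pixels, encode with positional_encoding, minimize MSE.
```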

Training Progression

Hyperparameter Sweep (2×2 Grid)

Varying positional encoding frequency (L) and layer width reveals that frequency has a stronger impact on reconstruction quality than width: reducing L from 10 to 5 causes more detail loss than halving width from 256 to 128, suggesting that high-frequency encoding is crucial for capturing fine textures.

PSNR Curve

PSNR from MSE

\[ \text{PSNR} = 10\log_{10}\Big(\frac{1}{\text{MSE}}\Big) \tag{E3} \]

Source: Project 4 Spec Part 1 (PSNR metric)
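For completeness, E3 as used here assumes colors normalized to [0, 1] so the peak signal value is 1:

```python
# PSNR from MSE for [0, 1]-normalized images.
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```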

Part 2: Multi-view NeRF on the LEGO Scene

The NeRF implementation combines batched ray sampling, a view-dependent MLP, and a differentiable volume renderer. I trained on the provided LEGO dataset (100 training images, 10 validation, 60 test) with stratified sampling (near=2, far=6), 64 samples per ray, and 10k rays per iteration. Optimization follows the spec (Adam, lr=5e-4) while monitoring validation PSNR. Training was performed on a MacBook M1 Pro using PyTorch with MPS (Metal Performance Shaders) backend. To keep the math transparent, I restated the spec's camera relations so every later step references the same symbols.

Camera & projection

World→Camera (extrinsic):

\[ \mathbf{x}_c = \mathbf{R}\,\mathbf{x}_w + \mathbf{t}, \quad \mathbf{w2c} = \begin{bmatrix} \mathbf{R} & \mathbf{t}\\ \mathbf{0}^\top & 1 \end{bmatrix}, \quad \mathbf{c2w} = \mathbf{w2c}^{-1} \tag{E4} \]

Camera intrinsics:

\[ \mathbf{K}= \begin{bmatrix} f_x & 0 & c_x\\ 0 & f_y & c_y\\ 0 & 0 & 1 \end{bmatrix} \tag{E5} \]

Projection (pinhole):

\[ s\begin{bmatrix}u\\v\\1\end{bmatrix} = \mathbf{K}\begin{bmatrix}x_c\\y_c\\z_c\end{bmatrix}, \quad s = z_c \tag{E6} \]

Source: Project 4 Spec Part 2.1 (Create Rays from Cameras)

Starting from these relations, I invert \(\mathbf{K}\) to back-project each pixel center (with the spec's +0.5 offset), then push those points through the per-image pose to recover world-space rays. The two callouts under "Sections 2.1–2.3" expand on that back-projection and direction normalization, so the visualization grids have the exact algebra they reference.

Back-projection (pixels → camera)

\[ \tilde{\mathbf{u}}=\begin{bmatrix}u\\v\\1\end{bmatrix},\quad \mathbf{x}_c(s)=s\,\mathbf{K}^{-1}\tilde{\mathbf{u}} \tag{E7} \]

Source: Project 4 Spec Part 2.1 (Pixel → camera conversion)

Ray construction

\[ \mathbf{o}=\text{c2w}_{1:3,\,4},\quad \mathbf{p}_w=\mathbf{c2w} \begin{bmatrix} \mathbf{x}_c(1)\\ 1 \end{bmatrix},\quad \mathbf{d}=\frac{\mathbf{p}_w-\mathbf{o}}{\|\mathbf{p}_w-\mathbf{o}\|} \tag{E8} \]

Source: Project 4 Spec Part 2.1 (Pixel → ray)

Equation E7 undoes the pinhole projection so pixels become camera-space points, and E8 uses the camera-to-world pose to expose both the origin and the normalized ray direction that drive the visualization grids below.
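A sketch of that pixel-to-ray conversion for a batch of pixels, with assumed shapes noted in the comments; the +0.5 offset targets pixel centers as the spec prescribes:

```python
# E7 + E8: back-project pixel centers, then lift them through c2w into world-space rays.
import torch

def pixels_to_rays(K: torch.Tensor, c2w: torch.Tensor, uv: torch.Tensor):
    """K: (3, 3), c2w: (4, 4), uv: (B, 2) pixel coords -> origins (B, 3), dirs (B, 3)."""
    uv1 = torch.cat([uv + 0.5, torch.ones_like(uv[:, :1])], dim=-1)   # (B, 3) homogeneous pixels
    x_cam = (torch.linalg.inv(K) @ uv1.T).T                           # E7 with s = 1
    x_cam_h = torch.cat([x_cam, torch.ones_like(x_cam[:, :1])], dim=-1)
    p_world = (c2w @ x_cam_h.T).T[:, :3]                              # one world point per ray
    origin = c2w[:3, 3].expand_as(p_world)                            # E8: o = c2w[:3, 3]
    dirs = p_world - origin
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)                     # normalized directions
    return origin, dirs
```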

Stratified sampling along rays

\[ t_i = n + \Big(\frac{i+\epsilon_i}{N}\Big)(f-n),\; \epsilon_i\sim\mathcal{U}(0,1),\; \mathbf{x}_i=\mathbf{o}+t_i\,\mathbf{d}, \; i=0,\dots,N-1 \tag{E9} \]

Source: Project 4 Spec Part 2.2 (Sampling Points along Rays)

Here \(n=2\) and \(f=6\) are the LEGO near/far planes, \(N=64\) is the number of bins, and the \(\epsilon_i\) jitter keeps each iteration from sampling the exact same depths, which is precisely the stratified trick the spec calls out.
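A minimal sketch of that sampler; `perturb=False` falls back to bin centers (0.5) instead of uniform jitter, a common evaluation-time choice that is an assumption here rather than part of the spec:

```python
# Stratified sampling per E9 (near=2.0, far=6.0, N=64 for the LEGO scene).
import torch

def sample_along_rays(origins, dirs, near=2.0, far=6.0, n_samples=64, perturb=True):
    i = torch.arange(n_samples, dtype=torch.float32, device=origins.device)  # bins 0..N-1
    eps = (torch.rand(origins.shape[0], n_samples, device=origins.device)
           if perturb else 0.5)                                               # jitter within each bin
    t = near + (i + eps) / n_samples * (far - near)                           # (B, N) depths
    points = origins[:, None, :] + t[..., None] * dirs[:, None, :]            # (B, N, 3) samples
    return points, t
```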

NeRF MLP I/O

\[ (\sigma_i,\;\mathbf{c}_i) = F_\theta\big(\,\gamma(\mathbf{x}_i),\;\gamma(\mathbf{d})\,\big) \tag{E10} \]

Source: Project 4 Spec Part 2.4 (NeRF network)

\(F_\theta\) is the NeRF network: a deeper MLP than Part 1 that first predicts density \(\sigma_i \ge 0\) from the encoded 3D point and then conditions the color head on the encoded viewing direction to capture view-dependent highlights.

I keep the spec's skip connection (concatenate the positional encoding back into layer five) so geometric detail survives the deep trunk, and I feed the renderer the same transmittance-and-alpha quantities shown below so the provided reference assertion passes verbatim.
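A condensed sketch of \(F_\theta\) with that skip connection. The encoding sizes (63-D positions from L=10 and 27-D directions from L=4, both including the raw input) follow the original NeRF paper's defaults and are assumptions here; the project's exact head layout may differ:

```python
# NeRF MLP sketch: density from the encoded point, color conditioned on the view direction.
import torch
import torch.nn as nn

class NeRF(nn.Module):
    def __init__(self, pos_dim=63, dir_dim=27, width=256):
        super().__init__()
        self.trunk1 = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # Skip connection: re-inject the positional encoding at layer five.
        self.trunk2 = nn.Sequential(
            nn.Linear(width + pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Sequential(nn.Linear(width, 1), nn.ReLU())   # density >= 0
        self.feature = nn.Linear(width, width)
        self.color_head = nn.Sequential(
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),                       # RGB in [0, 1]
        )

    def forward(self, enc_x, enc_d):
        h = self.trunk1(enc_x)
        h = self.trunk2(torch.cat([h, enc_x], dim=-1))                    # skip connection
        sigma = self.sigma_head(h)
        rgb = self.color_head(torch.cat([self.feature(h), enc_d], dim=-1))
        return sigma, rgb
```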

Volume rendering

\[ C(\mathbf{r})=\int_{t=n}^{f} T(t)\,\sigma(t)\,\mathbf{c}(t)\,dt,\quad T(t)=\exp\Big(-\!\int_{n}^{t}\sigma(s)\,ds\Big) \tag{E11a} \]

Discrete form (with \(\delta_i=t_{i+1}-t_i\)):

\[ \hat{\mathbf{C}}=\sum_{i=1}^{N} T_i\,\alpha_i\,\mathbf{c}_i,\quad \alpha_i=1-e^{-\sigma_i\delta_i},\quad T_i=\prod_{j<i}(1-\alpha_j) \tag{E11b} \]

Source: Project 4 Spec Part 2.5 (Volume rendering)

\(T(t)\) is the accumulated transparency up to depth \(t\); in the discrete case \(T_i\) multiplies all "the ray survived so far" factors, while \(\alpha_i\) captures the chance the ray terminates in interval \(i\). Weighting each color \(\mathbf{c}_i\) by \(T_i\alpha_i\) makes the renderer physically interpretable and matches the spec's sanity check.
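A sketch of E11b as a differentiable PyTorch routine; the prepended column of ones implements the empty product \(T_1 = 1\), and the small 1e-10 term is a numerical-stability guard not present in the equation:

```python
# Discrete volume rendering (E11b); sigmas (B, N, 1), rgbs (B, N, 3), deltas (B, N).
import torch

def volrend(sigmas, rgbs, deltas):
    alphas = 1.0 - torch.exp(-sigmas.squeeze(-1) * deltas)           # alpha_i = 1 - exp(-sigma_i * delta_i)
    ones = torch.ones_like(alphas[..., :1])
    trans = torch.cumprod(
        torch.cat([ones, 1.0 - alphas + 1e-10], dim=-1), dim=-1
    )[..., :-1]                                                      # T_i = prod_{j<i}(1 - alpha_j)
    weights = trans * alphas                                         # per-sample weight T_i * alpha_i
    return (weights[..., None] * rgbs).sum(dim=-2)                   # (B, 3) composited colors
```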

Ray + Sample Visualization

Training Progression

Metrics

Spherical Rendering Video

Novel view synthesis on unseen test camera poses demonstrates the NeRF's ability to generalize beyond the training views.

Part 2.6: Training on My Yoda Dataset

Using the self-captured dataset (31 training images, 5 validation, 3 test), I reconfigured the pipeline for the object's metric scale: a closer near/far range (0.02–0.5 m), 250 rays per image (up from 100) to better capture view-dependent reflections on glossy surfaces, a learning rate of 8e-4 (up from 5e-4) to accelerate convergence, and 6k training iterations. The network width stayed at 256 units to handle the complex specular highlights and glossy textures, and training again ran on the MacBook M1 Pro with the PyTorch MPS backend.
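For reference, those changes gathered into one illustrative configuration dictionary (the key names are mine; the values are the ones listed above):

```python
# Part 2.6 overrides relative to the LEGO run.
yoda_config = {
    "near": 0.02, "far": 0.5,       # meters, matching the figurine's real-world scale
    "rays_per_image": 250,          # up from 100
    "lr": 8e-4,                     # up from 5e-4
    "iterations": 6000,
    "width": 256,                   # unchanged
    "device": "mps",                # Metal backend on the M1 Pro
}
```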

Ray + Sample Visualization

Reflection

Reimplementing NeRF end-to-end highlighted how tightly coupled each module is: a tiny bug in positional encoding, per-ray sampling, or extrinsic math manifests as blurry renders that resemble optimizer issues. Ray batching and stratified sampling were the trickiest pieces to debug, so the Viser visualizations proved invaluable. Collecting my own dataset also forced me to reason about near/far planes, intrinsics after cropping, and exposure consistency: all details that conveniently disappear when working with a curated dataset.