Overview & Pipeline
Part A recovers homographies from manually clicked correspondences via least squares, resamples the second image onto the reference canvas with both nearest-neighbor and bilinear interpolation, and composites the layers with feathered alpha masks. Part B swaps the manual correspondences for an automated pipeline: Harris responses are thinned with adaptive non-maximal suppression, each surviving corner gets a blurred 40×40 window that is subsampled to an 8×8 descriptor, Lowe-ratio matching proposes pairs, and four-point RANSAC estimates a homography whose inliers feed the same warping and blending code.
A.1: Shoot and Digitize Pictures
Each dataset was captured handheld with the camera pivoting about a fixed center of projection. I kept successive views closely spaced to preserve strong feature overlap.
A.2: Recover Homographies
I collect correspondences with a browser-based clicking tool and store them in (row, column) coordinates; image 1 is always the reference frame. Each correspondence contributes two equations to the linear system: one enforcing the warped column and one enforcing the warped row. The eight unknowns live in the first two rows and last column of the homography, so the coefficients use the image 2 coordinates while the right-hand side stores the image 1 target. Fixing the bottom-right entry of H to 1 rules out the trivial zero solution and moves the target pixels to the right-hand side, leaving an overdetermined system Ah = b. I solve it with np.linalg.lstsq(A, b)[0], append a trailing 1 to restore the fixed scale, and reshape the result into a 3×3 matrix. At least four correspondences are required, but I typically click eight or more to stabilize the fit.
\[ \underbrace{ \begin{bmatrix} \mathrm{im2x}_1 & \mathrm{im2y}_1 & 1 & 0 & 0 & 0 & -\mathrm{im2x}_1 \,\mathrm{im1x}_1 & -\mathrm{im2y}_1 \,\mathrm{im1x}_1 \\ 0 & 0 & 0 & \mathrm{im2x}_1 & \mathrm{im2y}_1 & 1 & -\mathrm{im2x}_1 \,\mathrm{im1y}_1 & -\mathrm{im2y}_1 \,\mathrm{im1y}_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \mathrm{im2x}_n & \mathrm{im2y}_n & 1 & 0 & 0 & 0 & -\mathrm{im2x}_n \,\mathrm{im1x}_n & -\mathrm{im2y}_n \,\mathrm{im1x}_n \\ 0 & 0 & 0 & \mathrm{im2x}_n & \mathrm{im2y}_n & 1 & -\mathrm{im2x}_n \,\mathrm{im1y}_n & -\mathrm{im2y}_n \,\mathrm{im1y}_n \end{bmatrix} }_{A} \underbrace{ \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} }_{\mathbf{h}} = \underbrace{ \begin{bmatrix} \mathrm{im1x}_1 \\ \mathrm{im1y}_1 \\ \vdots \\ \mathrm{im1x}_n \\ \mathrm{im1y}_n \end{bmatrix} }_{\mathbf{b}} \]
\(\mathrm{im2x}_i\) and \(\mathrm{im2y}_i\) are the column/row of correspondence \(i\) in the second image, while \(\mathrm{im1x}_i\) and \(\mathrm{im1y}_i\) are the matching coordinates in the reference image.
\[ H = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \]
I set the bottom-right entry to \(1\) after solving so the recovered vector reshapes cleanly into a homography matrix.
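For concreteness, here is a minimal sketch of that solver, assuming the matched points arrive as (N, 2) arrays of (x, y) coordinates (the name compute_homography is illustrative):

```python
import numpy as np

def compute_homography(pts2, pts1):
    """Solve for H such that pts1 ~ H @ pts2 in homogeneous coordinates.
    pts2 and pts1 are (N, 2) arrays of matched (x, y) points, N >= 4."""
    A, b = [], []
    for (x2, y2), (x1, y1) in zip(pts2, pts1):
        # One equation enforcing the warped x (column), one the warped y (row).
        A.append([x2, y2, 1, 0, 0, 0, -x2 * x1, -y2 * x1])
        A.append([0, 0, 0, x2, y2, 1, -x2 * y1, -y2 * y1])
        b.extend([x1, y1])

    h = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]
    return np.append(h, 1).reshape(3, 3)  # trailing 1 fixes the scale
```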
A.3: Warp the Images
When rectifying a single object I bound the warp to the reference quadrilateral supplied with the correspondences; otherwise I run all four corners of the original image through H to predict the warped bounding box. Each output pixel walks backwards through the transform: multiply by H^{-1}, divide by the homogeneous scale, and sanity-check the resulting source coordinates. The nearest-neighbor branch rounds those coordinates, while the bilinear branch retrieves the four surrounding samples and blends them with the rectangle-area weights computed in bilinear_interp_helper. The output array is clipped to [0, 255] and converted to uint8 before saving. On the game box example the timings are roughly 0.32 s with nearest neighbor and 0.74 s with bilinear; on the couch rectification the end-to-end warp takes about 0.31 s versus 0.74 s. Bilinear more than doubles the runtime but has much better visual quality (e.g., smoother edges), so I used it for Part B of this project.
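Below is a condensed sketch of the bilinear branch of this inverse warp, assuming H maps image 2 coordinates into the reference frame and the image stores channels last; the rectangle-area weights are inlined here rather than factored into bilinear_interp_helper:

```python
import numpy as np

def warp_image_bilinear(img, H, out_shape):
    """Inverse-warp img onto an (h_out, w_out) canvas: every output pixel
    is pulled back through H^{-1} and sampled bilinearly."""
    h_out, w_out = out_shape
    Hinv = np.linalg.inv(H)

    # Homogeneous grid of output pixels (x = column, y = row).
    ys, xs = np.mgrid[0:h_out, 0:w_out]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])

    # Back-project and divide by the homogeneous scale.
    src = Hinv @ pts
    sx, sy = src[0] / src[2], src[1] / src[2]

    # Sanity-check: keep coordinates that land inside the source image.
    h_src, w_src = img.shape[:2]
    valid = (sx >= 0) & (sx < w_src - 1) & (sy >= 0) & (sy < h_src - 1)

    x0 = np.floor(sx[valid]).astype(int)
    y0 = np.floor(sy[valid]).astype(int)
    dx = (sx[valid] - x0)[:, None]
    dy = (sy[valid] - y0)[:, None]

    # Rectangle-area weights over the four surrounding samples.
    top = (1 - dx) * img[y0, x0] + dx * img[y0, x0 + 1]
    bot = (1 - dx) * img[y0 + 1, x0] + dx * img[y0 + 1, x0 + 1]

    out = np.zeros((h_out, w_out, img.shape[2]))
    out.reshape(-1, img.shape[2])[valid] = (1 - dy) * top + dy * bot
    return np.clip(out, 0, 255).astype(np.uint8)
```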
A.4: Blend Images into a Mosaic
I read both images with their alpha channels intact, recover
H, and warp the partner frame with bilinear
interpolation. To align the two layers I push each image 2
correspondence through H, compare it with the reference
correspondence, and take the mean row/column offsets as a global
shift. That shift determines how both images are stamped onto a shared
canvas. For feathering I feed each alpha mask to
distance_transform_edt, normalize the two distance maps
so their per-pixel sum is one, broadcast the weights across the color
channels, and blend. Every recovered homography is also written to
disk for inspection.
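A sketch of that feathering step, assuming the two layers are already stamped onto the shared canvas and alpha1/alpha2 are their binary coverage masks:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def feather_blend(img1, img2, alpha1, alpha2):
    """Blend two aligned layers with distance-transform feathering:
    each pixel's weight grows with its distance from the mask boundary."""
    d1 = distance_transform_edt(alpha1)
    d2 = distance_transform_edt(alpha2)

    # Normalize so the per-pixel weights sum to one where coverage exists.
    total = d1 + d2
    total[total == 0] = 1  # avoid divide-by-zero outside both masks
    w1 = (d1 / total)[..., None]  # broadcast across color channels
    w2 = (d2 / total)[..., None]

    return (w1 * img1 + w2 * img2).astype(img1.dtype)
```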
B.1: Harris Corner Detection
I reuse the provided Harris detector, then feed its coordinate list to my adaptive non-maximal suppression (ANMS) routine. Corners are sorted by response strength, and each point records its suppression radius: the distance to the nearest corner whose response f_j satisfies f_i < c × f_j (here c = 0.9), i.e., the nearest corner that is sufficiently stronger. The 300 corners with the largest suppression radii survive, spreading the interest points across the frame. The heat map shows the raw Harris response; the overlays compare the dense detections with the ANMS selection. The parameter c controls how much stronger a nearby corner must be to suppress the current one: lowering c demands a noticeably stronger suppressor, while pushing c toward 1.0 lets even marginally stronger neighbors count. After ANMS we are left with a much smaller, well-spread set of interest points rather than the barrage we initially had.
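A compact sketch of the routine, assuming coords and responses arrive as NumPy arrays straight from the Harris detector (names illustrative):

```python
import numpy as np

def anms(coords, responses, n_keep=300, c=0.9):
    """Keep the n_keep corners with the largest suppression radii.
    coords is (N, 2) in (row, col); responses is (N,) Harris strengths."""
    order = np.argsort(-responses)  # strongest first
    coords, responses = coords[order], responses[order]

    radii = np.full(len(coords), np.inf)  # the global maximum is never suppressed
    for i in range(1, len(coords)):
        # A corner j suppresses i when responses[i] < c * responses[j];
        # only earlier (stronger) corners can qualify.
        stronger = c * responses[:i] > responses[i]
        if stronger.any():
            dists = np.linalg.norm(coords[:i][stronger] - coords[i], axis=1)
            radii[i] = dists.min()

    keep = np.argsort(-radii)[:n_keep]
    return coords[keep]
```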
B.2: Feature Descriptor Extraction
Around every ANMS corner I crop a 40×40 RGB window, apply a Gaussian blur with σ = 1, and subsample the result onto an 8×8 grid by taking every fifth pixel. Each descriptor is individually normalized to zero mean and unit standard deviation so Euclidean distance compares structure rather than raw brightness. Below are sample window/descriptor pairs from the High Floor View of Berkeley sequence. The window and descriptor sizes trade off context against descriptor length: 40×40 retains surrounding texture, while 8×8 keeps comparisons lightweight.
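A sketch of the extraction for one corner, assuming a float RGB image and skipping corners within 20 px of the border (border handling is omitted):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_descriptor(img, row, col):
    """Crop a 40x40 window centered on (row, col), blur, subsample to
    8x8, and normalize to zero mean / unit standard deviation."""
    window = img[row - 20:row + 20, col - 20:col + 20]
    blurred = gaussian_filter(window, sigma=(1, 1, 0))  # blur spatial axes only
    patch = blurred[::5, ::5]  # every fifth pixel: 40 / 5 = 8 per axis
    return (patch - patch.mean()) / patch.std()
```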
B.3: Feature Matching
Each descriptor from image 1 is compared against every descriptor from
image 2 with Euclidean distance. I retain both the closest and
second-closest matches; candidates pass the Lowe ratio test when
\(d_1 / d_2 < 0.8\), meaning the best match is
significantly better than the runner-up. The visualizations show both
the surviving correspondences and a few of the paired windows that led
to them. The Lowe ratio threshold itself measures how dominant the
best neighbor must be—lower ratios demand greater separation and
reject ambiguous matches.
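A sketch of the matcher over flattened descriptors; the brute-force distance matrix is fine at a few hundred interest points per image:

```python
import numpy as np

def lowe_match(desc1, desc2, ratio=0.8):
    """Match (N1, D) descriptors against (N2, D) with Euclidean distance
    and the Lowe ratio test; returns passing index pairs (i, j)."""
    dists = np.linalg.norm(desc1[:, None] - desc2[None, :], axis=2)

    matches = []
    for i, row in enumerate(dists):
        first, second = np.argsort(row)[:2]   # closest and second closest
        if row[first] / row[second] < ratio:  # best must clearly beat runner-up
            matches.append((i, first))
    return matches
```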
B.4: RANSAC and Automatic Mosaics
Four-point RANSAC repeatedly samples match quadruples (6k iterations for the Berkeley datasets, 7k for the house), builds a candidate homography, and reprojects every correspondence from image 2 into the reference view to measure the forward reprojection error. Points whose error falls below the dataset-specific threshold (listed with each result) are marked as inliers. The densest inlier set triggers a final least-squares fit using all of its correspondences, and the resulting homography flows through the same bilinear warp and feathered blend from Part A. Each mosaic compares the manual baseline with the fully automatic output. The RANSAC iteration count dictates how long the algorithm hunts for a good consensus, while the epsilon threshold sets the maximum reprojection error—measured in pixels—that an inlier is allowed to have.
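A sketch of the loop, reusing the compute_homography helper sketched in A.2, with the iteration count and pixel threshold exposed as the knobs described above:

```python
import numpy as np

def ransac_homography(pts2, pts1, n_iters=6000, eps=2.0):
    """Four-point RANSAC over matched (N, 2) arrays of (x, y) points;
    eps is the maximum forward reprojection error, in pixels."""
    rng = np.random.default_rng()
    best_inliers = np.zeros(len(pts2), dtype=bool)
    pts2_h = np.hstack([pts2, np.ones((len(pts2), 1))])  # homogeneous

    for _ in range(n_iters):
        sample = rng.choice(len(pts2), 4, replace=False)
        H = compute_homography(pts2[sample], pts1[sample])

        # Reproject every image-2 point into the reference view.
        proj = pts2_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        err = np.linalg.norm(proj - pts1, axis=1)

        inliers = err < eps
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers

    # Final least-squares fit over the densest inlier set.
    return compute_homography(pts2[best_inliers], pts1[best_inliers]), best_inliers
```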
Reflection
Implementing Part A highlighted the importance of having high-quality correspondences over merely a high quantity of them. Constructing an output grid that placed the reference image so the warped image would overlap it correctly was another challenge I worked through. For Part B, I initially set an error threshold that was too high, which kept the RANSAC algorithm from finding a good consensus. I learned the importance of tuning these thresholds carefully, erring on the side of accuracy.