In this post, we discuss end-of-line (EOL) camera calibration, especially bird's-eye-view (BEV) extrinsic calibration of the cameras.

Suggested Pipeline

  1. Do single camera intrinsic calibration

  2. Measure each corner in the world coordinate

  3. Refine the corners’ coordinates according to board constraints (plane, parallel, equally spaced)

  4. Find the plane equation in the world coordinate

  5. Initialize each camera’s extrinsic parameters by solving the perspective-n-point (PnP) problem

  6. Estimate the 3D coordinates by the intersection of the image ray and board plane

  7. Construct the objective loss function (consistency between cameras on commonly seen points, plus each single camera's own observations)

Refine the Corners' Coordinates by the Board Constraints

Assume the center of the board is located at the origin of a local coordinate frame, $(0,0,0)^\top$, and the corners are located at $\{\mathbf{x}_i\}^N_{i=1}$; then the board normal is $(0,0,1)^\top$. Assume the board is transformed into the world frame by a rotation, $\mathbf{R}$, and a translation, $\mathbf{t}$, and that we observe the coordinates $\{\hat{\mathbf{y}}_i\}$. We then minimize the objective:

$$\{\mathbf{y}_i^*, \mathbf{R}^*, \mathbf{t}^*\} = \underset{\mathbf{y}_i,\mathbf{R}, \mathbf{t}}{\arg\min}\sum_i \|\hat{\mathbf{y}}_i -\mathbf{y}_i\|^2\\ \text{s.t. } \mathbf{y}_i = \mathbf{R}\mathbf{x}_i + \mathbf{t} $$

This problem has a closed-form solution: with known correspondences it is the orthogonal Procrustes (Kabsch) alignment, which is also the inner step of the Iterative Closest Point (ICP) algorithm.
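Since the correspondences between $\{\mathbf{x}_i\}$ and $\{\hat{\mathbf{y}}_i\}$ are known, the closed-form alignment takes only a few lines of numpy (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def fit_rigid(x, y):
    """Closed-form R, t minimizing sum_i ||y_i - (R x_i + t)||^2 (Kabsch)."""
    xc, yc = x.mean(axis=0), y.mean(axis=0)
    H = (x - xc).T @ (y - yc)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # flip the last axis if needed so R is a proper rotation (det = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = yc - R @ xc
    return R, t
```

The translation then gives the board center and $\mathbf{R}$ applied to $(0,0,1)^\top$ gives the board normal, as used below.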

Then we can get the center of the board as $\mathbf{x}_0=\mathbf{t}^*$, and the normal of the board is $\mathbf{R}^*[0, 0, 1]^\top$.

Now we have refined board corners in the world coordinates, i.e., $\{\mathbf{y}_i^*\}^N_{i=1}$.

Unknown Board Dimension

If we do not have the board dimension, then we can directly fit the plane using the measured corners, $\{\mathbf{x}_i\}^N_{i=1}$ (with a slight abuse of notation). Assume the plane equation is $\mathbf{n}^\top \mathbf{x} - c = 0$ (with unit normal vector, i.e., $\|\mathbf{n}\| = 1$). Then we can solve for $\mathbf{n}$ and $c$ by minimizing the following objective:

$$ \underset{\mathbf{n}, c}{\text{argmin}} \frac{1}{N}\sum_i \|\mathbf{n}^\top \mathbf{x}_i - c\|^2 \\ \text{s.t. } \|\mathbf{n}\| = 1 $$

For any fixed $\mathbf{n}$, taking derivative with respect to $c$ and setting it to zero, we have:

$$ c = \frac{1}{N} \sum_i \mathbf{n}^\top \mathbf{x}_i = \mathbf{n}^\top \bar{\mathbf{x}} $$

where $\bar{\mathbf{x}} = \frac{1}{N} \sum_i \mathbf{x}_i$ is the mean of the measured corners.

So, the objective function becomes:

$$ \begin{aligned} &\frac{1}{N}\sum_i \|\mathbf{n}^\top (\mathbf{x}_i - \bar{\mathbf{x}})\|^2 \\ &= \frac{1}{N}\mathbf{n}^\top \left(\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top\right) \mathbf{n} \\ & = \mathbf{n}^\top \mathbf{\Sigma}_x \mathbf{n} \end{aligned} $$

where $\mathbf{\Sigma}_x = \frac{1}{N} \sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top$ is the biased sample covariance matrix of the measured corners.

Considering $\|\mathbf{n}\| = 1$, $\mathbf{n}$ is the eigenvector of $\mathbf{\Sigma}_x$ corresponding to its smallest eigenvalue.

The refined corners' coordinates are the projections of the measured corners onto the plane, which can be written as:

$$ \mathbf{y}_i = \mathbf{x}_i - \left(\mathbf{n}^\top (\mathbf{x}_i - \bar{\mathbf{x}})\right) \mathbf{n} = \mathbf{x}_i - (\mathbf{n}^\top\mathbf{x}_i -c)\mathbf{n}, $$

which has a clear geometric meaning: each measured corner minus its component along the plane normal.
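A minimal numpy sketch of this eigen-decomposition fit and the projection (function names are illustrative):

```python
import numpy as np

def fit_plane(x):
    """Fit n^T x - c = 0 with unit n: n is the eigenvector of the
    sample covariance with the smallest eigenvalue."""
    mean = x.mean(axis=0)
    centered = x - mean
    cov = centered.T @ centered / len(x)
    w, v = np.linalg.eigh(cov)        # eigenvalues in ascending order
    n = v[:, 0]                       # smallest-eigenvalue eigenvector
    return n, n @ mean                # c = n^T x_bar

def project_to_plane(x, n, c):
    """y_i = x_i - (n^T x_i - c) n."""
    return x - np.outer(x @ n - c, n)
```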

The Intersection of the Image Ray and Board Plane

We are given a 3D plane (note that, from here on, we write the plane offset with a plus sign, unlike the $\mathbf{n}^\top \mathbf{x} - c = 0$ convention of the previous section):

$$\mathbf{n}^\top \mathbf{x} + c = 0,$$

and a camera model with the projection equation:

$$d\, \mathbf{u} = \mathbf{K} (\mathbf{R}\, \mathbf{x} + \mathbf{t}),$$

where:

  • $\mathbf{u} \in \mathbb{R}^3$ is the homogeneous image coordinate (usually $\mathbf{u} = [u, v, 1]^\top$)
  • $d$ is an unknown scale (depth) factor
  • $\mathbf{K}$ is the camera intrinsic matrix
  • $\mathbf{R}$ and $\mathbf{t}$ are the rotation and translation (extrinsic parameters) of the camera
  • $\mathbf{x} \in \mathbb{R}^3$ is the 3D point we wish to recover

Our goal is to find $\mathbf{x}$ in terms of the known quantities $\mathbf{u}, \mathbf{K}, \mathbf{R}, \mathbf{t}, \mathbf{n},$ and $c$.

Rewrite the Projection Equation

$$\mathbf{x} = \mathbf{R}^\top \left(d\,\mathbf{K}^{-1} \mathbf{u} - \mathbf{t}\right).$$

However, the scale factor $d$ is still unknown.

Use the Plane Constraint to Solve for $d$

Since $\mathbf{x}$ lies on the plane, it must satisfy:

$$\mathbf{n}^\top \mathbf{x} + c = 0.$$

Substitute the expression we just found for $\mathbf{x}$:

$$\mathbf{n}^\top \Bigl(\mathbf{R}^\top (d\,\mathbf{K}^{-1} \mathbf{u} - \mathbf{t})\Bigr) + c = 0.$$

Distributing $\mathbf{n}^\top \mathbf{R}^\top$ over the parenthesis, we obtain:

$$d\,\mathbf{n}^\top \mathbf{R}^\top \mathbf{K}^{-1} \mathbf{u} - \mathbf{n}^\top \mathbf{R}^\top \mathbf{t} + c = 0.$$

Now, solve for $d$:

$$d = \frac{\mathbf{n}^\top \mathbf{R}^\top \mathbf{t} - c}{\mathbf{n}^\top \mathbf{R}^\top \mathbf{K}^{-1} \mathbf{u}}.$$

Note: It is important that $\mathbf{n}^\top \mathbf{R}^\top \mathbf{K}^{-1} \mathbf{u} \neq 0$ so that the division is valid (this means that the back-projected ray is not parallel to the plane).

Substitute the expression for $d$ back into the equation for $\mathbf{x}$:

$$\mathbf{x} = \mathbf{R}^\top \left( \frac{\mathbf{n}^\top \mathbf{R}^\top \mathbf{t} - c}{\mathbf{n}^\top \mathbf{R}^\top \mathbf{K}^{-1} \mathbf{u}}\, \mathbf{K}^{-1} \mathbf{u} - \mathbf{t} \right).$$
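This closed form translates directly into numpy (the function name and the tolerance in the parallel-ray guard are assumptions):

```python
import numpy as np

def intersect_ray_plane(u, K, R, t, n, c):
    """Intersect the ray through pixel u with the plane n^T x + c = 0,
    under the projection model d K^{-1} [u, 1] = R x + t."""
    v = np.linalg.solve(K, np.array([u[0], u[1], 1.0]))   # K^{-1} [u, v, 1]^T
    denom = n @ R.T @ v
    if abs(denom) < 1e-12:
        raise ValueError("back-projected ray is (nearly) parallel to the plane")
    d = (n @ R.T @ t - c) / denom
    return R.T @ (d * v - t)
```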

Solve the Intersection Problem as an Optimization Problem

We wish to solve

$$ \min_{\mathbf{x}} \Big\| \begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix} - \frac{\mathbf{R}\mathbf{x} + \mathbf{t}}{\mathbf{e}^\top (\mathbf{R}\mathbf{x} + \mathbf{t})} \Big\|^2\tag{1} $$

subject to the plane constraint

$$ \mathbf{n}^\top \mathbf{x} + c = 0, $$

where:

  • $\mathbf{u} \in \mathbb{R}^2$ is the normalized image coordinate (i.e., with the intrinsics $\mathbf{K}$ already applied),
  • $\mathbf{R}$ is the rotation matrix,
  • $\mathbf{t}$ is the translation vector,
  • $\mathbf{n}$ is the normal of the plane,
  • $c$ is a constant, and
  • $\mathbf{e} = (0, 0, 1)^\top$ (which selects the depth coordinate).

Even when the measured image coordinate $\mathbf{u} \in \mathbb{R}^2$ is noisy, the pinhole camera model combined with the plane constraint induces a projective mapping (a homography) from the plane to the image. Moreover, for any $\mathbf{u}$ there exists an entire ray of points $\mathbf{x}$ that project exactly to $\mathbf{u}$; the plane constraint then selects one unique point on that ray.

  1. Parameterizing the Back‐Projection Ray

Any 3D point $\mathbf{x}$ whose projection is $\mathbf{u}$ must satisfy

$$ \frac{\mathbf{R}\mathbf{x} + \mathbf{t}}{\mathbf{e}^\top (\mathbf{R}\mathbf{x} + \mathbf{t})} = \begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix}. $$

This implies that there exists a scalar $\lambda$ such that

$$ \mathbf{R}\mathbf{x} + \mathbf{t} = \lambda \begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix}. $$

Solving for $\mathbf{x}$ gives

$$ \mathbf{x} = \mathbf{R}^{-1} \Bigl( \lambda \begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix} - \mathbf{t} \Bigr). $$

It is convenient to define the camera center in world coordinates as

$$ \mathbf{x}_c = -\mathbf{R}^{-1}\mathbf{t}, $$

and to introduce the direction vector

$$ \mathbf{d} = \mathbf{R}^{-1}\begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix}. $$

Thus, the 3D point $\mathbf{x}$ can be written as

$$ \mathbf{x} = \mathbf{x}_c + \lambda \mathbf{d}. $$

All points of this form project exactly to $\mathbf{u}$, which means the optimal value of Eq. (1) is zero.

  2. Enforcing the Plane Constraint

Since $\mathbf{x}$ must lie on the plane, we impose

$$ \mathbf{n}^\top \mathbf{x} + c = 0. $$

Substitute $\mathbf{x} = \mathbf{x}_c + \lambda\mathbf{d}$ into the plane equation:

$$ \mathbf{n}^\top \Bigl( \mathbf{x}_c + \lambda\mathbf{d} \Bigr) + c = 0. $$

Expanding, we obtain

$$ \mathbf{n}^\top \mathbf{x}_c + \lambda \mathbf{n}^\top \mathbf{d} + c = 0. $$

Recalling that $\mathbf{x}_c = -\mathbf{R}^{-1}\mathbf{t}$, the equation becomes

$$ \mathbf{n}^\top \mathbf{R}^{-1}\mathbf{t} + \lambda \mathbf{n}^\top \mathbf{d} + c = 0. $$

Solving for $\lambda$, we have

$$ \lambda \mathbf{n}^\top \mathbf{d} = \mathbf{n}^\top \mathbf{R}^{-1}\mathbf{t} - c, $$

or equivalently,

$$ \lambda = \frac{\mathbf{n}^\top \mathbf{R}^{-1}\mathbf{t} - c}{\mathbf{n}^\top \mathbf{d}} = \frac{\mathbf{n}^\top \mathbf{R}^{-1}\mathbf{t} - c}{\mathbf{n}^\top \mathbf{R}^{-1}\begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix}}, $$

provided that $\mathbf{n}^\top \mathbf{d} \neq 0$.

  3. The Final Solution

Substitute the value of $\lambda$ back into the expression for $\mathbf{x}$:

$$ \mathbf{x} = -\mathbf{R}^{-1}\mathbf{t} + \frac{\mathbf{n}^\top \mathbf{R}^{-1}\mathbf{t} - c}{\mathbf{n}^\top \mathbf{R}^{-1}\begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix}} \mathbf{R}^{-1}\begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix}. $$

This is the unique 3D point on the plane $\mathbf{n}^\top \mathbf{x} + c = 0$ that minimizes the reprojection error, even when $\mathbf{u}$ is noisy.

  4. Summary

a. Back‐Projection: The projection model

$$ \mathbf{x} \mapsto \frac{\mathbf{R}\mathbf{x} + \mathbf{t}}{\mathbf{e}^\top (\mathbf{R}\mathbf{x} + \mathbf{t})} $$

is a homography when restricted to the plane. Hence, every point along the ray

$$ \mathcal{R}(\mathbf{u}) = \Bigl\{ \mathbf{x} = -\mathbf{R}^{-1}\mathbf{t} + \lambda \mathbf{R}^{-1}\begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix} : \lambda \in \mathbb{R} \Bigr\} $$

projects to $\mathbf{u}$.

b. Plane Intersection: The plane constraint $\mathbf{n}^\top \mathbf{x} + c = 0$ uniquely determines the scalar $\lambda$, and hence the unique point $\mathbf{x}$.

Thus, the final answer is

$$ \boxed{ \mathbf{x} = -\mathbf{R}^{-1}\mathbf{t} + \frac{\mathbf{n}^\top \mathbf{R}^{-1}\mathbf{t} - c}{\mathbf{n}^\top \mathbf{R}^{-1}\begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix}} \mathbf{R}^{-1}\begin{pmatrix} \mathbf{u} \\ 1 \end{pmatrix}. } $$

Solve Step 3 and 6 Jointly as a Bundle Adjustment Problem

Suppose the board's marker coordinates in the board's frame are $\{ \mathbf{x}_i^g \}_{i=1}^N$. The measured marker points in the world frame are $\{ \mathbf{x}_i^o \}_{i=1}^N$. The observed image markers in camera 1 are $\{ \mathbf{u}_i^1 \}_{i=1}^{N_1}$, and those in camera 2 are $\{ \mathbf{u}_i^2 \}_{i=1}^{N_2}$.

We construct the loss as:

$$ \begin{aligned} \mathcal{L} &= \sum_i \ell_{\mathbf{T}_0, \mathbf{x}_i} (\mathbf{x}_i^g, \mathbf{x}_i) + \sum_i \ell_{\mathbf{x}_i} (\mathbf{x}_i^o, \mathbf{x}_i) \\ & + \sum_i \ell_{\mathbf{x}_i, \mathbf{T}_0, \mathbf{T}_1} \Big( \pi^{-1}(\mathbf{u}_i^1), \mathbf{x}_i \Big) + \sum_i \ell_{\mathbf{x}_i, \mathbf{T}_0, \mathbf{T}_2} \Big( \pi^{-1}(\mathbf{u}_i^2), \mathbf{x}_i \Big) \\ &+ \sum_{i \in C_{1,2}} \ell_{\mathbf{T}_0, \mathbf{T}_1, \mathbf{T}_2} \Big( \pi^{-1}(\mathbf{u}_i^1), \pi^{-1}(\mathbf{u}_i^2) \Big) \end{aligned} $$

where:

  • $\mathbf{T}_0$ is the transformation from the board's local frame to the world frame;
  • $\{\mathbf{x}_i\}$ are the world coordinates of the board markers;
  • $\mathbf{T}_1$ and $\mathbf{T}_2$ are the extrinsics of the two cameras;
  • $C_{1,2}$ is the set of markers commonly seen by camera 1 and camera 2;
  • $\pi(\cdot)$ is the camera projection, and $\pi^{-1}(\cdot)$ denotes its back-projection, i.e., the intersection of the camera ray and the world plane.

In the first part, we put a hard constraint that the true 3D positions of the points must be a rigid transformation of the board's local coordinates.

$$\sum_i \ell_{\mathbf{T}_0, \mathbf{x}_i}(\mathbf{x}_i^g, \mathbf{x}_i) = \sum_i I_{\infty}(\mathbf{x}_i - \mathbf{T}_0 \mathbf{x}_i^g)\tag{2}$$

where $I_\infty(x)= \begin{cases} 0, \text{ if } x=0, \\ \infty, \text{otherwise} \end{cases}$.

The second part of the loss is the difference between the observations of 3D points and the true position.

$$\sum_i \ell_{\mathbf{x}_i}(\mathbf{x}_i, \mathbf{x}_i^o) = \sum_i \|\mathbf{x}_i - \mathbf{x}_i^o\|^2 \tag{3}$$

The third part is the perspective-n-point loss:

$$\sum_{i\in C_1} \ell_{\mathbf{x}_i, \mathbf{T}_0,\mathbf{T}_1}\left(\pi^{-1}(\mathbf{u}_i^1), \mathbf{x}_i \right) = \sum_{i\in C_1} \| \pi^{-1}(\mathbf{u}_i^1) - \mathbf{x}_i \|^2\tag{4}$$

$$\sum_{i\in C_2} \ell_{\mathbf{x}_i, \mathbf{T}_0,\mathbf{T}_2}\left(\pi^{-1}(\mathbf{u}_i^2), \mathbf{x}_i \right) = \sum_{i\in C_2} \| \pi^{-1}(\mathbf{u}_i^2) - \mathbf{x}_i \|^2\tag{5}$$

The last term is the consistency loss. Here, we simply put an $\ell_2$ loss on the back-projected points; other losses are possible, e.g., an image similarity loss.

$$\sum_{i \in C_{1,2}} \ell_{\mathbf{T}_0, \mathbf{T}_1, \mathbf{T}_2}\left(\pi^{-1}(\mathbf{u}_i^1),\, \pi^{-1}(\mathbf{u}_i^2)\right) = \sum_{i\in C_{1,2}} \|\pi^{-1}(\mathbf{u}_i^1) - \pi^{-1}(\mathbf{u}_i^2)\|^2 \tag{6}$$
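As a hypothetical sketch, the soft terms (3)–(6) can be fed to `scipy.optimize.least_squares` after eliminating the hard constraint (2) by substituting $\mathbf{x}_i = \mathbf{T}_0 \mathbf{x}_i^g$. For simplicity, the sketch assumes both cameras share one intrinsic matrix and see all markers; all names and the rotation-vector parameterization are assumptions:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def pi_inv(u, K, R, t, n, c):
    """Back-project pixel u onto the world plane n^T x + c = 0."""
    v = np.linalg.solve(K, np.array([u[0], u[1], 1.0]))
    d = (n @ R.T @ t - c) / (n @ R.T @ v)
    return R.T @ (d * v - t)

def residuals(p, xg, xo, u1, u2, common, K):
    # p packs rotation vectors / translations of T0 (board) and T1, T2 (cameras)
    (r0, t0), (r1, t1), (r2, t2) = [(p[6*i:6*i+3], p[6*i+3:6*i+6]) for i in range(3)]
    R0 = Rotation.from_rotvec(r0).as_matrix()
    R1 = Rotation.from_rotvec(r1).as_matrix()
    R2 = Rotation.from_rotvec(r2).as_matrix()
    # hard constraint (2): x_i = T0 x_i^g, eliminating the x_i variables
    x = xg @ R0.T + t0
    # board plane in the world frame: normal R0 e_z, passing through t0
    n = R0 @ np.array([0.0, 0.0, 1.0])
    c = -n @ t0
    res = [(x - xo).ravel()]                                              # loss (3)
    res += [pi_inv(u1[i], K, R1, t1, n, c) - x[i] for i in range(len(u1))]  # loss (4)
    res += [pi_inv(u2[i], K, R2, t2, n, c) - x[i] for i in range(len(u2))]  # loss (5)
    res += [pi_inv(u1[i], K, R1, t1, n, c)
            - pi_inv(u2[i], K, R2, t2, n, c) for i in common]             # loss (6)
    return np.concatenate(res)

# usage: sol = least_squares(residuals, p_init, args=(xg, xo, u1, u2, common, K))
```

Starting `p_init` from the step-3 board fit and the step-5 PnP solutions keeps the solver in the right basin.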

Simulations

We construct an 8×8 chessboard with each square of size 1 m, and we set the origin at the center of the board. We place the board in the world coordinate frame with the extrinsic:

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 8 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$

which means the normal of the board plane is $[0, 0, 1]^\top$.

We add zero-mean Gaussian noise with standard deviation $0.005$ to the corners' coordinates.

With the noisy corners' coordinates, we can obtain the optimized board extrinsic according to step 3. The optimized normal is $[1.38\times 10^{-3}, -2.16\times 10^{-5}, 9.999\times 10^{-1}]^\top$, which is already very close to the ground truth.

Then we simulate two cameras with the same intrinsics: $f_x = 800$, $f_y = 800$, $c_x = 500$, $c_y = 400$, with image size $H = 800$, $W = 1000$. We rotate the first camera by $10$ degrees about the z-axis and the second by $-10$ degrees, and translate the first camera by $2$ meters along the x-axis and the second by $-2$ meters. The images of the chessboard in the two cameras are shown in the figure below.

[Figure: The original images.]

Then, we apply Gaussian blur on the observed images and add Gaussian noise on them. The new images are shown in the figure below.

[Figure: The simulated observed images.]

After that, we use OpenCV's “findChessboardCorners” to find the rough corners and then “cornerSubPix” to refine their coordinates. The refined corners are shown in the figure below.

[Figure: The detected corners.]

Pay attention to the order of the detected corners. (They start from the blue one and end with the green one in the figure above.) We rearrange the detected corners to match the order of the refined chessboard corners. Using the “solvePnP” function with the “SQPNP” method, which is a globally optimal solver, we estimate the extrinsics of the two cameras. We record the estimation errors using both the original measurements of the chessboard corners and the refined ones.

| Camera | Corners | Rotation Error (degrees) | Translation Error (meters) |
| --- | --- | --- | --- |
| Camera 1 | Original | 0.260 | 0.0372 |
| Camera 1 | Refined | 0.0732 | 0.0117 |
| Camera 2 | Original | 0.247 | 0.0318 |
| Camera 2 | Refined | 0.0750 | 0.0118 |

The translation error is the norm of the difference between the estimated and the ground-truth translations. The rotation error is the geodesic distance $\|\text{Log}(\mathbf{R}_{\text{gt}}^\top \mathbf{R}_{\text{est}})\|$, often written $\mathbf{R}_{\text{est}} \boxminus \mathbf{R}_{\text{gt}}$.
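The rotation error can be computed, for instance, from the trace of the relative rotation (a numpy sketch; the function name is illustrative):

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic distance: the rotation angle of R_gt^T R_est, in degrees."""
    cos_theta = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    # clip guards against values slightly outside [-1, 1] from round-off
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```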

As shown in the table, the refined measurements result in lower errors for both rotation and translation, demonstrating the effectiveness of the corner refinement process.

In the final step, we perform the joint optimization. We warp the image from camera 1 with the parameters obtained before and after the joint optimization. The results are shown in the figure below.

[Figure: The warped images (blended with the image from camera 2) before and after the joint optimization.]

In the above figure, we can hardly tell which one is better. After the joint optimization, the translation errors are $0.0298$ and $0.0257$ meters for camera 1 and camera 2, and the rotation errors are $0.204$ and $0.199$ degrees, respectively. As expected, the single-camera errors increase slightly, but the two cameras become more consistent with each other.

Detect Corners on the Warped Image

Corner detection often becomes challenging on original images due to severe distortion or extreme viewing angles. To address this limitation, we can create a virtual camera with an optimized viewpoint. By warping the original image onto this virtual camera’s image plane, we can simulate a more frontal view of the chessboard. This transformation adjusts the perspective so that the chessboard appears as if viewed head-on, significantly improving corner detection reliability. We can then accurately detect the corners on this warped image rather than struggling with the difficult perspective of the original.
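Because the board is planar, the warp to the virtual camera is a homography. One way to sketch it in numpy (the plane parameterization by $(\mathbf{x}_0, \mathbf{e}_1, \mathbf{e}_2)$ and the function name are assumptions):

```python
import numpy as np

def plane_to_image_homography(K, R, t, x0, e1, e2):
    """Homography H with pixel ~ H (a, b, 1)^T for the plane point
    x = x0 + a*e1 + b*e2, under the pinhole model u ~ K (R x + t)."""
    return K @ np.column_stack([R @ e1, R @ e2, R @ x0 + t])
```

Given the real camera's homography $H$ and a virtual frontal camera's homography $H'$ for the same plane, the warp from the original image to the frontal view is the composition $H' H^{-1}$ (applied with, e.g., cv2.warpPerspective), since it maps real pixels of plane points to virtual pixels.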

The figure below illustrates this approach. On the left is the original image where the severe viewing angle prevents the standard algorithm from successfully detecting the corners. On the right is the perspective-corrected warped image, where we can successfully identify and locate all corners.

[Figure: Left, the original observed image; right, the warped image with the detected corners.]