In this post, we discuss the perspective-n-point (PnP) problem. We start with the problem definition, then introduce gradient-based optimization methods, and finally discuss two global optimization methods.
Problem Formulation
The core task of the Perspective-n-Point (PnP) problem is to determine the pose—specifically, the rotation and translation—of a calibrated camera in 3D space. This is achieved by using a set of known 3D points in the world and their corresponding 2D projections observed on the camera’s image sensor.
Goal
Given a set of n 3D points and their 2D image projections, the goal is to compute the rotation R and translation t of the camera, which together define its pose. This pose describes the transformation from the world coordinate system to the camera’s coordinate system.
Knowns (Inputs)
A set of n 3D points in the world coordinate frame:
$$ P_i = (X_i, Y_i, Z_i)^T \quad \text{for } i = 1, \dots, n $$
Corresponding 2D projections of these points on the image plane:
$$ p_i = (u_i, v_i)^T \quad \text{for } i = 1, \dots, n $$
The camera intrinsic matrix K, which is known because the camera is assumed to be calibrated. For modern cameras, we can assume there is no skew, so the matrix simplifies to:
$$ K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} $$
where $(f_x, f_y)$ are the focal lengths in pixels and $(c_x, c_y)$ is the principal point.
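To make the intrinsics concrete, here is a minimal numpy sketch; the focal lengths and principal point are made-up values for illustration:

```python
import numpy as np

# Hypothetical intrinsics for a 640x480 camera (made-up values).
fx, fy = 500.0, 500.0   # focal lengths in pixels
cx, cy = 320.0, 240.0   # principal point

K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# A normalized camera ray (X/Z, Y/Z, 1) maps to pixel coordinates via K.
ray = np.array([0.1, -0.2, 1.0])
pixel = K @ ray
print(pixel[:2])  # -> [370. 140.]
```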
Unknowns (Outputs)
- The rotation matrix R (a 3x3 matrix), which defines the camera’s orientation.
- The translation vector t (a 3x1 vector), which defines the camera’s position.
Mathematical Formulation
The relationship between a 3D world point $P_i$ and its 2D image projection $p_i$ is described by the pinhole camera model:
$$ s_i \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix} = K [R | t] \begin{pmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{pmatrix} = K (R P_i + t) $$
Here, $s_i$ is a non-zero scale factor representing the depth of the point in the camera’s coordinate frame. To see how $s_i$ is determined, let’s expand the term $R P_i + t$. If we denote the rows of the rotation matrix $\mathbf{R}$ as $r_1^T, r_2^T, r_3^T$, then the transformed point in the camera frame is:
$$ R P_i + t = \begin{pmatrix} r_1^T P_i + t_1 \\ r_2^T P_i + t_2 \\ r_3^T P_i + t_3 \end{pmatrix} $$
From the projection equation, the depth $s_i$ is the third component of this vector:
$$ s_i = r_3^T P_i + t_3 $$
Crucially, this shows that $s_i$ is not an independent variable to be solved for; it is implicitly determined by the pose $(\mathbf{R}, \mathbf{t})$ and the 3D point $P_i$.
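This relationship is easy to verify numerically. The sketch below uses a made-up pose (a rotation about the z-axis plus a translation) and checks that the depth equals the third component of the transformed point:

```python
import numpy as np

# Made-up pose: rotation about the z-axis by 30 degrees, plus a translation.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, -0.2, 2.0])

P = np.array([1.0, 2.0, 3.0])      # world point P_i
P_cam = R @ P + t                  # point in the camera frame

# The depth s_i is the third row of R times P_i, plus t_3:
s = R[2] @ P + t[2]
assert np.isclose(s, P_cam[2])     # s_i = r_3^T P_i + t_3
```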
The PnP problem is typically formulated as a non-linear optimization problem. We seek the rotation $\mathbf{R}$ and translation $\mathbf{t}$ that minimize the reprojection error. This error is the sum of squared distances between the observed 2D points $p_i$ and the projected 2D points $\hat{p}_i$, which are calculated by projecting the 3D points $P_i$ using the estimated pose $(\mathbf{R}, \mathbf{t})$.
The objective function to minimize is:
$$ \underset{\mathbf{R}, \mathbf{t}}{\text{argmin}} \sum_{i=1}^{n} \left\| p_i - \hat{p}_i(\mathbf{R}, \mathbf{t}, P_i) \right\|^2 $$
subject to the constraint that $\mathbf{R}$ is a valid rotation matrix:
$$ \mathbf{R}^T \mathbf{R} = I, \quad \det(\mathbf{R}) = 1 $$
Gradient-Based Optimization Methods
Since the PnP objective function is a non-linear least squares problem, it is well-suited to be solved with iterative optimization methods like Gauss-Newton or Levenberg-Marquardt. These algorithms start with an initial guess for the pose $(\mathbf{R}, \mathbf{t})$ and iteratively refine it by solving a linearized version of the problem at each step until the solution converges.
The core of this process is to compute the Jacobian of the residual function with respect to the pose parameters. The residual for a single point $i$ is the reprojection error:
$$ r_i(\mathbf{R}, \mathbf{t}) = p_i - \hat{p}_i(\mathbf{R}, \mathbf{t}, P_i) $$
We need to find the derivative of this residual to update our pose estimate.
The Challenge with Differentiating Rotation
A standard gradient cannot be directly computed for the rotation matrix $\mathbf{R}$. This is because rotation matrices belong to a special mathematical group called the special orthogonal group SO(3). This is a manifold, not a simple vector space, so the rules of standard calculus don’t apply. Adding two rotation matrices, for example, does not produce another rotation matrix.
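This failure of vector-space arithmetic is easy to demonstrate numerically. In the sketch below (`rot_z` is a small helper defined for the example), the sum of two rotation matrices is not orthogonal, while their product is:

```python
import numpy as np

def rot_z(theta):
    """Rotation about the z-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

R1, R2 = rot_z(0.3), rot_z(0.5)
S = R1 + R2
# The sum is not orthogonal, so it is not a rotation matrix:
print(np.allclose(S.T @ S, np.eye(3)))                   # -> False
# Composition by multiplication, however, stays on the manifold:
print(np.allclose((R1 @ R2).T @ (R1 @ R2), np.eye(3)))   # -> True
```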
To solve this, we optimize over a minimal 6-dimensional representation of the camera pose. This is done by parameterizing the updates in the Lie algebra associated with the pose group.
Lie Group SE(3): The camera pose (rotation $\mathbf{R}$ and translation $\mathbf{t}$) belongs to the special Euclidean group SE(3).
Lie Algebra se(3): Its associated Lie algebra se(3) is a 6-dimensional vector space, whose elements are written as
$$ \xi = (\omega^\top, v^\top)^\top $$
Here, $\omega$ is a 3-vector representing an infinitesimal rotation, and $v$ is a 3-vector representing an infinitesimal translation.
Perturbation: Instead of updating $\mathbf{R}$ and $\mathbf{t}$ directly, we apply a small perturbation $\delta\xi$ to the current pose estimate. The pose is updated using the exponential map, which maps an element from the Lie algebra back to the Lie group:
$$ \mathbf{T}_{\text{new}} = \exp(\delta\xi^\wedge) \mathbf{T}_{\text{old}} $$where $\mathbf{T}$ is the 4x4 transformation matrix and $\delta\xi^\wedge$ is the matrix representation of the Lie algebra vector.
This allows us to perform the optimization in a standard vector space (the 6D space of $\delta\xi$) while ensuring that the updated rotation matrix $\mathbf{R}$ remains a valid element of SO(3).
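The rotation part of this map can be sketched with the Rodrigues formula. This is an illustrative implementation (the helper names `hat` and `exp_so3` are ours); it shows that the exponential map always returns a valid element of SO(3), no matter which 3-vector we feed it:

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix w^ (so that (w^) x = w x x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(w):
    """Exponential map so(3) -> SO(3) via the Rodrigues formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3) + hat(w)  # first-order approximation near zero
    W = hat(w / theta)
    return np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)

# The result is always a valid rotation, however w is chosen:
R = exp_so3(np.array([0.2, -0.1, 0.4]))
assert np.allclose(R.T @ R, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)
```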
Deriving the Jacobian
Our goal is to compute the Jacobian of the residual $r_i$ with respect to the 6-vector perturbation $\delta\xi = (\delta\omega^\top, \delta v^\top)^\top$, where $\delta\omega$ is the rotational part and $\delta v$ is the translational part. We apply the chain rule.
First, let $P'_i = \mathbf{R} P_i + \mathbf{t}$ be the 3D point $P_i$ transformed into the camera’s coordinate frame. The projection function is:
$$ \hat{p}_i = \text{proj}(P'_i) = \begin{pmatrix} f_x \frac{X'_i}{Z'_i} + c_x \\ f_y \frac{Y'_i}{Z'_i} + c_y \end{pmatrix} $$
The chain rule gives us:
$$ J_i = \frac{\partial r_i}{\partial \delta\xi} = - \frac{\partial \hat{p}_i}{\partial P'_i} \frac{\partial P'_i}{\partial \delta\xi} $$
This breaks the problem into three parts:
1. Jacobian of the Projection Function
This is the derivative of the projection function with respect to the 3D point in the camera frame $P'_i = (X'_i, Y'_i, Z'_i)^T$. It is a standard 2x3 matrix:
$$ \begin{aligned} \frac{\partial \hat{p}_i}{\partial P'_i} &= \begin{bmatrix} \frac{\partial u_i}{\partial X'_i} & \frac{\partial u_i}{\partial Y'_i} & \frac{\partial u_i}{\partial Z'_i} \\ \frac{\partial v_i}{\partial X'_i} & \frac{\partial v_i}{\partial Y'_i} & \frac{\partial v_i}{\partial Z'_i} \end{bmatrix} \\ &= \begin{bmatrix} \frac{f_x}{Z'_i} & 0 & -\frac{f_x X'_i}{(Z'_i)^2} \\ 0 & \frac{f_y}{Z'_i} & -\frac{f_y Y'_i}{(Z'_i)^2} \end{bmatrix} \end{aligned} $$
2. Derivation of the Transformation Jacobian
This section provides a step-by-step derivation of how a small pose perturbation affects a 3D point in the camera’s coordinate frame, leading to the Jacobian of the transformation.
A camera’s pose is a rigid body transformation, which can be represented by a 4x4 matrix $\mathbf{T} \in SE(3)$. This matrix transforms a 3D point $P_i$ from world coordinates to camera coordinates, $P'_i$.
$$ \begin{pmatrix} P'_i \\ 1 \end{pmatrix} = \mathbf{T} \begin{pmatrix} P_i \\ 1 \end{pmatrix} $$
We apply a small perturbation to this pose. This “wiggle” is represented by a 6D vector $\delta\xi = (\delta\omega^\top, \delta v^\top)^\top$ from the Lie algebra se(3). To apply this to the pose, we use the exponential map:
$$ \mathbf{T}_{\text{new}} = \exp(\delta\xi^\wedge) \mathbf{T}_{\text{old}} $$
For a very small perturbation, we can use the first-order Taylor approximation of the exponential map: $\exp(\delta\xi^\wedge) \approx \mathbf{I} + \delta\xi^\wedge$. This is the key step that linearizes the problem.
$$ \begin{aligned} \begin{pmatrix} P'_{i, \text{new}} \\ 1 \end{pmatrix} &= \mathbf{T}_{\text{new}} \begin{pmatrix} P_i \\ 1 \end{pmatrix} \\ &\approx (\mathbf{I} + \delta\xi^\wedge) \mathbf{T}_{\text{old}} \begin{pmatrix} P_i \\ 1 \end{pmatrix} \end{aligned} $$
We can recognize that $\mathbf{T}_{\text{old}} \begin{pmatrix} P_i \\ 1 \end{pmatrix}$ is simply the homogeneous representation of the point already in the camera frame, $\begin{pmatrix} P'_i \\ 1 \end{pmatrix}$. This allows for a crucial substitution:
$$ \begin{pmatrix} P'_{i, \text{new}} \\ 1 \end{pmatrix} \approx (\mathbf{I} + \delta\xi^\wedge) \begin{pmatrix} P'_i \\ 1 \end{pmatrix} $$
This shows that we can calculate the new point by applying the perturbation directly to the old point in the camera frame. Let’s expand the matrix multiplication. The 4x4 matrix form of the perturbation is $\delta\xi^\wedge = \begin{pmatrix} \delta\omega^\wedge & \delta v \\ \mathbf{0} & 0 \end{pmatrix}$.
$$ \begin{aligned} \begin{pmatrix} P'_{i, \text{new}} \\ 1 \end{pmatrix} &\approx \begin{pmatrix} \mathbf{I} + \delta\omega^\wedge & \delta v \\ \mathbf{0} & 1 \end{pmatrix} \begin{pmatrix} P'_i \\ 1 \end{pmatrix} \\ &= \begin{pmatrix} (\mathbf{I} + \delta\omega^\wedge)P'_i + \delta v \\ 1 \end{pmatrix} \end{aligned} $$
Converting back to non-homogeneous 3D coordinates, we get the final expression for the perturbed point:
$$ P'_{i, \text{new}} \approx (\mathbf{I} + \delta\omega^\wedge)P'_i + \delta v = P'_i + \delta\omega^\wedge P'_i + \delta v $$
Using the cross-product identity $\delta\omega^\wedge P'_i = -(P'_i)^\wedge \delta\omega$, we arrive at the equation used for the derivative calculation:
$$ P'_{i, \text{new}} \approx P'_i - (P'_i)^\wedge \delta\omega + \delta v $$
From this equation, we can directly compute the partial derivatives with respect to the perturbation components $\delta\omega$ and $\delta v$, which form the columns of our 3x6 Jacobian matrix.
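The first-order approximation can be checked numerically. In this sketch (with illustrative helpers `hat` and `exp_so3`, and made-up point and perturbation values), the exactly perturbed point and the linearized expression agree to well within the size of the second-order terms:

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(w):
    """Rodrigues formula for the so(3) exponential map."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3) + hat(w)
    W = hat(w / theta)
    return np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)

P_cam = np.array([0.5, -0.3, 2.0])     # P'_i in the camera frame
dw = np.array([1e-4, -2e-4, 3e-4])     # small rotational perturbation
dv = np.array([2e-4, 1e-4, -1e-4])     # small translational perturbation

# Exactly perturbed point vs. the first-order expression derived above:
exact = exp_so3(dw) @ P_cam + dv
approx = P_cam - hat(P_cam) @ dw + dv  # P' - (P')^ dw + dv
assert np.allclose(exact, approx, atol=1e-6)
```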
3. The Final Jacobian Matrices
Combining these parts gives the final Jacobians. With the perturbation ordered as $\delta\xi = (\delta\omega^\top, \delta v^\top)^\top$, the Jacobian of the transformation has the rotation block first:
$$ \frac{\partial P'_i}{\partial \delta\xi} = \begin{bmatrix} \frac{\partial P'_{i}}{\partial \delta\omega} & \frac{\partial P'_{i}}{\partial \delta v} \end{bmatrix}= \begin{bmatrix} -(P'_i)^\wedge & \mathbf{I} \end{bmatrix}= \begin{bmatrix} 0 & Z'_i & -Y'_i & 1 & 0 & 0 \\ -Z'_i & 0 & X'_i & 0 & 1 & 0 \\ Y'_i & -X'_i & 0 & 0 & 0 & 1 \end{bmatrix} $$
And the complete Jacobian of the residual for a single point is:
$$ J_i = - \frac{\partial \hat{p}_i}{\partial P'_i} \frac{\partial P'_i}{\partial \delta\xi} $$
The Optimization Loop
There are various gradient-based optimization methods. Here we will introduce the Gauss-Newton algorithm.
The Gauss-Newton algorithm proceeds as follows:
Initialize: Start with an initial guess for the pose $(\mathbf{R}_0, \mathbf{t}_0)$.
Iterate: For each iteration $k$:
a. For every point $i$, compute the residual $r_i(\mathbf{R}_k, \mathbf{t}_k)$ and the Jacobian $J_i$.
b. Construct the overall linear system:
$$ \left( \sum_i J_i^T J_i \right) \delta\xi = - \sum_i J_i^T r_i $$
c. Solve for the 6D update vector $\delta\xi = (\delta\omega^\top, \delta v^\top)^\top$.
d. Update the pose using the exponential map:
$$ \mathbf{R}_{k+1} = \exp(\delta\omega^\wedge) \mathbf{R}_k \\ \mathbf{t}_{k+1} = \mathbf{t}_k + \delta v $$
Terminate: Repeat until the update $\delta\xi$ is very small or a maximum number of iterations is reached.
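The loop above can be sketched end-to-end in numpy. This is a minimal illustration, not a production solver: the intrinsics, point configuration, true pose, and initial guess are all made up, and there is no damping or robust loss. Synthetic 2D observations are generated from a known pose, and the solver recovers that pose from a rough initial guess:

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0.0]])

def exp_so3(w):
    """Rodrigues formula for the so(3) exponential map."""
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3) + hat(w)
    W = hat(w / th)
    return np.eye(3) + np.sin(th) * W + (1 - np.cos(th)) * (W @ W)

def project(K, R, t, Pw):
    """Pinhole projection of a world point into pixel coordinates."""
    Pc = R @ Pw + t
    return np.array([K[0, 0] * Pc[0] / Pc[2] + K[0, 2],
                     K[1, 1] * Pc[1] / Pc[2] + K[1, 2]])

def gauss_newton_pnp(K, pts3d, pts2d, R, t, iters=20):
    for _ in range(iters):
        H, b = np.zeros((6, 6)), np.zeros(6)
        for Pw, p in zip(pts3d, pts2d):
            Pc = R @ Pw + t
            X, Y, Z = Pc
            # 2x3 Jacobian of the projection w.r.t. the camera-frame point
            Jp = np.array([[K[0, 0] / Z, 0, -K[0, 0] * X / Z**2],
                           [0, K[1, 1] / Z, -K[1, 1] * Y / Z**2]])
            # 3x6 Jacobian of the camera-frame point w.r.t. (dw, dv)
            Jt = np.hstack([-hat(Pc), np.eye(3)])
            J = -Jp @ Jt                     # 2x6 Jacobian of the residual
            r = p - project(K, R, t, Pw)     # reprojection residual
            H += J.T @ J
            b += -J.T @ r
        dxi = np.linalg.solve(H, b)          # 6D update (dw, dv)
        R = exp_so3(dxi[:3]) @ R             # update rotation on the manifold
        t = t + dxi[3:]
        if np.linalg.norm(dxi) < 1e-10:
            break
    return R, t

# Synthetic problem: recover a known pose from exact correspondences.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R_true = exp_so3(np.array([0.1, -0.2, 0.15]))
t_true = np.array([0.2, -0.1, 4.0])
pts3d = np.array([[0, 0, 1], [1, 0, 1.5], [0, 1, 2], [1, 1, 1],
                  [-1, 0.5, 1.2], [0.5, -1, 1.8]], dtype=float)
pts2d = np.array([project(K, R_true, t_true, P) for P in pts3d])

R_est, t_est = gauss_newton_pnp(K, pts3d, pts2d,
                                np.eye(3), np.array([0.0, 0.0, 3.0]))
```

With noise-free correspondences and a reasonable initial guess, the estimate matches the true pose to numerical precision; with real data, a damped variant such as Levenberg-Marquardt is usually preferred.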
Global Optimization Methods
While gradient-based methods are effective, they can get stuck in local minima, especially when the initial guess is poor. Global optimization methods reformulate the problem so that the globally optimal solution can be found regardless of initialization.
The DLS Method (Zheng et al., 2013 [1])
A foundational approach to global optimization is the Direct Least Squares (DLS) method [1]. This technique tackles the problem by directly finding all the stationary points of the objective function, one of which must be the global minimum.
- KKT Conditions: The method starts with the Karush-Kuhn-Tucker (KKT) conditions, which are the necessary conditions for a solution to be optimal in a constrained optimization problem.
- Polynomial System: The KKT conditions are transformed into a system of multivariate polynomial equations.
- Gröbner Basis Solvers: A powerful algebraic geometry tool, the Gröbner basis, is used to solve this system of polynomial equations. This yields a finite number of candidate solutions (stationary points).
- Global Minimum Selection: Finally, the algorithm evaluates the original objective function for each of these candidate solutions and selects the one with the lowest reprojection error, which is guaranteed to be the global optimum.
The DLS method provides a strong theoretical guarantee of optimality and has inspired numerous follow-up works that have improved its speed and stability.
The SQPnP Method (Terzakis & Lourakis, 2020 [2])
One of the leading modern approaches is the SQPnP (Sequential Quadratic Programming for PnP) algorithm [2]. This method offers a consistently fast and globally optimal solution. Its core innovations are:
- Quadratic Programming Formulation: The PnP problem is reformulated as a non-linear quadratic program. The objective function is designed to be quadratic, and the rotation constraint ($\mathbf{R}^T \mathbf{R} = I$) is also expressed using quadratic equations.
- Parameter Space Decomposition: The algorithm cleverly divides the parameter space of possible rotations into a finite number of regions. It provides a guarantee that at least one of these regions contains the global minimum.
- Sequential Quadratic Programming: Within each of these identified regions, a fast and efficient Sequential Quadratic Programming (SQP) scheme is used to find the unique regional minimum.
- Global Optimum Guarantee: By comparing the minima found in each region, the algorithm can identify the true global minimum with certainty.
A key advantage of SQPnP is its efficiency and robustness. It works for any number of points ($n \ge 3$), is not affected by coplanar point arrangements, and has been integrated into popular libraries like OpenCV, making it a go-to choice for practical applications.
References
[1] Y. Zheng, Y. Kuang, S. Sugimoto, K. Astrom, and M. Okutomi, “Revisiting the PnP problem: A fast, general and optimal solution,” in 2013 IEEE International Conference on Computer Vision, Sydney, Australia: IEEE, Dec. 2013, pp. 2344–2351. doi: 10.1109/ICCV.2013.291.
[2] G. Terzakis and M. Lourakis, “A consistently fast and globally optimal solution to the perspective-n-point problem,” in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I, 2020, pp. 478–494.