In this post, we will discuss the Kalman filter from two perspectives: classical parameter estimation and Bayesian estimation. Each perspective provides a unique way to derive the Kalman filter. Parameter estimation is more flexible: it allows you to easily add constraints, revise the transition and observation models, and derive related smoothing and filtering methods. Meanwhile, Bayesian estimation is more intuitive and provides a clearer probabilistic interpretation, helping us understand the underlying principles better. By examining both approaches, we can gain a more comprehensive understanding of how the Kalman filter works.

Kalman Filter

In this section, we will give an introduction to the Kalman filter.

State Transition and Observation Model

Figure: graphical model of the state transition and observation models (the signal follows a Markov process).

Suppose we have a linear system with the state transition model:

$$ \mathbf{x}_{k} = \mathbf{F}_{k} \mathbf{x}_{k-1} + \mathbf{G}_k\mathbf{u}_k +\mathbf{w}_{k}, \tag{1} $$

where $\mathbf{x}_k \in \mathbb{R}^{n}$ is the state vector, $\mathbf{F}_k \in \mathbb{R}^{n \times n}$ is the state transition matrix, $\mathbf{G}_k \in \mathbb{R}^{n \times p}$ is the control input matrix, $\mathbf{u}_k \in \mathbb{R}^{p}$ is the control input, and $\mathbf{w}_{k} \in \mathbb{R}^{n}$ is the process noise. The process noise is Gaussian distributed with zero mean and covariance matrix $\mathbf{Q}_{k}$.

The measurement model is given by:

$$ \mathbf{y}_k = \mathbf{H}_k \mathbf{x}_k + \mathbf{v}_k, \tag{2} $$

where $\mathbf{y}_k \in \mathbb{R}^q$ is the measurement vector, $\mathbf{H}_k \in \mathbb{R}^{q \times n}$ is the measurement matrix, and $\mathbf{v}_k \in \mathbb{R}^q$ is the measurement noise, which is Gaussian distributed with zero mean and covariance matrix $\mathbf{R}_k$.

Predict Step

In the prediction step, we predict the next state, $\hat{\mathbf{x}}_{k|k-1}$, and its covariance matrix, $\mathbf{P}_{k|k-1}$, based on the previous estimated state, $\hat{\mathbf{x}}_{k-1}$, its estimated covariance matrix, $\mathbf{P}_{k-1}$, and the control input, $\mathbf{u}_{k}$.

The predicted state is given by:

$$ \hat{\mathbf{x}}_{k|k-1} = \mathbf{F}_{k} \hat{\mathbf{x}}_{k-1} + \mathbf{G}_k \mathbf{u}_{k}, $$

and the predicted covariance matrix is given by:

$$ \mathbf{P}_{k|k-1} = \mathbf{F}_{k} \mathbf{P}_{k-1} \mathbf{F}_{k}^{\top} + \mathbf{Q}_{k}. $$

Update Step

In the update step, we update the estimation of the state and covariance given the measurement, $\mathbf{y}_k$.

The update step for the gain matrix is given by:

$$ \mathbf{K}_k = \mathbf{P}_{k|k-1} \mathbf{H}^\top_k\left(\mathbf{H}_k\mathbf{P}_{k|k-1}\mathbf{H}_k^\top+\mathbf{R}_k\right)^{-1}, $$

The updated state is given by:

$$ \hat{\mathbf{x}}_{k} = \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k(\mathbf{y}_k - \mathbf{H}_k\hat{\mathbf{x}}_{k|k-1}), $$

and the updated covariance matrix is:

$$ \mathbf{P}_{k} = (\mathbf{I} - \mathbf{K}_k\mathbf{H}_k)\mathbf{P}_{k|k-1}. $$

This is the standard (gain-based) update form.
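As a concrete reference, here is a minimal NumPy sketch of the predict and update (gain-based) steps above. The model matrices and the toy one-dimensional constant-velocity example at the end are assumed values for illustration, not part of the original derivation.

```python
import numpy as np

def kf_predict(x_prev, P_prev, F, G, u, Q):
    """Predict step: propagate the state estimate and its covariance."""
    x_pred = F @ x_prev + G @ u
    P_pred = F @ P_prev @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, y, H, R):
    """Update step (gain form): correct the prediction with measurement y."""
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)    # corrected state
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Toy constant-velocity example (assumed values).
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
G = np.zeros((2, 1)); u = np.zeros(1)        # no control input here
H = np.array([[1.0, 0.0]])                   # observe position only
Q = 0.01 * np.eye(2); R = np.array([[0.25]])
x, P = np.zeros(2), np.eye(2)
x_pred, P_pred = kf_predict(x, P, F, G, u, Q)
x, P = kf_update(x_pred, P_pred, np.array([1.2]), H, R)
```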

An Alternative Form of the Update Step

In the above update step, we can rewrite the gain matrix as:

$$ \mathbf{K}_k = \left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{H}_k + \mathbf{P}_{k|k-1}^{-1}\right)^{-1}\mathbf{H}_k^\top \mathbf{R}_k^{-1}, $$

this is due to the fact that:

$$ \begin{aligned} \mathbf{P}_{k|k-1} \mathbf{H}^\top_k\left(\mathbf{H}_k\mathbf{P}_{k|k-1}\mathbf{H}_k^\top+\mathbf{R}_k\right)^{-1} &= \left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{H}_k + \mathbf{P}_{k|k-1}^{-1}\right)^{-1}\mathbf{H}_k^\top \mathbf{R}_k^{-1} \\ &\Leftrightarrow \quad \\ \left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{H}_k + \mathbf{P}_{k|k-1}^{-1}\right)\mathbf{P}_{k|k-1} \mathbf{H}^\top_k &= \mathbf{H}_k^\top \mathbf{R}_k^{-1} \left(\mathbf{H}_k\mathbf{P}_{k|k-1}\mathbf{H}_k^\top+\mathbf{R}_k\right) \\ &\Leftrightarrow \quad \\ \mathbf{H}_k^\top\mathbf{R}_k^{-1}\mathbf{H}_k\mathbf{P}_{k|k-1}\mathbf{H}_k^\top + \mathbf{H}_k^\top &= \mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{H}_k\mathbf{P}_{k|k-1}\mathbf{H}_k^\top + \mathbf{H}_k^\top. \end{aligned} $$

And,

$$ \begin{aligned} \left(\mathbf{I} - \mathbf{K}_k\mathbf{H}_k \right)\mathbf{P}_{k|k-1} &= \mathbf{P}_{k|k-1} - \mathbf{P}_{k|k-1}\mathbf{H}_k^\top\left(\mathbf{H}_k\mathbf{P}_{k|k-1}\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}\mathbf{H}_k\mathbf{P}_{k|k-1} \\ &= \left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{H}_k + \mathbf{P}_{k|k-1}^{-1}\right)^{-1} \end{aligned}. $$

The second equality follows from the Woodbury matrix identity.

So, the two update steps are:

$$ \begin{aligned} \mathbf{P}_{k} &= \left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{H}_k + \mathbf{P}_{k|k-1}^{-1}\right)^{-1}, \\ \hat{\mathbf{x}}_{k} &= \hat{\mathbf{x}}_{k|k-1} + \mathbf{P}_k\mathbf{H}_k^\top\mathbf{R}_k^{-1}(\mathbf{y}_k - \mathbf{H}_k\hat{\mathbf{x}}_{k|k-1}). \end{aligned} $$

This form of the update step is called the information (inverse-covariance) form.

Practical Considerations

Standard vs. Information Form

Which form is better? Mathematically, they are equivalent. The standard form requires inverting a $q \times q$ matrix, while the information form requires inverting an $n \times n$ matrix. If $n$ and $q$ are comparable, both forms are equally suitable. However, when one dimension is significantly larger than the other, it’s better to choose the form that requires inverting the smaller matrix to reduce computational complexity.
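The equivalence is easy to check numerically. The sketch below, with assumed random positive definite matrices, runs one update in both forms and confirms that they agree; note that the gain form inverts a $q \times q$ matrix while the information form inverts $n \times n$ matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 4, 2
A = rng.standard_normal((n, n)); P_pred = A @ A.T + n * np.eye(n)   # predicted covariance
B = rng.standard_normal((q, q)); R = B @ B.T + q * np.eye(q)        # measurement noise covariance
H = rng.standard_normal((q, n))
x_pred = rng.standard_normal(n)
y = rng.standard_normal(q)

# Standard (gain-based) form: invert a q x q matrix.
K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
x_std = x_pred + K @ (y - H @ x_pred)
P_std = (np.eye(n) - K @ H) @ P_pred

# Information (inverse-covariance) form: invert n x n matrices.
P_inf = np.linalg.inv(H.T @ np.linalg.inv(R) @ H + np.linalg.inv(P_pred))
x_inf = x_pred + P_inf @ H.T @ np.linalg.inv(R) @ (y - H @ x_pred)

print(np.allclose(x_std, x_inf), np.allclose(P_std, P_inf))         # True True
```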

Tracking

The Kalman filter is widely used in tracking problems, where it provides an efficient way to predict the current state given the previous observations. A very important step in the tracking problem is the data association step: given the state prediction and the current observation, we need to decide whether the observation comes from the tracked object.

Let $\boldsymbol{\nu}$ denote the residual, i.e.,

$$ \boldsymbol{\nu} = \mathbf{y}_k - \mathbf{H}_k\hat{\mathbf{x}}_{k|k-1}. $$

If $\mathbf{y}_k$ comes from the tracked object, then $\boldsymbol{\nu}$ follows $\mathcal{N}(\mathbf{0}, \mathbf{H}_k\mathbf{P}_{k|k-1}\mathbf{H}_k^\top+\mathbf{R}_k)$. We are therefore handling a hypothesis testing problem: whether $\boldsymbol{\nu}$ comes from the desired distribution, $\mathcal{N}(\mathbf{0}, \mathbf{H}_k\mathbf{P}_{k|k-1}\mathbf{H}_k^\top+\mathbf{R}_k)$.

We choose $s^2= \boldsymbol{\nu}^\top \mathbf{S}_k \boldsymbol{\nu}$ as the test statistic, where $\mathbf{S}_k = \left(\mathbf{H}_k\mathbf{P}_{k|k-1}\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}$. Then, $s^2$ follows a chi-square distribution with $q$ degrees of freedom, and $s$ is called the Mahalanobis distance.

We choose a threshold $\gamma$ such that $P(s^2 \le \gamma) = 0.95$, and we accept the hypothesis that $\mathbf{y}_k$ comes from the tracked object whenever $s^2 \le \gamma$.
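A gating routine built on this test might look like the sketch below (assuming SciPy is available for the chi-square quantile); the function name and signature are illustrative, not from the original post.

```python
import numpy as np
from scipy.stats import chi2

def gate(y, x_pred, H, P_pred, R, prob=0.95):
    """Return True if y is accepted as coming from the tracked object."""
    nu = y - H @ x_pred                      # residual (innovation)
    S = H @ P_pred @ H.T + R                 # innovation covariance
    s2 = nu @ np.linalg.solve(S, nu)         # squared Mahalanobis distance
    gamma = chi2.ppf(prob, df=len(y))        # threshold: P(s^2 <= gamma) = prob
    return s2 <= gamma
```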

Parameter Estimation

In this section, we will explain Kalman filter from the perspective of parameter estimation. First, we will introduce the minimum variance linear unbiased estimator. Then, we will show how to formulate Kalman filter as the minimum variance linear unbiased estimation problem. The materials in this section are from [4]. You can find more interesting results in [4], e.g., the extended Kalman filter, predictive estimates, smoothed estimates, fading memory, fixed-lag smoothing, etc.

The Minimum Variance Linear Unbiased Estimator

In the parameter estimation problem, we treat the parameter as deterministic.

Suppose we have a linear observation system:

$$ \mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{v}, $$

where $\mathbf{v}$ is zero-mean with covariance matrix $\mathbf{R}$, $\mathbf{x}$ is the deterministic parameter to be estimated, and $\mathbf{y}$ is the measurement data. We seek a linear unbiased estimator, i.e., an estimate that is a linear function of the observation,

$$ \hat{\mathbf{x}} = \mathbf{K}\mathbf{y}, $$

that minimizes the mean square error,

$$ \min_{\mathbf{K}} \mathbb{E} \| \mathbf{x} - \hat{\mathbf{x}}\|^2. $$

Recall that an unbiased estimator means $\mathbb{E}[\mathbf{K}\mathbf{y}] = \mathbf{x}$. Substituting $\mathbf{y}=\mathbf{H}\mathbf{x}+\mathbf{v}$ into this condition, we have:

$$ \mathbb{E}[\mathbf{K}\mathbf{y}] = \mathbb{E}[\mathbf{K}(\mathbf{H}\mathbf{x}+\mathbf{v})] = \mathbf{K}\mathbf{H}\mathbf{x} + \mathbb{E}[\mathbf{K}\mathbf{v}] = \mathbf{x}. $$

So, we must have $\mathbf{K}\mathbf{H} = \mathbf{I}$.

Substituting $\hat{\mathbf{x}}$ into the objective function, we have:

$$ \begin{aligned} \mathbb{E} \| \mathbf{x} - \hat{\mathbf{x}}\|^2 &= \mathbb{E} \| \mathbf{x} - \mathbf{K} \mathbf{y} \|^2 \\ &= \mathbb{E} \| \mathbf{x} - \mathbf{K} (\mathbf{H} \mathbf{x} + \mathbf{v}) \|^2 \\ &= \mathbb{E} \| (\mathbf{I} - \mathbf{K} \mathbf{H}) \mathbf{x} - \mathbf{K} \mathbf{v} \|^2 \\ &= \mathbb{E}\left[ \mathbf{x}^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H})^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H}) \mathbf{x} \right] + \mathbb{E}\left[ \mathbf{v}^{\top} \mathbf{K}^{\top} \mathbf{K} \mathbf{v} \right] \\ &\quad + \mathbb{E}\left[ \mathbf{x}^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H})^{\top} \mathbf{K} \mathbf{v} \right] + \mathbb{E}\left[ \mathbf{v}^{\top} \mathbf{K}^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H}) \mathbf{x} \right] \\ &= \mathbb{E}\left[ \mathbf{x}^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H})^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H}) \mathbf{x} \right] + \mathbb{E}\left[ \mathbf{v}^{\top} \mathbf{K}^{\top} \mathbf{K} \mathbf{v} \right] \\ &= \mathbb{E}\left[ \mathbf{x}^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H})^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H}) \mathbf{x} \right] + \text{tr}(\mathbf{K}^\top \mathbf{K}\mathbf{R}) \\ &=\mathbb{E}\left[ \mathbf{x}^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H})^{\top} (\mathbf{I} - \mathbf{K} \mathbf{H}) \mathbf{x} \right] + \text{tr}(\mathbf{K}\mathbf{R}\mathbf{K}^\top) \end{aligned} $$

Since $\mathbf{K}\mathbf{H} = \mathbf{I}$, the first term vanishes, and finding the minimum variance linear unbiased estimator is equivalent to the following optimization problem:

$$ \min_{\mathbf{K}} \text{tr}(\mathbf{K}\mathbf{R}\mathbf{K}^\top), \quad \text{s.t.} \quad \mathbf{K}\mathbf{H} = \mathbf{I}. $$

Write out the Lagrangian of this equality-constrained optimization problem:

$$ \mathcal{L}(\mathbf{K}, \mathbf{U}) = \text{tr}(\mathbf{K}\mathbf{R}\mathbf{K}^\top) + \text{tr}(\mathbf{U}^\top (\mathbf{K}\mathbf{H} - \mathbf{I})). $$

Taking the derivative of the Lagrangian with respect to $\mathbf{K}$, setting it to zero, and including the equality constraint, we have [2]:

$$ \begin{cases} \frac{\partial \mathcal{L}}{\partial \mathbf{K}} = \mathbf{K}\mathbf{R} + \mathbf{K}\mathbf{R}^\top + \mathbf{U}\mathbf{H}^\top = 0, \\ \mathbf{K}\mathbf{H} - \mathbf{I} = 0. \end{cases} $$

when $\mathbf{R}$ is symmetric, we have:

$$ \begin{cases} 2\mathbf{K}\mathbf{R} + \mathbf{U}\mathbf{H}^\top = 0, \\ \mathbf{K}\mathbf{H} - \mathbf{I} = 0. \end{cases} $$

From the first equation, we have $\mathbf{K}=-\frac{1}{2}\mathbf{U}\mathbf{H}^\top\mathbf{R}^{-1}$. Substituting $\mathbf{K}$ into the second equation, we get:

$$ -\frac{1}{2}\mathbf{U}\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H} = \mathbf{I}. $$

Assuming $\mathbf{H}$ has full column rank, we get

$$ \mathbf{U} = -2(\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H})^{-1}. $$

Plugging $\mathbf{U}$ into the first equation and solving for $\mathbf{K}$, we get

$$ \mathbf{K}^* = (\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H})^{-1}\mathbf{H}^\top\mathbf{R}^{-1}. $$

The Hessian of the objective with respect to $\mathbf{K}$ is $2\mathbf{R}$ (acting on each row of $\mathbf{K}$), which is positive definite. Thus, $\mathbf{K}^*$ is the global minimizer.

The error covariance is

$$ \begin{aligned} \mathbb{E}\left[(\mathbf{x}-\mathbf{K}^*\mathbf{y})(\mathbf{x}-\mathbf{K}^*\mathbf{y})^\top\right] &= \mathbf{K}^*(\mathbb{E}\mathbf{v}\mathbf{v}^\top)\mathbf{K}^{* \top} \\ &= \mathbf{K}^*\mathbf{R}\mathbf{K}^{* \top} \\ &= \left(\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}\right)^{-1} \end{aligned} $$
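The closed form above is easy to sanity-check numerically. The following sketch, using assumed random data, builds $\mathbf{K}^*$, verifies the unbiasedness constraint $\mathbf{K}\mathbf{H}=\mathbf{I}$, and applies it to a simulated measurement.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 3, 8                                   # parameter dim, measurement dim
H = rng.standard_normal((q, n))               # full column rank with probability 1
B = rng.standard_normal((q, q)); R = B @ B.T + q * np.eye(q)

R_inv = np.linalg.inv(R)
P_err = np.linalg.inv(H.T @ R_inv @ H)        # error covariance (H^T R^-1 H)^-1
K_star = P_err @ H.T @ R_inv                  # minimum variance unbiased gain

print(np.allclose(K_star @ H, np.eye(n)))     # unbiasedness constraint KH = I -> True

# Apply it to one simulated measurement; K* y is also the WLS / ML solution.
x_true = rng.standard_normal(n)
y = H @ x_true + rng.multivariate_normal(np.zeros(q), R)
x_hat = K_star @ y
```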

The minimum variance linear unbiased estimator is also the solution to the weighted least squares problem:

$$ \min_{\mathbf{x}} \frac{1}{2}\| \mathbf{y} - \mathbf{H}\mathbf{x} \|^2_{\mathbf{R}^{-1}}, $$

where $\frac{1}{2}\| \mathbf{y} - \mathbf{H}\mathbf{x} \|^2_{\mathbf{R}^{-1}} = \frac{1}{2}(\mathbf{y} - \mathbf{H}\mathbf{x})^\top \mathbf{R}^{-1}(\mathbf{y} - \mathbf{H}\mathbf{x})$.

Also, when the noise is Gaussian, it is the maximum likelihood estimate:

$$ \max_{\mathbf{x}} \,\log p(\mathbf{y}; \mathbf{x}), \quad \text{where} \quad \log p(\mathbf{y}; \mathbf{x}) \propto -\frac{1}{2}(\mathbf{y} - \mathbf{H}\mathbf{x})^\top \mathbf{R}^{-1}(\mathbf{y} - \mathbf{H}\mathbf{x}). $$

Note: in the above derivation, we assume

  1. it is a linear observation model;
  2. the noise has zero mean (with known covariance);
  3. we seek a linear, unbiased estimator;
  4. the observation matrix has full column rank.

Consider the weighted least square problem objective function:

$$ J(\mathbf{x}) = \frac{1}{2}( \mathbf{y} - \mathbf{H}\mathbf{x} )^\top{\mathbf{R}^{-1}}(\mathbf{y} - \mathbf{H}\mathbf{x}). $$

Since $\mathbf{R}$ is the covariance matrix of the noise, it is positive definite, and the minimum can be found by setting the gradient of the objective function to zero:

$$ \nabla_{\mathbf{x}} J(\mathbf{x}) = -\mathbf{H}^\top\mathbf{R}^{-1}(\mathbf{y} - \mathbf{H}\mathbf{x}), $$

which reduces to solving the normal equation:

$$ \mathbf{H}^\top\mathbf{R}^{-1}\mathbf{y} = \mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}\mathbf{x}. $$

The Hessian matrix of $J(\mathbf{x})$ is $\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}$, which is positive definite; its inverse is the covariance matrix of the estimation error.

Since the objective is a positive-definite quadratic form, a single iteration of Newton's method converges to the global minimum irrespective of the initial guess, that is,

$$ \hat{\mathbf{x}} = \mathbf{x} - \nabla^2J(\mathbf{x})^{-1}\nabla J(\mathbf{x}) = \mathbf{x} + (\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H})^{-1}\mathbf{H}^\top\mathbf{R}^{-1}(\mathbf{y} - \mathbf{H}\mathbf{x}) $$

for all $\mathbf{x}$.
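The claim that one Newton iteration suffices can be checked directly. In the sketch below (assumed random problem data), a single Newton step from an arbitrary starting point lands exactly on the normal-equation solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 3, 6
H = rng.standard_normal((q, n))
R = np.diag(rng.uniform(0.5, 2.0, size=q))
y = rng.standard_normal(q)

R_inv = np.linalg.inv(R)
hessian = H.T @ R_inv @ H                     # Hessian of J, independent of x

x0 = rng.standard_normal(n)                   # arbitrary initial guess
grad = -H.T @ R_inv @ (y - H @ x0)            # gradient of J at x0
x_newton = x0 - np.linalg.solve(hessian, grad)

x_wls = np.linalg.solve(hessian, H.T @ R_inv @ y)   # normal-equation solution
print(np.allclose(x_newton, x_wls))           # True, regardless of x0
```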

State Estimation

We assume that the initial state of the system is $\mathbf{x}_0= \boldsymbol{\mu}_0 + \mathbf{w}_0$ for some known $\boldsymbol{\mu}_0\in \mathbb{R}^n$, where $\mathbf{w}_0$ is zero-mean with covariance $\mathbf{Q}_0$. Suppose we have $m$ observations, $\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_m$, and $k$ known inputs, $\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_k$, with $m \le k$. Writing all the equations together, we have:

$$ \begin{aligned} \boldsymbol{\mu}_0 &= \mathbf{x}_0 &&\quad - \mathbf{w}_0, \\ \mathbf{G}_1 \mathbf{u}_1 &= \mathbf{x}_1 - \mathbf{F}_1 \mathbf{x}_0 &&\quad - \mathbf{w}_1, \\ \mathbf{y}_1 &= \mathbf{H}_1 \mathbf{x}_1 &&\quad + \mathbf{v}_1, \\ &\vdots &&\quad \vdots \\ \mathbf{G}_m \mathbf{u}_m &= \mathbf{x}_m - \mathbf{F}_m \mathbf{x}_{m-1} &&\quad - \mathbf{w}_m, \\ \mathbf{y}_m &= \mathbf{H}_m \mathbf{x}_m &&\quad + \mathbf{v}_m, \\ \mathbf{G}_{m+1} \mathbf{u}_{m+1} &= \mathbf{x}_{m+1} - \mathbf{F}_{m+1} \mathbf{x}_m &&\quad - \mathbf{w}_{m+1} \\ &\vdots &&\quad \vdots \\ \mathbf{G}_k \mathbf{u}_k &= \mathbf{x}_k - \mathbf{F}_k \mathbf{x}_{k-1} &&\quad - \mathbf{w}_k. \end{aligned} $$

The above large linear system can be written in a compact form:

$$ \mathbf{b}_{k|m} = \mathbf{A}_{k|m}\mathbf{z}_{k} +\boldsymbol{\varepsilon}_{k|m}, $$

where

$$ \mathbf{z}_k = \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_k \end{bmatrix} \in \mathbb{R}^{n(k+1)}, $$

the observation matrix is

$$ \mathbf{A}_{k|m} = \begin{bmatrix} \mathbf{I} & \mathbf{0} & \mathbf{0} &\cdots & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ -\mathbf{F}_1 & \mathbf{I} & \mathbf{0} &\cdots & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{0} &\mathbf{H}_1 & \mathbf{0} &\cdots & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & -\mathbf{F}_2 & \mathbf{I} &\cdots & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \mathbf{0} & \mathbf{0} & \mathbf{0} &\cdots & -\mathbf{F}_m & \mathbf{I} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} &\cdots & \mathbf{0} & \mathbf{H}_m &\mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} &\cdots & \mathbf{0} & -\mathbf{F}_{m+1} &\mathbf{I}& \mathbf{0} & \cdots &\mathbf{0}&\mathbf{0} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \mathbf{0} & \mathbf{0} & \mathbf{0} &\cdots & \mathbf{0} & \mathbf{0} & \mathbf{0} &\cdots & \cdots & -\mathbf{F}_k & \mathbf{I} \\ \end{bmatrix}, $$

which has full column rank, and $\boldsymbol{\varepsilon}_{k|m}$ is a zero-mean random vector whose inverse covariance (weight) matrix is the block diagonal matrix:

$$ \mathbf{W}_{k|m} = \text{diag}(\mathbf{Q}_0^{-1}, \mathbf{Q}_1^{-1}, \mathbf{R}_1^{-1}, \dots, \mathbf{Q}_m^{-1}, \mathbf{R}_{m}^{-1}, \mathbf{Q}_{m+1}^{-1}, \dots, \mathbf{Q}_{k}^{-1}). $$

Thus, the best linear unbiased estimate $\hat{\mathbf{z}}_{k|m}$ of $\mathbf{z}_k$ is found by solving the normal equations

$$ \mathbf{A}_{k|m}^\top \mathbf{W}_{k|m} \mathbf{A}_{k|m} \hat{\mathbf{z}}_{k|m} = \mathbf{A}_{k|m}^\top \mathbf{W}_{k|m} \mathbf{b}_{k|m}. $$

Since each column of $\mathbf{A}_{k|m}$ is a pivot column, it follows that $\mathbf{A}_{k|m}$ is of full column rank, and thus $\mathbf{A}_{k|m}^\top \mathbf{W}_{k|m} \mathbf{A}_{k|m}$ is nonsingular—indeed, it is positive definite. Hence, the best linear unbiased estimator $\hat{\mathbf{z}}_{k|m}$ of $\mathbf{z}_k$ is given by the weighted least squares solution

$$ \hat{\mathbf{z}}_{k|m} = \left( \mathbf{A}_{k|m}^\top \mathbf{W}_{k|m} \mathbf{A}_{k|m} \right)^{-1} \mathbf{A}_{k|m}^\top \mathbf{W}_{k|m} \mathbf{b}_{k|m}. $$

The weighted least squares solution is the minimizer of the positive-definite objective function

$$ J_{k|m}(\mathbf{z}_k) = \frac{1}{2} \left\| \mathbf{A}_{k|m} \mathbf{z}_k - \mathbf{b}_{k|m} \right\|_{\mathbf{W}_{k|m}}^2, $$

which can also be written as the sum

$$ \begin{aligned} J_{k|m}(\mathbf{z}_k) &=\frac{1}{2}\|\mathbf{x}_0 - \boldsymbol{\mu}_0 \|_{\mathbf{Q}_0^{-1}}^2 + \frac{1}{2} \sum_{i=1}^m \left\| \mathbf{y}_i - \mathbf{H}_i \mathbf{x}_i \right\|_{\mathbf{R}_i^{-1}}^2 \\ &+ \frac{1}{2} \sum_{i=1}^k \left\| \mathbf{x}_i - \mathbf{F}_i \mathbf{x}_{i-1} - \mathbf{G}_i \mathbf{u}_i \right\|_{\mathbf{Q}_i^{-1}}^2. \end{aligned} $$
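To make the stacked formulation concrete, here is a small sketch for $m=k=2$ with assumed toy system matrices (and SciPy's block_diag for the weights). It assembles $\mathbf{A}_{k|m}$, $\mathbf{b}_{k|m}$, and $\mathbf{W}_{k|m}$, solves the normal equations, and checks that the last block of $\hat{\mathbf{z}}_{k|m}$ matches the recursive two-step filter derived next.

```python
import numpy as np
from scipy.linalg import block_diag

n, q, k = 2, 1, 2
F = np.array([[1.0, 1.0], [0.0, 1.0]])        # F_1 = F_2 (assumed)
G = np.array([[0.5], [1.0]])                  # G_1 = G_2
H = np.array([[1.0, 0.0]])                    # H_1 = H_2
Q = 0.05 * np.eye(n); R = 0.25 * np.eye(q); Q0 = np.eye(n)
mu0 = np.zeros(n)
u = [np.array([0.1]), np.array([-0.2])]       # u_1, u_2
y = [np.array([0.9]), np.array([1.7])]        # y_1, y_2

I, Z, Zqn = np.eye(n), np.zeros((n, n)), np.zeros((q, n))
A = np.block([
    [I,    Z,    Z],                          # prior on x_0
    [-F,   I,    Z],                          # transition x_0 -> x_1
    [Zqn,  H,    Zqn],                        # measurement y_1
    [Z,    -F,   I],                          # transition x_1 -> x_2
    [Zqn,  Zqn,  H],                          # measurement y_2
])
b = np.concatenate([mu0, G @ u[0], y[0], G @ u[1], y[1]])
W = block_diag(np.linalg.inv(Q0), np.linalg.inv(Q), np.linalg.inv(R),
               np.linalg.inv(Q), np.linalg.inv(R))

z_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
x2_batch = z_hat[-n:]                         # estimate of x_2 from the batch solve

# Recursive two-step filter reproduces the bottom block of z_hat.
x_hat, P = mu0.copy(), Q0.copy()
for i in range(k):
    x_hat = F @ x_hat + G @ u[i]              # predict
    P = F @ P @ F.T + Q
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # update (gain form)
    x_hat = x_hat + K @ (y[i] - H @ x_hat)
    P = (np.eye(n) - K @ H) @ P
print(np.allclose(x2_batch, x_hat))           # True
```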

Two-Step Kalman Filter

Predict Step

Let us consider the case $m=k$, i.e., the number of observations equals the number of control inputs. Let $\hat{\mathbf{x}}_{k|k}$ denote the updated state after the $k$-th observation, and $\hat{\mathbf{x}}_{k|k-1}$ denote the prediction after applying the $k$-th control input, based on $\hat{\mathbf{x}}_{k-1|k-1}$.

The first step of the Kalman filter, called the predict step, is to determine $\hat{\mathbf{x}}_{k|k-1}$ from $\hat{\mathbf{x}}_{k-1|k-1}$. For notational convenience in what follows, we denote $\hat{\mathbf{z}}_{k|k}$ as $\hat{\mathbf{z}}_k$, $\hat{\mathbf{x}}_{k|k}$ as $\hat{\mathbf{x}}_k$, and $J_{k|k}$ as $J_k$. In addition, we define $\mathcal{F}_k = [0 \ \cdots \ 0 \ \mathbf{F}_k] \in \mathbb{R}^{n \times kn}$, with the entry $\mathbf{F}_k$ lying in the block corresponding to $\mathbf{x}_{k-1}$, so that $\mathcal{F}_k \mathbf{z}_{k-1} = \mathbf{F}_k \mathbf{x}_{k-1}$. Using this notation, we may write $J_{k|k-1}$ recursively as

$$ J_{k|k-1}(\mathbf{z}_k) = J_{k-1}(\mathbf{z}_{k-1}) + \frac{1}{2} \left\| \mathbf{x}_k - \mathcal{F}_k \mathbf{z}_{k-1} - \mathbf{G}_k \mathbf{u}_k \right\|^2_{\mathbf{Q}_k^{-1}}. $$

The gradient and Hessian are given by

$$ \nabla J_{k|k-1}(\mathbf{z}_k) = \begin{bmatrix} \nabla J_{k-1}(\mathbf{z}_{k-1}) + \mathcal{F}_k^T \mathbf{Q}_k^{-1} \left( \mathcal{F}_k \mathbf{z}_{k-1} - \mathbf{x}_k + \mathbf{G}_k \mathbf{u}_k \right) \\ -\mathbf{Q}_k^{-1} \left( \mathcal{F}_k \mathbf{z}_{k-1} - \mathbf{x}_k + \mathbf{G}_k \mathbf{u}_k \right) \end{bmatrix} $$

and

$$ D^2 J_{k|k-1}(\mathbf{z}_k) = \begin{bmatrix} D^2 J_{k-1}(\mathbf{z}_{k-1}) + \mathcal{F}_k^\top\mathbf{Q}_k^{-1} \mathcal{F}_k & -\mathcal{F}_k^T \mathbf{Q}_k^{-1} \\ -\mathbf{Q}_k^{-1} \mathcal{F}_k & \mathbf{Q}_k^{-1} \end{bmatrix}, $$

respectively. Since $D^2 J_{k|k-1}$ is positive definite, a single iteration of Newton’s method yields the minimizer

$$ \hat{\mathbf{z}}_{k|k-1} = \mathbf{z}_k - D^2 J_{k|k-1}(\mathbf{z}_k)^{-1} \nabla J_{k|k-1}(\mathbf{z}_k) $$

for any $\mathbf{z}_k \in \mathbb{R}^{(k+1)n}$. Now, $\nabla J_{k-1}(\hat{\mathbf{z}}_{k-1}) = \mathbf{0}$ and $\mathcal{F}_k \hat{\mathbf{z}}_{k-1} = \mathbf{F}_k \hat{\mathbf{x}}_{k-1}$, so a judicious initial guess is

$$ \mathbf{z}_k = \begin{bmatrix} \hat{\mathbf{z}}_{k-1} \\ \mathbf{F}_k \hat{\mathbf{x}}_{k-1} + \mathbf{G}_k \mathbf{u}_k \end{bmatrix}. $$

With this starting point, $\mathcal{F}_k\hat{\mathbf{z}}_{k-1}=\mathbf{F}_k\hat{\mathbf{x}}_{k-1}$, so that $\mathcal{F}_k\hat{\mathbf{z}}_{k-1} - \mathbf{x}_k + \mathbf{G}_k\mathbf{u}_k=\mathbf{0}$ and the gradient reduces to $\nabla J_{k|k-1}(\mathbf{z}_k) = \mathbf{0}$; hence this initial guess is already the minimizer $\hat{\mathbf{z}}_{k|k-1}$. That is, the optimal estimate of $\mathbf{x}_k$ given the measurements $\mathbf{y}_1, \ldots, \mathbf{y}_{k-1}$ is the bottom block of $\hat{\mathbf{z}}_{k|k-1}$, and its covariance, $\mathbf{P}_{k|k-1}$, is the bottom right block of the inverse Hessian $D^2 J_{k|k-1}^{-1}$. The bottom right block of the inverse Hessian is

$$ \begin{aligned} &\left(\mathbf{Q}_k^{-1} - \mathbf{Q}_{k}^{-1}\mathcal{F}_k\left(D^2J_{k-1} + \mathcal{F}_k^\top\mathbf{Q}_{k}^{-1}\mathcal{F}_k\right)^{-1} \mathcal{F}_k^\top \mathbf{Q}_k^{-1}\right)^{-1} \\ &= \mathbf{Q}_k + \mathcal{F}_k D^2J_{k-1}^{-1}\mathcal{F}_k^\top \\ &=\mathbf{Q}_k + \mathbf{F}_k \mathbf{P}_{k-1}\mathbf{F}_k^\top. \end{aligned} $$

The first equality follows from the Woodbury matrix identity.

So, the predict step of the Kalman filter is:

$$ \begin{aligned} \mathbf{P}_{k|k-1} &= \mathbf{F}_k \mathbf{P}_{k-1} \mathbf{F}_k^\top + \mathbf{Q}_k, \\ \hat{\mathbf{x}}_{k|k-1} &= \mathbf{F}_k \hat{\mathbf{x}}_{k-1} + \mathbf{G}_k \mathbf{u}_k. \end{aligned} $$

Update Step

After measuring the output $\mathbf{y}_k$, we perform the second step, which corrects the prior estimate-covariance pair $(\hat{\mathbf{x}}_{k|k-1}, \mathbf{P}_{k|k-1})$ giving a posterior estimate-covariance pair $(\hat{\mathbf{x}}_k, \mathbf{P}_k)$; this is called the update step. In analogy to $\mathcal{F}_k$, we introduce the notation $\mathcal{H}_k = [0 \ \cdots \ 0 \ \mathbf{H}_k] \in \mathbb{R}^{q \times n(k+1)}$, with the entry $\mathbf{H}_k$ lying in the block corresponding to $\mathbf{x}_k$, so that $\mathcal{H}_k \mathbf{z}_k = \mathbf{H}_k \mathbf{x}_k$. The objective function now becomes

$$ J_k(\mathbf{z}_k) = J_{k|k-1}(\mathbf{z}_k) + \frac{1}{2} \left\| \mathbf{y}_k - \mathcal{H}_k \mathbf{z}_k \right\|^2_{\mathbf{R}_k^{-1}} $$

and the gradient and Hessian are, respectively,

$$ \begin{aligned} \nabla J_k(\mathbf{z}_k) &= \nabla J_{k|k-1}(\mathbf{z}_k) + \mathcal{H}_k^T \mathbf{R}_k^{-1} (\mathcal{H}_k \mathbf{z}_k - \mathbf{y}_k)\\ &= \nabla J_{k|k-1}(\mathbf{z}_k) + \mathcal{H}_k^T \mathbf{R}_k^{-1} (\mathbf{H}_k \mathbf{x}_k - \mathbf{y}_k) \end{aligned} $$

and

$$ D^2 J_k(\mathbf{z}_k) = D^2 J_{k|k-1} + \mathcal{H}_k^T \mathbf{R}_k^{-1} \mathcal{H}_k. $$

The Hessian is clearly positive definite. Again, applying a single iteration of Newton’s method yields the minimizer. If we choose the initial guess $\mathbf{z}_k = \hat{\mathbf{z}}_{k|k-1}$, then, since $\nabla J_{k|k-1}(\hat{\mathbf{z}}_{k|k-1}) = \mathbf{0}$, the gradient becomes

$$ \nabla J_k(\hat{\mathbf{z}}_{k|k-1}) = \mathcal{H}_k^T \mathbf{R}_k^{-1} \left( \mathbf{H}_k \hat{\mathbf{x}}_{k|k-1} - \mathbf{y}_k \right). $$

The estimate $\hat{\mathbf{x}}_k$ of $\mathbf{x}_k$, together with its covariance $\mathbf{P}_k$, is then obtained from the bottom row of the Newton update and the bottom right block of the covariance matrix $D^2 J_k(\mathbf{z}_k)^{-1}$. Since

$$ \begin{aligned} D^2 J_k(\hat{\mathbf{z}}_{k|k-1})^{-1} &= \left( D^2 J_{k|k-1}(\hat{\mathbf{z}}_{k|k-1}) + \mathcal{H}_k^T \mathbf{R}_k^{-1} \mathcal{H}_k \right)^{-1} \\ &= D^2 J_{k|k-1}^{-1} - D^2 J_{k|k-1}^{-1} \mathcal{H}_k^T \left( \mathcal{H}_k D^2 J_{k|k-1}^{-1} \mathcal{H}_k^T + \mathbf{R}_k \right)^{-1} \mathcal{H}_k D^2 J_{k|k-1}^{-1}\\ &= D^2 J_{k|k-1}^{-1} - D^2 J_{k|k-1}^{-1} \mathcal{H}_k^T \left( \mathbf{H}_k \mathbf{P}_{k|k-1} \mathbf{H}_k^T + \mathbf{R}_k \right)^{-1} \mathcal{H}_k D^2 J_{k|k-1}^{-1}, \end{aligned} $$

Then, the lower right block of $D^2 J_k(\hat{\mathbf{z}}_{k|k-1})^{-1}$ is

$$ \mathbf{P}_{k|k-1} - \mathbf{P}_{k|k-1}\mathbf{H}_k^\top(\mathbf{H}_k \mathbf{P}_{k|k-1} \mathbf{H}_k^\top + \mathbf{R}_k)^{-1}\mathbf{H}_k \mathbf{P}_{k|k-1} \\ = \left( \mathbf{P}_{k|k-1}^{-1} + \mathbf{H}_k^\top\mathbf{R}_k^{-1}\mathbf{H}_k \right)^{-1} $$

This equality follows from the Woodbury matrix identity.

So, the update step of the Kalman filter is:

$$ \begin{aligned} \mathbf{P}_k &= \left( \mathbf{P}_{k|k-1}^{-1} + \mathbf{H}_k^\top \mathbf{R}_k^{-1} \mathbf{H}_k \right)^{-1}, \\ \hat{\mathbf{x}}_k &= \hat{\mathbf{x}}_{k|k-1} - \mathbf{P}_k \mathbf{H}_k^\top \mathbf{R}_k^{-1} \left( \mathbf{H}_k \hat{\mathbf{x}}_{k|k-1} - \mathbf{y}_k \right). \end{aligned} $$

Bayesian Estimation

In this section, we will introduce Bayesian estimation. Bayesian estimation has four components: the prior distribution, the likelihood, the posterior distribution, and the loss function.

General Case

From a Bayesian perspective, filtering means quantifying a degree of belief in the state $\mathbf{x}_k$ at time $k$, given all the data up to time $k$ ($\mathbf{Y}_k$), in a recursive (sequential) manner, i.e., constructing the posterior $p(\mathbf{x}_k|\mathbf{Y}_k)$. We do this in two steps:

  1. Prediction: use the state model to predict the belief about the state at time $k$ from $\mathbf{Y}_{k-1}$.
  2. Update: when the measurement $\mathbf{y}_k$ becomes available at time $k$, correct the prediction.

In the prediction step, we have a previous belief $p(\mathbf{x}_{k-1}|\mathbf{Y}_{k-1})$ and we want to know what can be predicted about $\mathbf{x}_k$, i.e., we want to find $p(\mathbf{x}_k|\mathbf{Y}_{k-1})$.

We use the Chapman-Kolmogorov equation:

$$ \begin{aligned} p(\mathbf{x}_k|\mathbf{Y}_{k-1}) &= \int p(\mathbf{x}_k, \mathbf{x}_{k-1}|\mathbf{Y}_{k-1}) d\mathbf{x}_{k-1} \\ &= \int p(\mathbf{x}_k|\mathbf{x}_{k-1}, \mathbf{Y}_{k-1}) p(\mathbf{x}_{k-1}|\mathbf{Y}_{k-1}) d\mathbf{x}_{k-1} \end{aligned} $$

Here we assume the state at time $k$ depends only on the state at time $k-1$ and is independent of the observation history $\mathbf{Y}_{k-1}$ given $\mathbf{x}_{k-1}$ (the Markov property), so that $p(\mathbf{x}_k|\mathbf{x}_{k-1}, \mathbf{Y}_{k-1}) = p(\mathbf{x}_k|\mathbf{x}_{k-1})$, which is given by the state equation.

The update step uses the new measurement $\mathbf{y}_k$ to construct the posterior $p(\mathbf{x}_k|\mathbf{Y}_k)$. The update, or correction, is carried out via Bayes' rule:

$$ \begin{aligned} p(\mathbf{x}_k|\mathbf{Y}_k) &= p(\mathbf{x}_k|\mathbf{y}_k, \mathbf{Y}_{k-1}) \\ &= \frac{p(\mathbf{y}_k|\mathbf{x}_k, \mathbf{Y}_{k-1}) p(\mathbf{x}_k|\mathbf{Y}_{k-1})}{p(\mathbf{y}_k|\mathbf{Y}_{k-1})} \\ &= \frac{p(\mathbf{y}_k|\mathbf{x}_k) p(\mathbf{x}_k|\mathbf{Y}_{k-1})}{p(\mathbf{y}_k|\mathbf{Y}_{k-1})} \end{aligned} $$

Once we have the posterior distribution, we can find the optimal estimate by minimizing the expected loss. Common loss functions include the mean square error (MSE), the mean absolute error (MAE), and the $0-1$ loss. That is:

$$ \begin{aligned} \text{MSE}: &\quad \min_{\hat{\mathbf{x}}} \mathbb{E}_{p(\mathbf{x}|\mathbf{Y}_k)} \| \mathbf{x} - \hat{\mathbf{x}}\|^2, \\ \text{MAE}: &\quad \min_{\hat{\mathbf{x}}} \mathbb{E}_{p(\mathbf{x}|\mathbf{Y}_k)} \| \mathbf{x} - \hat{\mathbf{x}}\|, \\ 0\text{-}1: &\quad \min_{\hat{\mathbf{x}}} \mathbb{E}_{p(\mathbf{x}|\mathbf{Y}_k)} L(\mathbf{x}, \hat{\mathbf{x}}). \end{aligned} $$

where the $0-1$ loss function is defined as:

$$ L(\mathbf{x}, \hat{\mathbf{x}}) = \begin{cases} 0, & \text{if } \|\mathbf{x} - \hat{\mathbf{x}}\| \leq \delta \\ 1, & \text{if } \|\mathbf{x} - \hat{\mathbf{x}}\| > \delta \end{cases} $$

The MSE loss, MAE loss, and $0-1$ loss lead to the mean, median, and mode of the posterior distribution, respectively. In the special case that the posterior distribution is Gaussian, the mean, median, and mode coincide.
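A quick Monte-Carlo sketch can illustrate this correspondence. Below, a skewed Gamma distribution is used as a stand-in "posterior" (an assumption made purely for illustration); the empirical minimizers of the three losses land near its mean, median, and mode.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.gamma(shape=2.0, scale=1.0, size=200_000)   # skewed "posterior"

candidates = np.linspace(0.0, 6.0, 601)
mse = [np.mean((samples - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(samples - c)) for c in candidates]
delta = 0.05
zero_one = [np.mean(np.abs(samples - c) > delta) for c in candidates]

print(candidates[np.argmin(mse)])        # close to the mean (2.0)
print(candidates[np.argmin(mae)])        # close to the median (~1.68)
print(candidates[np.argmin(zero_one)])   # close to the mode (1.0)
```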

The Kalman filter assumes the process noise and measurement noise are Gaussian. Thus, if the initial belief about the state is Gaussian, all subsequent predictive and posterior distributions remain Gaussian. Note that, in the minimum variance linear unbiased estimator, we do not make any assumption on the distribution of the process noise and measurement noise; only the first and second moment information is required.

Kalman Filter as a Bayesian Estimator

Assume the posterior of the state at time $k-1$ is Gaussian, i.e.,

$$ p(\mathbf{x}_{k-1}|\mathbf{Y}_{k-1}) = \mathcal{N}(\hat{\mathbf{x}}_{k-1}, \mathbf{P}_{k-1}) $$

Then, in the prediction step,

$$ p(\mathbf{x}_{k}|\mathbf{Y}_{k-1}) = \mathcal{N}(\hat{\mathbf{x}}_{k|k-1}, \mathbf{P}_{k|k-1}) $$

where

$$ \begin{aligned} \hat{\mathbf{x}}_{k|k-1} &= \mathbf{F}_{k}\hat{\mathbf{x}}_{k-1}, \\ \mathbf{P}_{k|k-1} &= \mathbf{F}_{k}\mathbf{P}_{k-1}\mathbf{F}_{k}^\top + \mathbf{Q}_{k}. \end{aligned} $$

In the update (correction) step, we compute the posterior of the state given all the measurements up to time $k$, i.e., $p(\mathbf{x}_k|\mathbf{Y}_k)$ and

$$ p(\mathbf{x}_k|\mathbf{Y}_k) = \mathcal{N}(\hat{\mathbf{x}}_{k}, \mathbf{P}_{k}) $$

where

$$ \begin{aligned} \mathbf{P}_{k} &= \left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{H}_k + \mathbf{P}_{k|k-1}^{-1}\right)^{-1}, \\ \hat{\mathbf{x}}_{k} &= \hat{\mathbf{x}}_{k|k-1} + \mathbf{P}_k\mathbf{H}_k^\top\mathbf{R}_k^{-1}(\mathbf{y}_k - \mathbf{H}_k\hat{\mathbf{x}}_{k|k-1}). \end{aligned} $$

Proof: See Appendix.

Appendix

Recursive Minimum Variance Unbiased Estimation

Assume we already have observations $\mathbf{Y}$ with observation matrix $\mathbf{H}$. Then the (least squares) estimate is given by:

$$ \mathbf{x}_{old} = [\mathbf{H}^\top\mathbf{H}]^{-1}\mathbf{H}^\top\mathbf{Y}. $$

After this estimation, we obtain a new observation $\mathbf{y}$ with observation matrix $\mathbf{h}$. Combining the previous observations with the current one, we have the new optimal estimate:

$$ \begin{aligned} \mathbf{x}_{new} &= \left[ \begin{bmatrix} \mathbf{H}\\ \mathbf{h} \end{bmatrix}^\top \begin{bmatrix} \mathbf{H}\\ \mathbf{h} \end{bmatrix} \right]^{-1} \begin{bmatrix} \mathbf{H}\\ \mathbf{h} \end{bmatrix}^\top \begin{bmatrix} \mathbf{Y}\\ \mathbf{y} \end{bmatrix} \\ &=\left[\mathbf{H}^\top\mathbf{H} + \mathbf{h}^\top\mathbf{h} \right]^{-1}[\mathbf{H}^\top\mathbf{Y} + \mathbf{h}^\top\mathbf{y}] \end{aligned} $$

Since $\mathbf{x}_{old} = [\mathbf{H}^\top\mathbf{H}]^{-1}\mathbf{H}^\top\mathbf{Y}$, we have

$$ \mathbf{H}^\top\mathbf{Y} = \mathbf{H}^\top\mathbf{H}\mathbf{x}_{old}. $$

Substituting this into the equation of the new estimation, we have

$$ \mathbf{x}_{new} = \left[\mathbf{H}^\top\mathbf{H} + \mathbf{h}^\top\mathbf{h} \right]^{-1}[\mathbf{H}^\top\mathbf{H}\mathbf{x}_{old} + \mathbf{h}^\top\mathbf{y}] $$

We seek the optimal gain matrix $\mathbf{K}$ such that

$$ \mathbf{x}_{new} = \mathbf{x}_{old} + \mathbf{K}(\mathbf{y} - \mathbf{h}\mathbf{x}_{old}). $$

Then, we have

$$ (\mathbf{I} - \mathbf{K}\mathbf{h})\mathbf{x}_{old} + \mathbf{K}\mathbf{y} = \left[\mathbf{H}^\top\mathbf{H} + \mathbf{h}^\top\mathbf{h} \right]^{-1}[\mathbf{H}^\top\mathbf{H}\mathbf{x}_{old} + \mathbf{h}^\top\mathbf{y}] $$

So, we must have

$$ \mathbf{I} - \mathbf{K}\mathbf{h} = \left[\mathbf{H}^\top\mathbf{H} + \mathbf{h}^\top\mathbf{h} \right]^{-1}\mathbf{H}^\top\mathbf{H}, $$

and

$$ \mathbf{K} = \left[\mathbf{H}^\top\mathbf{H} + \mathbf{h}^\top\mathbf{h} \right]^{-1}\mathbf{h}^\top. $$

The above two equations are consistent by observing that

$$ \begin{aligned} &\mathbf{I} - \mathbf{K}\mathbf{h} \\ &=\mathbf{I} - \left[\mathbf{H}^\top\mathbf{H} + \mathbf{h}^\top\mathbf{h} \right]^{-1}\mathbf{h}^\top\mathbf{h} \\ &= \left[\mathbf{H}^\top\mathbf{H} + \mathbf{h}^\top\mathbf{h} \right]^{-1}\mathbf{H}^\top\mathbf{H}. \end{aligned} $$

Denote $\mathbf{P}_{old} = [\mathbf{H}^\top\mathbf{H}]^{-1}$. Then, using the Woodbury matrix identity, we have

$$ \begin{aligned} \mathbf{P}_{new} &= [\mathbf{H}^\top\mathbf{H} + \mathbf{h}^\top\mathbf{h}]^{-1} \\ &= [\mathbf{P}_{old}^{-1} + \mathbf{h}^\top\mathbf{h}]^{-1} \\ &= \mathbf{P}_{old} - \mathbf{P}_{old}\mathbf{h}^\top(\mathbf{I} + \mathbf{h}\mathbf{P}_{old}\mathbf{h}^\top)^{-1}\mathbf{h}\mathbf{P}_{old} \\ \end{aligned} $$

And the gain matrix is

$$ \begin{aligned} \mathbf{K} &= \left(\mathbf{P}_{old} - \mathbf{P}_{old}\mathbf{h}^\top(\mathbf{I} + \mathbf{h}\mathbf{P}_{old}\mathbf{h}^\top)^{-1}\mathbf{h}\mathbf{P}_{old}\right)\mathbf{h}^\top\\ &= \mathbf{P}_{old}\mathbf{h}^\top(\mathbf{I} - (\mathbf{I} + \mathbf{h}\mathbf{P}_{old}\mathbf{h}^\top)^{-1}\mathbf{h}\mathbf{P}_{old}\mathbf{h}^\top )\\ &= \mathbf{P}_{old}\mathbf{h}^\top(\mathbf{I}+\mathbf{h}\mathbf{P}_{old}\mathbf{h}^\top)^{-1}. \end{aligned} $$

So, the new estimation is

$$ \mathbf{x}_{new} = \mathbf{x}_{old} + \mathbf{K}(\mathbf{y} - \mathbf{h}\mathbf{x}_{old}). $$

In summary, we obtain a recursive expression for the estimate.
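The recursion is summarized in the sketch below with assumed random data: a new observation row is folded into an existing least squares estimate, and the result matches refitting on the stacked data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
H = rng.standard_normal((10, n))              # old observation matrix
Y = rng.standard_normal(10)                   # old measurements
h = rng.standard_normal((1, n))               # new observation row
y_new = rng.standard_normal(1)                # new measurement

P_old = np.linalg.inv(H.T @ H)                # "covariance" of the old estimate
x_old = P_old @ H.T @ Y

K = P_old @ h.T @ np.linalg.inv(np.eye(1) + h @ P_old @ h.T)   # gain
x_new = x_old + K @ (y_new - h @ x_old)
P_new = P_old - K @ h @ P_old

# Check against the batch solution on the stacked data.
H_all = np.vstack([H, h]); Y_all = np.concatenate([Y, y_new])
x_batch = np.linalg.solve(H_all.T @ H_all, H_all.T @ Y_all)
print(np.allclose(x_new, x_batch))            # True
```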

Linear Minimum Variance Estimator

In this section, we again consider a linear estimator, but now we assume $\mathbf{x}$ is a random variable rather than a deterministic parameter. As before, the estimator is linear, i.e., $\hat{\mathbf{x}} = \mathbf{K}\mathbf{z}$, where $\mathbf{z} = \mathbf{H}\mathbf{x} + \mathbf{v}$ is the observation. Similar to the previous section, our objective is to find the optimal $\mathbf{K}$ that minimizes the expected error energy:

$$ \min_{\mathbf{K}} \mathbb{E} \| \mathbf{x} - \hat{\mathbf{x}}\|^2, $$

Here, the expectation is taken with respect to the noise $\mathbf{v}$ and random variable $\mathbf{x}$. Similarly, we expand the objective function:

$$ \begin{aligned} \mathbb{E} \| \mathbf{x} - \hat{\mathbf{x}}\|^2 &= \mathbb{E} \| \mathbf{x} - \mathbf{K}{\mathbf{z}}\|^2\\ &= \mathbb{E} \left(\mathbf{x} - \mathbf{K}\mathbf{z}\right)^\top \left(\mathbf{x} - \mathbf{K}\mathbf{z}\right)\\ &= \mathbb{E} \left( \mathbf{x}^\top \mathbf{x} - \mathbf{x}^\top\mathbf{K}\mathbf{z} - \mathbf{z}^\top\mathbf{K}^\top\mathbf{x} + \mathbf{z}^\top\mathbf{K}^\top\mathbf{K}\mathbf{z}\right)\\ &= \mathbb{E}\Big[ \text{tr}\left( \mathbf{x}\mathbf{x}^\top\right) -2 \text{tr}\left(\mathbf{K}\mathbf{z}\mathbf{x}^\top\right) + \text{tr}\left(\mathbf{K}\mathbf{z}\mathbf{z}^\top\mathbf{K}^\top\right) \Big] \end{aligned} $$

Let $\mathcal{E} = \mathbb{E}\|\mathbf{x} - \hat{\mathbf{x}}\|^2$. Taking derivative of $\mathcal{E}$ with respect to $\mathbf{K}$, we have [1]:

$$ \nabla_{\mathbf{K}} \mathcal{E} = 2\mathbf{K}\mathbb{E}\left(\mathbf{z}\mathbf{z}^\top\right)- 2\mathbb{E}\left(\mathbf{x}\mathbf{z}^\top\right) $$

Setting it to zero, we have

$$ \mathbf{K}^* = \mathbb{E}(\mathbf{\mathbf{x}\mathbf{z}^\top})\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}. $$

To get the complete expression of $\mathbf{K}^*$, we need to compute $\mathbb{E}\left(\mathbf{x}\mathbf{z}^\top\right)$ and $\mathbb{E}\left(\mathbf{z}\mathbf{z}^\top\right)$. Substituting $\mathbf{z} = \mathbf{H}\mathbf{x} + \mathbf{v}$, we have:

$$ \begin{aligned} \mathbb{E}(\mathbf{x}\mathbf{z}^\top) &= \mathbb{E}[\mathbf{x}\mathbf{x}^\top\mathbf{H}^\top + \mathbf{x}\mathbf{v}^\top] \\ &=\mathbf{P}\mathbf{H}^\top \end{aligned} $$

where the second equality uses the facts that $\mathbf{x}$ and $\mathbf{v}$ are independent, $\mathbf{v}$ has zero mean, and $\mathbf{P} \triangleq \mathbb{E}(\mathbf{x}\mathbf{x}^\top)$.

The second moment of $\mathbf{z}$ is

$$ \begin{aligned} \mathbb{E}(\mathbf{z}\mathbf{z}^\top) &= \mathbb{E}[\mathbf{H}\mathbf{x}\mathbf{x}^\top\mathbf{H}^\top + \mathbf{H}\mathbf{x}\mathbf{v}^\top + \mathbf{v}\mathbf{x}^\top\mathbf{H}^\top + \mathbf{v}\mathbf{v}^\top] \\ &= \mathbf{H}\mathbf{P}\mathbf{H}^\top + \mathbf{R} \end{aligned} $$

So, the optimal linear estimation matrix is

$$ \mathbf{K}^* = \mathbf{P}\mathbf{H}^\top\left(\mathbf{H}\mathbf{P}\mathbf{H}^\top + \mathbf{R}\right)^{-1}. $$

By direct verification, or by the matrix inversion lemma [2, 3], we have

$$ \left(\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}+\mathbf{P}^{-1}\right)\mathbf{P}\mathbf{H}^\top \left(\mathbf{H}\mathbf{P}\mathbf{H}^\top + \mathbf{R}\right)^{-1} \left(\mathbf{H}\mathbf{P}\mathbf{H}^\top+\mathbf{R}\right) = \\ \left(\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}+\mathbf{P}^{-1}\right)\left(\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}+\mathbf{P}^{-1}\right)^{-1}\mathbf{H}^\top\mathbf{R}^{-1} \left(\mathbf{H}\mathbf{P}\mathbf{H}^\top+\mathbf{R}\right) \\ \Leftrightarrow \\ \left(\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}+\mathbf{P}^{-1}\right)\mathbf{P}\mathbf{H}^\top = \mathbf{H}^\top\mathbf{R}^{-1} \left(\mathbf{H}\mathbf{P}\mathbf{H}^\top+\mathbf{R}\right)\\ \Leftrightarrow \\ \mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}\mathbf{P}\mathbf{H}^\top + \mathbf{H}^\top = \mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}\mathbf{P}\mathbf{H}^\top + \mathbf{H}^\top $$

So, the optimal linear estimation matrix can also be represented as:

$$ \mathbf{K}^* =\left(\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H} + \mathbf{P}^{-1}\right)^{-1}\mathbf{H}^\top\mathbf{R}^{-1}. $$

So, the optimal estimation is

$$ \hat{\mathbf{x}}^* = \left(\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H} + \mathbf{P}^{-1}\right)^{-1}\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{z}. $$

Note that in this derivation we only assume $\mathbf{x}$ is independent of $\mathbf{v}$ and $\mathbf{v}$ is zero-mean white noise. We do not make assumptions on the distributions of $\mathbf{x}$ and $\mathbf{v}$; only their second moment information is used.

If we set $\mathbf{P}^{-1}=\mathbf{0}$, the prior covariance of the state is infinite, meaning we have no prior knowledge about the state. In this case, we recover the minimum variance linear unbiased estimator.
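This limiting behavior can be seen numerically. In the sketch below (assumed random data), as the prior covariance $\mathbf{P}$ is scaled up, the linear minimum variance gain approaches the unbiased estimator $(\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H})^{-1}\mathbf{H}^\top\mathbf{R}^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, q = 3, 6
H = rng.standard_normal((q, n))
R = np.diag(rng.uniform(0.5, 2.0, size=q))
R_inv = np.linalg.inv(R)

K_unbiased = np.linalg.inv(H.T @ R_inv @ H) @ H.T @ R_inv
for scale in (1.0, 1e3, 1e6):
    P = scale * np.eye(n)                     # weaker and weaker prior
    K_bayes = np.linalg.inv(H.T @ R_inv @ H + np.linalg.inv(P)) @ H.T @ R_inv
    print(scale, np.max(np.abs(K_bayes - K_unbiased)))   # difference shrinks toward 0
```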

The Error Covariance

Having the optimal estimate, we are also interested in its error covariance.

$$ \begin{aligned} \mathbb{E}(\mathbf{\hat{\mathbf{x}}}-\mathbf{x})(\mathbf{\hat{\mathbf{x}}}-\mathbf{x})^\top &= \mathbb{E}(\mathbf{K}\mathbf{z}-\mathbf{x})(\mathbf{K}\mathbf{z}-\mathbf{x})^\top \\ &=\mathbf{K}\mathbb{E}(\mathbf{z}\mathbf{z}^\top)\mathbf{K}^\top - \mathbf{K}\mathbb{E}(\mathbf{z}\mathbf{x}^\top) - \mathbb{E}(\mathbf{x}\mathbf{z}^\top)\mathbf{K}^\top + \mathbb{E}(\mathbf{x}\mathbf{x}^\top) \end{aligned} $$

Substituting $\mathbf{K}=\mathbb{E}(\mathbf{x}\mathbf{z}^\top)\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}$ into the above equation, we get:

$$ \begin{aligned} \mathbb{E}(\mathbf{\hat{\mathbf{x}}}-\mathbf{x})(\mathbf{\hat{\mathbf{x}}}-\mathbf{x})^\top &= \mathbb{E}(\mathbf{x}\mathbf{z}^\top)\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}\mathbb{E}(\mathbf{z}\mathbf{z}^\top)\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}\mathbb{E}(\mathbf{z}\mathbf{x}^\top) \\ &\quad -\mathbb{E}(\mathbf{x}\mathbf{z}^\top)\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}\mathbb{E}(\mathbf{z}\mathbf{x}^\top)\\ &\quad -\mathbb{E}(\mathbf{x}\mathbf{z}^\top)\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}\mathbb{E}(\mathbf{z}\mathbf{x}^\top)^\top + \mathbb{E}(\mathbf{x}\mathbf{x}^\top)\\ &= \mathbb{E}(\mathbf{x}\mathbf{x}^\top) - \mathbb{E}(\mathbf{x}\mathbf{z}^\top)\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}\mathbb{E}(\mathbf{z}\mathbf{x}^\top) \end{aligned} $$

Since $\mathbb{E}(\mathbf{z}\mathbf{x}^\top) = \mathbb{E}\left((\mathbf{H}\mathbf{x}+\mathbf{v})\mathbf{x}^\top\right) = \mathbf{H}\mathbf{P}$, the error covariance becomes:

$$ \begin{aligned} \mathbb{E}(\mathbf{\hat{\mathbf{x}}}-\mathbf{x})(\mathbf{\hat{\mathbf{x}}}-\mathbf{x})^\top &= \mathbf{P} - \mathbf{K}\mathbf{H}\mathbf{P} \\ &=(\mathbf{I}-\mathbf{K}\mathbf{H})\mathbf{P} \\ &=\mathbf{P} - (\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}+\mathbf{P}^{-1})^{-1}\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}\mathbf{P} \\ &= (\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}+\mathbf{P}^{-1})^{-1}\left((\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}+\mathbf{P}^{-1})\mathbf{P} - \mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}\mathbf{P}\right) \\ &= (\mathbf{H}^\top\mathbf{R}^{-1}\mathbf{H}+\mathbf{P}^{-1})^{-1} \end{aligned} $$

Theorem: The minimum variance linear estimate of a linear function of the state is equal to the linear function of the minimum variance estimate of the state. In other words given a matrix $\mathbf{T}$, the minimum variance estimate of $\mathbf{T}\mathbf{x}$ is $\mathbf{T}\hat{\mathbf{x}}$.

Proof: Since $\hat{\mathbf{x}} = \mathbb{E}(\mathbf{x}\mathbf{z}^\top)\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}\mathbf{z}$, substituting $\mathbf{T}\mathbf{x}$ for $\mathbf{x}$ in this expression gives the minimum variance estimate of $\mathbf{T}\mathbf{x}$ as $\mathbb{E}(\mathbf{T}\mathbf{x}\mathbf{z}^\top)\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}\mathbf{z}=\mathbf{T}\hat{\mathbf{x}}$.

Theorem: If $\hat{\mathbf{x}}=\mathbf{K}\mathbf{z}$ is the minimum variance estimate of $\mathbf{x}$, then $\hat{\mathbf{x}}$ is also the linear estimate that minimizes $\mathbb{E}(\hat{\mathbf{x}}-\mathbf{x})^\top\mathbf{Q}(\hat{\mathbf{x}}-\mathbf{x})$ for any positive semi-definite matrix $\mathbf{Q}$.

Proof: According to the previous theorem, $\mathbf{Q}^{\frac{1}{2}}\hat{\mathbf{x}}$ is the minimum variance estimate of $\mathbf{Q}^{\frac{1}{2}}\mathbf{x}$, i.e., $\hat{\mathbf{x}}$ is the solution to $\min \mathbb{E}(\mathbf{Q}^{\frac{1}{2}}\mathbf{x}-\mathbf{Q}^{\frac{1}{2}}\hat{\mathbf{x}})^\top(\mathbf{Q}^{\frac{1}{2}}\mathbf{x}-\mathbf{Q}^{\frac{1}{2}}\hat{\mathbf{x}})$.

We can also prove it from scratch where our objective is:

$$ \min_{K} \mathbb{E}(\mathbf{x}-\mathbf{K}\mathbf{z})^\top\mathbf{Q}(\mathbf{x}-\mathbf{K}\mathbf{z}) $$

Let $\mathcal{E} = \mathbb{E}(\mathbf{x}-\mathbf{K}\mathbf{z})^\top\mathbf{Q}(\mathbf{x}-\mathbf{K}\mathbf{z})$, we have:

$$ \begin{aligned} \mathcal{E} &= \mathbb{E}\big[ (\mathbf{x}-\mathbf{K}\mathbf{z})^\top\mathbf{Q}(\mathbf{x}-\mathbf{K}\mathbf{z})\big]\\ &= \mathbb{E}\big[ \mathbf{x}^\top\mathbf{Q}\mathbf{x} - \mathbf{x}^\top\mathbf{Q}\mathbf{K}\mathbf{z} - \mathbf{z}^\top\mathbf{K}^\top\mathbf{Q}\mathbf{x} + \mathbf{z}^\top\mathbf{K}^\top\mathbf{Q}\mathbf{K}\mathbf{z} \big]\\ &= \mathbb{E}\big[\text{tr}(\mathbf{Q}\mathbf{x}\mathbf{x}^\top) -2\text{tr}(\mathbf{Q}\mathbf{K}\mathbf{z}\mathbf{x}^\top)+\text{tr}(\mathbf{K}^\top\mathbf{Q}\mathbf{K}\mathbf{z}\mathbf{z}^\top)\big] \end{aligned} $$

Taking derivative of $\mathcal{E}$ with respect to $\mathbf{K}$, we have

$$ \nabla_{\mathbf{K}} \mathcal{E} = -2\mathbb{E}\big[ \mathbf{Q}\mathbf{x}\mathbf{z}^\top - \mathbf{Q}\mathbf{K}\mathbf{z}\mathbf{z}^\top \big] $$

Setting it to zero, we arrive at

$$ \mathbf{Q}\left(\mathbb{E}(\mathbf{x}\mathbf{z}^\top) - \mathbf{K}\mathbb{E}(\mathbf{z}\mathbf{z}^\top)\right) = \mathbf{0} $$

Assuming $\mathbf{Q}$ is positive definite, we have

$$ \mathbb{E}(\mathbf{x}\mathbf{z}^\top) - \mathbf{K}\mathbb{E}(\mathbf{z}\mathbf{z}^\top) = \mathbf{0} $$

Thus, the optimal estimation matrix is

$$ \mathbf{K}^* = \mathbb{E}(\mathbf{x}\mathbf{z}^\top)\mathbb{E}(\mathbf{z}\mathbf{z}^\top)^{-1}, $$

which is independent of $\mathbf{Q}$. This completes the proof.

Observing the optimality condition $\mathbb{E}\left[(\mathbf{x}-\mathbf{K}^*\mathbf{z})\mathbf{z}^\top\right]=\mathbf{0}$, we conclude that the residual $\mathbf{x}-\mathbf{K}^*\mathbf{z}$ is orthogonal to $\mathbf{z}$ in expectation, i.e., the residual is uncorrelated with the observation. Since the optimal estimate always produces a residual that is orthogonal to the observation, the weight matrix $\mathbf{Q}$ only puts different weights on the components of the residual. So the optimal estimation matrix is independent of $\mathbf{Q}$.

Posterior of the State

The posterior of the state given all the measurements has the following form:

$$ \ln p(\mathbf{x}_k|\mathbf{Y}_k) \propto \ln p(\mathbf{y}_k|\mathbf{x}_k) + \ln p(\mathbf{x}_k|\mathbf{Y}_{k-1}) $$

Since $p(\mathbf{y}_k|\mathbf{x}_k)$ and $p(\mathbf{x}_k|\mathbf{Y}_{k-1})$ are Gaussian, we have $\ln p(\mathbf{y}_k|\mathbf{x}_k) + \ln p(\mathbf{x}_k|\mathbf{Y}_{k-1})$ is a quadratic function of $\mathbf{x}_k$. Thus, the posterior of the state is also Gaussian. So, we only need to focus on the quadratic and linear terms in terms of $\mathbf{x}_k$.

$$ \begin{aligned} -2\ln p(\mathbf{x}_k|\mathbf{Y}_k) &= (\mathbf{y}_k - \mathbf{H}_k\mathbf{x}_k)^\top \mathbf{R}_k^{-1} (\mathbf{y}_k - \mathbf{H}_k\mathbf{x}_k) + (\mathbf{x}_k - \mathbf{\hat{x}}_{k|k-1})^\top \mathbf{P}_{k|k-1}^{-1} (\mathbf{x}_k - \mathbf{\hat{x}}_{k|k-1}) + \text{const}\\ &= \mathbf{x}_k^\top (\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{H}_k + \mathbf{P}_{k|k-1}^{-1}) \mathbf{x}_k - 2\mathbf{x}_k^\top\left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{y}_k + \mathbf{P}_{k|k-1}^{-1}\hat{\mathbf{x}}_{k|k-1}\right) + \text{const}. \end{aligned} $$

So the posterior covariance is

$$ \mathbf{P}_k = \left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{H}_k + \mathbf{P}_{k|k-1}^{-1}\right)^{-1}, $$

and the mean is

$$ \begin{aligned} \hat{\mathbf{x}}_k &= \mathbf{P}_k\left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}\mathbf{y}_k + \mathbf{P}_{k|k-1}^{-1}\hat{\mathbf{x}}_{k|k-1}\right)\\ &= \mathbf{P}_k\left(\mathbf{H}_k^\top \mathbf{R}_k^{-1}(\mathbf{y}_k - \mathbf{H}_k\hat{\mathbf{x}}_{k|k-1}) + \mathbf{P}_{k|k-1}^{-1}\hat{\mathbf{x}}_{k|k-1} + \mathbf{H}_k^\top\mathbf{R}_k^{-1}\mathbf{H}_k\hat{\mathbf{x}}_{k|k-1}\right)\\ &= \mathbf{P}_k(\mathbf{H}_k^\top\mathbf{R}_k^{-1}\mathbf{H}_k + \mathbf{P}_{k|k-1}^{-1})\hat{\mathbf{x}}_{k|k-1} + \mathbf{P}_k \mathbf{H}_k^\top\mathbf{R}_k^{-1}(\mathbf{y}_k - \mathbf{H}_k\hat{\mathbf{x}}_{k|k-1})\\ &= \hat{\mathbf{x}}_{k|k-1} + \mathbf{P}_k \mathbf{H}_k^\top\mathbf{R}_k^{-1}(\mathbf{y}_k - \mathbf{H}_k\hat{\mathbf{x}}_{k|k-1}). \end{aligned} $$

In summary, the posterior of the state given all the measurements is Gaussian, i.e.,

$$ p(\mathbf{x}_k|\mathbf{Y}_k) = \mathcal{N}(\hat{\mathbf{x}}_k, \mathbf{P}_k). $$

Updating the Estimate

The following content is a variation of [1] and is also the method used in [5].

We consider the problem of updating the estimate of $\mathbf{x}$ if additional data becomes available.

First we define the sum of two vector subspaces $\mathcal{Y}_1 + \mathcal{Y}_2$ of a Hilbert space $\mathcal{H}$ as consisting of all vectors in the form of $y_1 + y_2$ where $y_1 \in \mathcal{Y}_1$ and $y_2 \in \mathcal{Y}_2$. We also define the vector space $\mathcal{Y}$ as the direct sum of two vector subspaces $\mathcal{Y} = \mathcal{Y}_1 \oplus \mathcal{Y}_2$ if every vector $y \in \mathcal{Y}$ has a unique representation in the form of $y = y_1 + y_2$ where $y_1 \in \mathcal{Y}_1$ and $y_2 \in \mathcal{Y}_2$.

We know that if $\mathcal{Y}_1$ and $\mathcal{Y}_2$ are two subspaces of a Hilbert space $\mathcal{H}$, then $\mathcal{Y}_1 + \mathcal{Y}_2$ is also a subspace of the space. We also know that if the subspace $\widetilde{\mathcal{Y}}_2$ is chosen such that $\widetilde{\mathcal{Y}}_2 \perp \mathcal{Y}_1$ and $\widetilde{\mathcal{Y}}_2 \oplus \mathcal{Y}_1 = \mathcal{Y}_1 + \mathcal{Y}_2$, then the projection of a vector $\beta \in \mathcal{H}$ onto $\mathcal{Y}_1 + \widetilde{\mathcal{Y}}_2$ is equal to the projection of $\beta$ onto $\mathcal{Y}_1$ plus the projection of $\beta$ onto $\widetilde{\mathcal{Y}}_2$. This is visualized in the following figure.

Figure: the projection onto different subspaces (updating the estimate). [1]

Theorem: Let $\mathbf{x}$ be a random vector and $\hat{\mathbf{x}}_1$ be the minimum variance estimate of $\mathbf{x}$ given the random vector $\mathbf{z}_1$. Just as in the proof of the minimum variance estimator theorem, the elements of $\mathbf{z}_1$ span a subspace $\mathcal{Z}_1 \triangleq$ all linear combinations of the elements of $\mathbf{z}_1$.

Let $\mathbf{z}_2$ be a random vector and the elements of $\mathbf{z}_2$ span a subspace $\mathcal{Z}_2$. Let $\hat{\mathbf{z}}_2$ be the minimum variance estimate of $\mathbf{z}_2$ in $\mathcal{Z}_1$. By the minimum variance estimate theorem, this is equivalent to saying that $\hat{\mathbf{z}}_2$ is the orthogonal projection of the elements of $\mathbf{z}_2$ onto $\mathcal{Z}_1$.

Let $\tilde{\mathbf{z}}_2 = \mathbf{z}_2 - \hat{\mathbf{z}}_2$. Then the minimum variance estimate of $\mathbf{x}$, given both $\mathbf{z}_1$ and $\mathbf{z}_2$, is denoted by $\hat{\mathbf{x}}_2$ and can be found as

$$ \hat{\mathbf{x}}_2 = \hat{\mathbf{x}}_1 + \mathbb{E}[\mathbf{x} \tilde{\mathbf{z}}_2^\top] \left[ \mathbb{E}[\tilde{\mathbf{z}}_2 \tilde{\mathbf{z}}_2^\top] \right]^{-1} \tilde{\mathbf{z}}_2. $$

This is equivalent to saying that $\hat{\mathbf{x}}_2$ is the orthogonal projection of $\mathbf{x}$ onto $\mathcal{Z}_1 + \mathcal{Z}_2$. In other words, $\hat{\mathbf{x}}_2$ is $\hat{\mathbf{x}}_1$ plus the minimum variance estimate of $\mathbf{x}$ given the random vector $\tilde{\mathbf{z}}_2$, i.e., plus the orthogonal projection of $\mathbf{x}$ onto the subspace $\widetilde{\mathcal{Z}}_2$ generated by $\tilde{\mathbf{z}}_2$.

$$ \hat{\mathbf{z}}_2 = \arg\min_{\mathbf{w} \in \mathcal{Z}_1} \mathbb{E} \left\| \mathbf{w} - \mathbf{z}_2 \right\|^2. $$

With this formulation, the statement that "$\hat{\mathbf{z}}_2$ is the orthogonal projection of the elements of $\mathbf{z}_2$ onto $\mathcal{Z}_1$" is clear.

The condition $\mathbf{w} \in \mathcal{Z}_1$ can be written as $\mathbf{w} = \mathbf{K}\mathbf{z}_1$. Writing the new observation as $\mathbf{z}_2 = \mathbf{W}\mathbf{x} + \mathbf{v}$, the optimization above is equivalent to

$$ \underset{\mathbf{K}}{\text{argmin}}\, \mathbb{E} \left\| \mathbf{K}\mathbf{z}_1 - \mathbf{z}_2 \right\|^2 = \underset{\mathbf{K}}{\text{argmin}}\, \mathbb{E} \left\| \mathbf{K}\mathbf{z}_1 - \mathbf{W}\mathbf{x} - \mathbf{v} \right\|^2. $$

Then $\mathbf{K}\mathbf{z}_1$ is the optimal linear estimate of $\mathbf{W}\mathbf{x}+\mathbf{v}$ given $\mathbf{z}_1$. Since $\mathbf{v}$ is zero-mean and uncorrelated with $\mathbf{z}_1$, this estimate is $\mathbf{W}\hat{\mathbf{x}}_1$. So, when we want the projection of $\mathbf{z}_2$ onto $\mathcal{Z}_1$, we can simply plug the optimal estimate of $\mathbf{x}$ into $\mathbf{W}\mathbf{x}$ and get $\hat{\mathbf{z}}_2 = \mathbf{W}\hat{\mathbf{x}}_1$, so that $\tilde{\mathbf{z}}_2 = \mathbf{z}_2 - \mathbf{W}\hat{\mathbf{x}}_1$.

References

[1] H. Masnadi-Shirazi, A. Masnadi-Shirazi, and M.-A. Dastgheib, “A step by step mathematical derivation and tutorial on Kalman filters,” arXiv preprint arXiv:1910.03558, Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.03558

[2] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, version November 15, 2012. [Online]. Available: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

[3] T. Lienart, “Matrix inversion lemmas,” tlienart.github.io, Dec. 13, 2018. [Online]. Available: https://tlienart.github.io/posts/2018/12/13-matrix-inversion-lemmas/

[4] J. Humpherys, P. Redd, and J. West, “A Fresh look at the Kalman filter,” SIAM Rev., vol. 54, no. 4, pp. 801–823, Jan. 2012.

[5] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, Mar. 1960