Diffusion models have emerged as a powerful and flexible class of generative models, underpinning recent breakthroughs in image, audio, and scientific data generation. This article offers a gentle, beginner-friendly overview of diffusion models and the related flow matching framework, demystifying their mathematical foundations and practical training procedures. We will delve into the core concepts, guiding equations, and step-by-step training algorithms, aiming to provide readers with both an intuition for how these models work and a roadmap for implementing them in practice. This article serves as notes on papers [1] and [2].
Generative Models
The original motivation behind diffusion models is to generate high-quality, diverse data. To achieve this, we map the observed data distribution to a simple base distribution, sample from the base, and map the samples back to data space. Thus, the goal of a generative model is to learn a mapping between the data and base distributions (in either direction).
Diffusion Models
Flow Matching
We mainly follow tutorial [1] to introduce flow matching and diffusion models.
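Given a time-dependent vector field $u_t$, a flow is a map $\psi$ that solves the corresponding ODE for every initial condition $x_0$: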
$$ \psi : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d, \quad (x_0, t) \mapsto \psi_t(x_0) \tag{2a} $$$$ \frac{d}{dt} \psi_t(x_0) = u_t\bigl(\psi_t(x_0)\bigr) \tag{2b} $$$$ \psi_0(x_0) = x_0 \tag{2c} $$For a given initial condition $X_0 = x_0$, a trajectory of the ODE is recovered via $X_t = \psi_t(X_0)$. Therefore, vector fields, ODEs, and flows are, intuitively, three descriptions of the same object: vector fields define ODEs whose solutions are flows.
Solutions are flows. As with every equation, we should ask: does a solution exist, and if so, is it unique? Under mild assumptions on $u_t$, the answer is “yes” to both:
Theorem 3 (Flow existence and uniqueness)
If $u : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d$ is continuously differentiable with a bounded derivative, then the ODE in (2) has a unique solution given by a flow $\psi_t$. In this case, $\psi_t$ is a diffeomorphism for all $t$, i.e. $\psi_t$ is continuously differentiable with a continuously differentiable inverse $\psi_t^{-1}$.
Simulating an ODE. In general, it is not possible to compute the flow $\psi_t$ explicitly if $u_t$ is not as simple as a linear function. In these cases, one uses numerical methods to simulate ODEs. Fortunately, this is a classical and well researched topic in numerical analysis, and a myriad of powerful methods exist [11]. One of the simplest and most intuitive methods is the Euler method. In the Euler method, we initialize with $X_0 = x_0$ and update via
$$ X_{t+h} = X_t + h\,u_t(X_t) \qquad (t = 0, h, 2h, 3h, \ldots, 1 - h) \tag{4} $$where $h = n^{-1} > 0$ is a step size hyperparameter with $n \in \mathbb{N}$. For our purposes, the Euler method will be good enough. To give a taste of a more accurate method, consider Heun’s method, defined via the update rule
$$ X'_{t+h} = X_t + h\,u_t(X_t) \quad\text{(initial guess of new state)} $$$$ X_{t+h} = X_t + \frac{h}{2}\bigl(u_t(X_t) + u_{t+h}(X'_{t+h})\bigr) \quad\text{(update with average u at current and guessed state)} $$Intuitively, Heun’s method is as follows: it takes a first guess $X'_{t+h}$ of what the next step could be but corrects the direction initially taken via an updated guess.
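To make the two update rules concrete, here is a minimal NumPy sketch that compares them on a toy problem. The linear vector field $u_t(x) = -x$ is an assumption chosen only because its flow $\psi_t(x_0) = x_0 e^{-t}$ is known in closed form; the function names `euler` and `heun` are illustrative.

```python
import numpy as np

def euler(u, x0, n):
    """Simulate dX/dt = u(x, t) from t=0 to t=1 with n Euler steps, eq. (4)."""
    h = 1.0 / n
    x = np.asarray(x0, dtype=float)
    for k in range(n):
        t = k * h
        x = x + h * u(x, t)
    return x

def heun(u, x0, n):
    """Same simulation with Heun's method: Euler guess, then averaged slope."""
    h = 1.0 / n
    x = np.asarray(x0, dtype=float)
    for k in range(n):
        t = k * h
        guess = x + h * u(x, t)                       # initial guess of new state
        x = x + 0.5 * h * (u(x, t) + u(guess, t + h))  # average of both slopes
    return x

# Toy vector field (assumption): u_t(x) = -x, whose exact flow is x0 * exp(-t).
u = lambda x, t: -x
x0 = 1.0
exact = x0 * np.exp(-1.0)

err_euler = abs(euler(u, x0, 20) - exact)
err_heun = abs(heun(u, x0, 20) - exact)
print(f"Euler error: {err_euler:.2e}, Heun error: {err_heun:.2e}")
```

On this toy problem and with the same number of steps, Heun's method ends up noticeably closer to the exact solution, at the cost of two vector field evaluations per step.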
Flow models. We can now construct a generative model via an ODE. Remember that our goal was to convert a simple distribution $p_{\text{init}}$ into a complex distribution $p_{\text{data}}$. The simulation of an ODE is thus a natural choice for this transformation. A flow model is described by the ODE
$$ X_0 \sim p_{\text{init}} \qquad \triangleright\ \text{random initialization} $$$$ \frac{d}{dt} X_t = u_t^\theta(X_t) \qquad \triangleright\ \text{ODE} $$where the vector field $u_t^\theta$ is a neural network with parameters $\theta$. For now, we treat $u_t^\theta$ as a generic neural network, i.e. a continuous function $u^\theta : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d,\ (x,t) \mapsto u_t^\theta(x)$. Later, we will discuss particular choices of neural network architectures. Our goal is to make the endpoint $X_1$ of the trajectory have distribution $p_{\text{data}}$, i.e.
$$ X_1 \sim p_{\text{data}} \quad \Leftrightarrow \quad \psi_1^\theta(X_0) \sim p_{\text{data}} $$where $\psi_t^\theta$ describes the flow induced by $u_t^\theta$. Note, however, that although it is called a flow model, the neural network parameterizes the vector field, not the flow. In order to compute the flow, we need to simulate the ODE. To sample from a flow model, we therefore draw $X_0 \sim p_{\text{init}}$ and simulate the ODE numerically from $t = 0$ to $t = 1$, e.g. with the Euler method (4); this procedure is summarized as Algorithm 1 in [1].
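The sampling procedure can be sketched in a few lines of NumPy. Since no trained network is available here, the sketch plugs in a closed-form vector field as a stand-in for $u_t^\theta$: it assumes a one-dimensional toy problem with $p_{\text{init}} = \mathcal{N}(0, 1)$ and $p_{\text{data}} = \mathcal{N}(M, S^2)$, for which the vector field that transports one Gaussian into the other along $X_t = t z + (1-t)\epsilon$ is linear in $x$. The names `u_gauss` and `sample_flow` and the values of `M` and `S` are illustrative assumptions.

```python
import numpy as np

M, S = 2.0, 0.5  # assumed toy data distribution p_data = N(M, S^2)

def u_gauss(x, t):
    """Closed-form stand-in for a trained u_theta (Gaussian toy problem only)."""
    var_t = (t * S) ** 2 + (1.0 - t) ** 2   # Var[X_t] for X_t = t z + (1 - t) eps
    cov_t = t * S ** 2 - (1.0 - t)          # Cov[z - eps, X_t]
    return M + cov_t / var_t * (x - t * M)

def sample_flow(u, n_samples, n_steps, rng):
    """Draw X_0 ~ N(0, 1) and simulate the ODE with Euler steps up to t = 1."""
    h = 1.0 / n_steps
    x = rng.standard_normal(n_samples)      # X_0 ~ p_init
    for k in range(n_steps):
        x = x + h * u(x, k * h)             # Euler update, eq. (4)
    return x                                # approximate samples of X_1

rng = np.random.default_rng(0)
x1 = sample_flow(u_gauss, 20000, 100, rng)
print(f"mean of X_1: {x1.mean():.3f}, std of X_1: {x1.std():.3f}")
```

With a real model, `u_gauss` would be replaced by a forward pass of the network; the simulation loop stays the same.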
Diffusion Models
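Diffusion models extend flow models by adding randomness: the ODE is replaced by a stochastic differential equation (SDE) driven by a Brownian motion $W = (W_t)_{0 \le t \le 1}$. The trajectories of a Brownian motion are not differentiable, so we cannot define an SDE through a time derivative as in the ODE (2b). Hence, we need to find an equivalent formulation of ODEs that does not use derivatives. To this end, we rewrite the trajectories $(X_t)_{0 \le t \le 1}$ of an ODE as
$$ \frac{d}{dt} X_t = u_t(X_t) \;\overset{(i)}{\Leftrightarrow}\; \frac{1}{h}\bigl(X_{t+h} - X_t\bigr) = u_t(X_t) + R_t(h) \;\Leftrightarrow\; X_{t+h} = X_t + h\,u_t(X_t) + h\,R_t(h) $$where $R_t(h)$ describes a negligible function for small $h$, i.e. such that $\lim_{h \to 0} R_t(h) = 0$, and in (i) we simply use the definition of derivatives. The derivation above simply restates what we already know: A trajectory $(X_t)_{0 \le t \le 1}$ of an ODE takes, at every timestep, a small step in the direction $u_t(X_t)$. We may now amend the last equation to make it stochastic: A trajectory $(X_t)_{0 \le t \le 1}$ of an SDE takes, at every timestep, a small step in the direction $u_t(X_t)$ plus some contribution from a Brownian motion: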
$$ X_{t+h}= X_t + \underbrace{h\,u_t(X_t)}_{\text{deterministic}} + \underbrace{\sigma_t\bigl(W_{t+h} - W_t\bigr)}_{\text{stochastic}}+ \underbrace{h\,R_t(h)}_{\text{error term}}\tag{6} $$where $\sigma_t \ge 0$ describes the diffusion coefficient and $R_t(h)$ describes a stochastic error term whose standard deviation $\mathbb{E}\bigl[\lVert R_t(h)\rVert^2\bigr]^{1/2}$ goes to zero for $h \to 0$. The above describes a stochastic differential equation (SDE).
It is common to denote a stochastic differential equation (SDE) in the following symbolic notation:
$$ \mathrm{d}X_t = u_t(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t \tag{7a} $$$$ X_0 = x_0 \tag{7b} $$
Theorem 5 (SDE Solution Existence and Uniqueness)
If $u : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d$ is continuously differentiable with a bounded derivative and $\sigma_t$ is continuous, then the SDE in (7) has a solution given by the unique stochastic process $(X_t)_{0 \le t \le 1}$.
Simulating an SDE. If you struggle with the abstract definition of an SDE so far, then don’t worry about it. A more intuitive way of thinking about SDEs is given by answering the question: How might we simulate an SDE? The simplest such scheme is known as the Euler–Maruyama method, and is essentially to SDEs what the Euler method is to ODEs. Using the Euler–Maruyama method, we initialize $X_0 = x_0$ and update iteratively via
$$ X_{t+h} = X_t + h u_t(X_t) + \sqrt{h}\,\sigma_t \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I_d) \tag{9} $$where $h = n^{-1} > 0$ is a step size hyperparameter for $n \in \mathbb{N}$. In other words, to simulate using the Euler–Maruyama method, we take a small step in the direction of $u_t(X_t)$ and add a little Gaussian noise scaled by $\sqrt{h}\,\sigma_t$. Throughout this article, we stick to the Euler–Maruyama method when simulating SDEs.
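As a sanity check of the update rule (9), the following sketch simulates a simple SDE whose solution is known. The choice of an Ornstein–Uhlenbeck process, $u_t(x) = -\theta x$ with constant $\sigma_t = \sigma$ and $X_0 = 0$, is an assumption made for illustration: its variance at time $t$ is $\frac{\sigma^2}{2\theta}(1 - e^{-2\theta t})$, which gives a reference value to compare against. The function name `euler_maruyama` and the parameter values are illustrative.

```python
import numpy as np

def euler_maruyama(u, sigma, x0, n_steps, rng):
    """Simulate dX = u(X, t) dt + sigma(t) dW from t=0 to t=1, eq. (9)."""
    h = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * h
        eps = rng.standard_normal(x.shape)                  # eps_t ~ N(0, I)
        x = x + h * u(x, t) + np.sqrt(h) * sigma(t) * eps   # drift + noise
    return x

# Toy SDE (assumption): Ornstein-Uhlenbeck process started at X_0 = 0.
theta, sig = 1.0, 0.8
rng = np.random.default_rng(0)
x1 = euler_maruyama(lambda x, t: -theta * x,
                    lambda t: sig,
                    np.zeros(50000), 200, rng)

var_exact = sig ** 2 / (2 * theta) * (1 - np.exp(-2 * theta))
print(f"simulated Var[X_1]: {x1.var():.4f}, exact: {var_exact:.4f}")
```

Setting `sig = 0` removes the noise term and recovers the plain Euler method for ODEs.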
Summary 7 (SDE generative model)
Throughout this document, a diffusion model consists of a neural network $u_t^\theta$ with parameters $\theta$ that parameterizes a vector field and a fixed diffusion coefficient $\sigma_t$:
- Neural network: $u^\theta : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d,\ (x,t) \mapsto u_t^\theta(x)$ with parameters $\theta$
- Fixed: $\sigma_t : [0,1] \to [0,\infty),\ t \mapsto \sigma_t$
To obtain samples from our SDE model (i.e. generate objects), the procedure is as follows:
- Initialization: $X_0 \sim p_{\text{init}}$ ▸ Initialize with simple distribution, e.g. a Gaussian
- Simulation: $\mathrm{d}X_t = u_t^\theta(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t$ ▸ Simulate SDE from $0$ to $1$
- Goal: $X_1 \sim p_{\text{data}}$ ▸ Goal is to make $X_1$ have distribution $p_{\text{data}}$
A diffusion model with $\sigma_t = 0$ is a flow model.
Ideally, we would like to learn the marginal vector field $u_t^{\text{target}}$ that transports the noise distribution $p_{\text{init}}$ to the data distribution $p_{\text{data}}$. However, evaluating it directly requires the marginal distribution $p_t$, which is intractable. We therefore use the marginalization trick. A conditional vector field $u_t^{\text{target}}(x \mid z)$ depends only on a single data point $z$, not on the whole data distribution, so we can construct it for each $z$ separately. With a suitable choice of conditional probability path, the conditional vector field has a closed-form expression. The following theorem shows how the marginal vector field is obtained from the conditional ones.
Theorem 10 (Marginalization trick)
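For every data point $z \in \mathbb{R}^d$, let $u_t^{\text{target}}(\cdot \mid z)$ denote a conditional vector field, defined so that the corresponding ODE yields the conditional probability path $p_t(\cdot \mid z)$, viz.,
$$ X_0 \sim p_{\text{init}}, \qquad \frac{d}{dt} X_t = u_t^{\text{target}}(X_t \mid z) \;\;\Rightarrow\;\; X_t \sim p_t(\cdot \mid z) \qquad (0 \le t \le 1). \tag{18} $$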
Then the marginal vector field $u_t^{\text{target}}(x)$, defined by
$$ u_t^{\text{target}}(x) =\int u_t^{\text{target}}(x \mid z) \frac{p_t(x \mid z)\, p_{\text{data}}(z)}{p_t(x)} \, dz, \tag{19} $$follows the marginal probability path, i.e.
$$ X_0 \sim p_{\text{init}}, \qquad \frac{d}{dt} X_t = u_t^{\text{target}}(X_t) \;\;\Rightarrow\;\; X_t \sim p_t \qquad (0 \le t \le 1). \tag{20} $$In particular, $X_1 \sim p_{\text{data}}$ for this ODE, so that we might say “$u_t^{\text{target}}$ converts noise $p_{\text{init}}$ into data $p_{\text{data}}$”.
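For a finite dataset, the integral in (19) becomes a weighted sum over data points and can be evaluated exactly. The sketch below does this under several assumptions that are not part of the theorem: a one-dimensional toy dataset with three points, $p_{\text{data}}$ uniform on these points, and the Gaussian path $p_t(x \mid z) = \mathcal{N}(tz, (1-t)^2)$, for which the conditional vector field is $u_t^{\text{target}}(x \mid z) = (z - x)/(1 - t)$. The name `marginal_field` is illustrative.

```python
import numpy as np

data = np.array([-2.0, 0.0, 3.0])   # assumed toy dataset; p_data uniform on it

def marginal_field(x, t):
    """Eq. (19) as a finite sum, for the path p_t(x | z) = N(t z, (1 - t)^2)."""
    beta = 1.0 - t
    # log p_t(x | z_i), up to a constant that is the same for every z_i
    logw = -0.5 * ((x[:, None] - t * data[None, :]) / beta) ** 2
    logw -= logw.max(axis=1, keepdims=True)        # numerical stabilisation
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)              # p_t(x|z_i) p_data(z_i) / p_t(x)
    cond = (data[None, :] - x[:, None]) / beta     # u_t(x | z_i) = (z_i - x) / (1 - t)
    return (w * cond).sum(axis=1)

rng = np.random.default_rng(0)
n_steps = 200
h = 1.0 / n_steps
x = rng.standard_normal(3000)                      # X_0 ~ p_init = N(0, 1)
for k in range(n_steps):                           # Euler simulation of eq. (20)
    x = x + h * marginal_field(x, k * h)

nearest = np.abs(x[:, None] - data[None, :]).argmin(axis=1)
dist = np.abs(x - data[nearest])
frac_on_data = (dist < 1e-3).mean()
print(f"fraction of samples that land on a data point: {frac_on_data:.3f}")
```

In this toy setting the simulated endpoints collapse onto the three data points, which illustrates the statement $X_1 \sim p_{\text{data}}$. It also hints at why the exact marginal vector field of a finite training set is not what one wants from a trained model in practice: it only reproduces the training data.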
The following theorem relates a vector field to the probability path it generates. Intuitively, the density at a point changes at a rate equal to the net inflow of probability mass carried by the vector field. Consider a flow model with vector field $u_t^{\text{target}}$ and $X_0 \sim p_{\text{init}}$. Then $X_t \sim p_t$ for all $0 \le t \le 1$ if and only if
$$ \partial_t p_t(x) = -\operatorname{div}\big(p_t u_t^{\text{target}}\big)(x) \quad \text{for all } x \in \mathbb{R}^d,\, 0 \le t \le 1, \tag{24} $$
where $\partial_t p_t(x) = \frac{d}{dt} p_t(x)$ denotes the time derivative of $p_t(x)$. Equation (24) is known as the continuity equation. The divergence operator $\operatorname{div}$ is defined, for a vector field $v_t$ with components $v_t^i$, as
$$ \operatorname{div}(v_t)(x) = \sum_{i=1}^d \frac{\partial}{\partial x_i} v_t^i(x) \tag{23} $$We can use the continuity equation to derive the relationship between the marginal vector field and the conditional vector field.
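The continuity equation can be checked numerically on an example where both the probability path and the vector field are known. The sketch below assumes a one-dimensional Gaussian toy problem, $p_{\text{init}} = \mathcal{N}(0,1)$ and $p_{\text{data}} = \mathcal{N}(M, S^2)$ with path $X_t = tz + (1-t)\epsilon$, so that $p_t = \mathcal{N}(tM,\, t^2S^2 + (1-t)^2)$ and the vector field is linear in $x$. Both sides of (24) are approximated with central finite differences; all names and parameter values are illustrative.

```python
import numpy as np

M, S = 2.0, 0.5   # assumed toy data distribution N(M, S^2); p_init = N(0, 1)

def var_t(t):
    """Variance of X_t = t z + (1 - t) eps."""
    return (t * S) ** 2 + (1.0 - t) ** 2

def p(x, t):
    """Marginal density p_t = N(t M, var_t(t))."""
    return np.exp(-0.5 * (x - t * M) ** 2 / var_t(t)) / np.sqrt(2 * np.pi * var_t(t))

def u(x, t):
    """Vector field that transports p_t for this Gaussian toy problem."""
    return M + (t * S ** 2 - (1.0 - t)) / var_t(t) * (x - t * M)

x = np.linspace(-3.0, 5.0, 401)
t, d = 0.4, 1e-4

# Left-hand side of (24): time derivative of the density.
dp_dt = (p(x, t + d) - p(x, t - d)) / (2 * d)

# Right-hand side of (24): in 1D the divergence is the x-derivative of p_t * u_t.
flux = lambda y: p(y, t) * u(y, t)
div_flux = (flux(x + d) - flux(x - d)) / (2 * d)

residual = np.max(np.abs(dp_dt + div_flux))
print(f"max |d/dt p_t + div(p_t u_t)| on the grid: {residual:.2e}")
```

The residual is at the level of the finite-difference error, as expected when the pair $(p_t, u_t)$ satisfies the continuity equation.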
The same probability path can also be followed by an SDE, as the next theorem shows.
Theorem 13 (SDE extension trick)
Define the conditional and marginal vector fields $u_t^{\text{target}}(x \mid z)$ and $u_t^{\text{target}}(x)$ as before. Then, for diffusion coefficient $\sigma_t \ge 0$, we may construct an SDE which follows the same probability path:
$$ X_0 \sim p_{\text{init}}, \qquad \mathrm{d}X_t =\Big[ u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2} \nabla \log p_t(X_t) \Big] \mathrm{d}t + \sigma_t \mathrm{d}W_t \tag{25} $$$$ \Rightarrow\quad X_t \sim p_t \qquad (0 \le t \le 1) \tag{26} $$In particular, $X_1 \sim p_{\text{data}}$ for this SDE. The same identity holds if we replace the marginal probability $p_t(x)$ and vector field $u_t^{\text{target}}(x)$ with the conditional probability path $p_t(x \mid z)$ and vector field $u_t^{\text{target}}(x \mid z)$.
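The SDE (25) can be tried out on an example where the vector field and the score are both available in closed form. The sketch assumes the same kind of one-dimensional Gaussian toy problem as a stand-in for real data: $p_{\text{init}} = \mathcal{N}(0,1)$, $p_{\text{data}} = \mathcal{N}(M, S^2)$ and $p_t = \mathcal{N}(tM,\, t^2S^2 + (1-t)^2)$, with a constant diffusion coefficient. Names and parameter values are illustrative.

```python
import numpy as np

M, S = 2.0, 0.5   # assumed toy data distribution N(M, S^2); p_init = N(0, 1)

def var_t(t):
    return (t * S) ** 2 + (1.0 - t) ** 2

def u(x, t):
    """Marginal vector field of the Gaussian toy problem."""
    return M + (t * S ** 2 - (1.0 - t)) / var_t(t) * (x - t * M)

def score(x, t):
    """Marginal score: gradient of log N(x; t M, var_t(t)) with respect to x."""
    return -(x - t * M) / var_t(t)

def sample_sde(sigma, n_samples, n_steps, rng):
    """Euler-Maruyama simulation of eq. (25) with constant sigma_t = sigma."""
    h = 1.0 / n_steps
    x = rng.standard_normal(n_samples)
    for k in range(n_steps):
        t = k * h
        drift = u(x, t) + 0.5 * sigma ** 2 * score(x, t)
        x = x + h * drift + np.sqrt(h) * sigma * rng.standard_normal(n_samples)
    return x

rng = np.random.default_rng(0)
stats = {}
for sigma in (0.0, 0.5, 1.0):
    x1 = sample_sde(sigma, 20000, 200, rng)
    stats[sigma] = (x1.mean(), x1.std())
    print(f"sigma={sigma}: mean={stats[sigma][0]:.3f}, std={stats[sigma][1]:.3f}")
```

For every choice of `sigma` the endpoint statistics stay close to those of $p_{\text{data}}$, up to discretization and sampling error; the score term compensates for the injected noise.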
Theorem 15 (Fokker–Planck Equation)
Let $p_t$ be a probability path and let us consider the SDE
$$ X_0 \sim p_{\text{init}}, \qquad \mathrm{d}X_t = u_t(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t. $$Then $X_t$ has distribution $p_t$ for all $0 \le t \le 1$ if and only if the Fokker–Planck equation holds:
$$ \partial_t p_t(x) = -\operatorname{div}\big(p_t u_t\big)(x) + \frac{\sigma_t^2}{2}\,\Delta p_t(x) \quad \text{for all } x \in \mathbb{R}^d,\; 0 \le t \le 1. \tag{30} $$
Summary 17 (Derivation of the Training Target)
The flow training target is the marginal vector field $u_t^{\text{target}}$. To construct it, we choose a conditional probability path $p_t(x\mid z)$ that satisfies $p_0(\cdot \mid z) = p_{\text{init}},\ p_1(\cdot \mid z) = \delta_z$. Next, we find a conditional vector field $u_t^{\text{target}}(x\mid z)$ such that its corresponding flow $\psi_t^{\text{target}}(x\mid z)$ satisfies
$$ X_0 \sim p_{\text{init}} \;\Rightarrow\; X_t = \psi_t^{\text{target}}(X_0\mid z) \sim p_t(\cdot \mid z), $$or, equivalently, that $u_t^{\text{target}}(\cdot \mid z)$ and $p_t(\cdot \mid z)$ satisfy the continuity equation. Then the marginal vector field defined by
$$ u_t^{\text{target}}(x) =\int u_t^{\text{target}}(x\mid z) \frac{p_t(x\mid z)p_{\text{data}}(z)}{p_t(x)}\,\mathrm{d}z, \tag{32} $$follows the marginal probability path, i.e.,
$$ X_0 \sim p_{\text{init}}, \quad \mathrm{d}X_t = u_t^{\text{target}}(X_t)\,\mathrm{d}t \;\Rightarrow\; X_t \sim p_t \quad (0 \le t \le 1). \tag{33} $$In particular, $X_1 \sim p_{\text{data}}$ for this ODE, so that $u_t^{\text{target}}$ “converts noise into data”, as desired.
Extending to SDEs. For a time-dependent diffusion coefficient $\sigma_t \ge 0$, we can extend the above ODE to an SDE with the same marginal probability path:
$$ X_0 \sim p_{\text{init}}, \quad \mathrm{d}X_t = \Big[ u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\nabla \log p_t(X_t) \Big] \mathrm{d}t + \sigma_t \mathrm{d}W_t \tag{34} $$$$ \Rightarrow\; X_t \sim p_t \quad (0 \le t \le 1), \tag{35} $$where $\nabla \log p_t(x)$ is the marginal score function
$$ \nabla \log p_t(x) =\int \nabla \log p_t(x\mid z) \frac{p_t(x\mid z)p_{\text{data}}(z)}{p_t(x)}\,\mathrm{d}z. \tag{36} $$
Conditional and Marginal Score Functions
In particular, for the trajectories $X_t$ of the above SDE, it holds that $X_1 \sim p_{\text{data}}$, so that the SDE “converts noise into data”, as desired. An important example is the Gaussian probability path, yielding the formulae:
$$ p_t(x\mid z) =\mathcal{N}(x; \alpha_t z, \beta_t^2 I_d) \tag{37} $$$$ u_t^{\text{target}}(x\mid z) =\left(\dot{\alpha}_t- \frac{\dot{\beta}_t}{\beta_t}\alpha_t \right) z + \frac{\dot{\beta}_t}{\beta_t} x \tag{38} $$$$ \nabla \log p_t(x\mid z) =- \frac{x - \alpha_t z}{\beta_t^2}, \tag{39} $$for noise schedulers $\alpha_t, \beta_t \in \mathbb{R}$: continuously differentiable, monotonic functions such that $\alpha_0 = \beta_1 = 0$, $\alpha_1 = \beta_0 = 1$.
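The formulae (37) to (39) are easy to verify numerically. The sketch below assumes the schedulers $\alpha_t = t$ and $\beta_t = 1 - t$ (one valid choice satisfying the boundary conditions) and checks two identities at a sample $x_t = \alpha_t z + \beta_t \epsilon$: the conditional vector field equals the velocity $\dot\alpha_t z + \dot\beta_t \epsilon$ of the path, and the conditional score equals $-\epsilon / \beta_t$. Function names are illustrative.

```python
import numpy as np

# Assumed noise schedulers: alpha_t = t, beta_t = 1 - t, and their derivatives.
alpha = lambda t: t
beta = lambda t: 1.0 - t
d_alpha = lambda t: 1.0
d_beta = lambda t: -1.0

def cond_vector_field(x, z, t):
    """Conditional vector field of the Gaussian path, eq. (38)."""
    r = d_beta(t) / beta(t)
    return (d_alpha(t) - r * alpha(t)) * z + r * x

def cond_score(x, z, t):
    """Conditional score of the Gaussian path, eq. (39)."""
    return -(x - alpha(t) * z) / beta(t) ** 2

rng = np.random.default_rng(0)
z = rng.standard_normal(5)               # stand-in "data points"
eps = rng.standard_normal(5)             # noise
t = 0.3
x_t = alpha(t) * z + beta(t) * eps       # sample from p_t(. | z), eq. (37)

# Velocity of the path t -> alpha_t z + beta_t eps at time t.
velocity = d_alpha(t) * z + d_beta(t) * eps

field_ok = np.allclose(cond_vector_field(x_t, z, t), velocity)
score_ok = np.allclose(cond_score(x_t, z, t), -eps / beta(t))
print(field_ok, score_ok)   # True True
```

With these schedulers the velocity reduces to $z - \epsilon$, which is the regression target that appears in flow matching training for this path.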
Training the Generative Model
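Since the marginal vector field $u_t^{\text{target}}(x)$ is intractable, we cannot regress onto it directly. We therefore compare the flow matching loss with the conditional flow matching loss, which only involves the tractable conditional vector field:
$$ \mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t \sim \text{Unif}[0,1],\, x \sim p_t}\big\lVert u_t^\theta(x) - u_t^{\text{target}}(x) \big\rVert^2, $$$$ \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t \sim \text{Unif}[0,1],\, z \sim p_{\text{data}},\, x \sim p_t(\cdot\mid z)}\big\lVert u_t^\theta(x) - u_t^{\text{target}}(x\mid z) \big\rVert^2. $$The two losses differ only by a constant $C$ that is independent of $\theta$:
$$ \mathcal{L}_{\text{FM}}(\theta) = \mathcal{L}_{\text{CFM}}(\theta) + C, $$$$ \nabla_\theta \mathcal{L}_{\text{FM}}(\theta) = \nabla_\theta \mathcal{L}_{\text{CFM}}(\theta). $$Hence, minimizing $\mathcal{L}_{\text{CFM}}(\theta)$ with e.g. stochastic gradient descent (SGD) is equivalent to minimizing $\mathcal{L}_{\text{FM}}(\theta)$ in the same fashion. In particular, for the minimizer $\theta^*$ of $\mathcal{L}_{\text{CFM}}(\theta)$, it will hold that $u_t^{\theta^*} = u_t^{\text{target}}$ (assuming an infinitely expressive parameterization).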
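Algorithm 3 Flow Matching Training Procedure (here for Gaussian CondOT path $p_t(x\mid z) = \mathcal{N}(tz, (1 - t)^2 I_d)$)
Require: A dataset of samples $z \sim p_{\text{data}}$, neural network $u_t^\theta$
1: for each mini-batch of data do
2: Sample a data example $z$ from the dataset.
3: Sample a random time $t \sim \text{Unif}_{[0,1]}$.
4: Sample noise $\epsilon \sim \mathcal{N}(0, I_d)$
5: Set $x = tz + (1 - t)\epsilon$ (General case: $x \sim p_t(\cdot \mid z)$)
6: Compute loss
$$\mathcal{L}(\theta) = \lVert u_t^\theta(x) - (z - \epsilon) \rVert^2$$
(General case: $= \lVert u_t^\theta(x) - u_t^{\text{target}}(x\mid z) \rVert^2$)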
7: Update the model parameters $\theta$ via gradient descent on $\mathcal{L}(\theta)$.
8: end for
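The training loop for the CondOT path can be sketched without a deep learning framework if the neural network is replaced by a much simpler model. The sketch below makes that substitution explicitly: `features` defines a hypothetical model that is linear in its parameters, $u_t^\theta(x) = \theta^\top \phi(x, t)$, so that the gradient of the squared loss can be written by hand. The data distribution $\mathcal{N}(M, S^2)$, the learning rate, and the batch size are likewise assumptions chosen for illustration.

```python
import numpy as np

M, S = 2.0, 0.5                          # assumed toy data distribution N(M, S^2)
rng = np.random.default_rng(0)

def features(x, t):
    """Hypothetical stand-in for a network: u_theta(x, t) = features(x, t) @ theta."""
    return np.stack([np.ones_like(x), x, t, x * t], axis=1)

def sample_batch(n):
    z = M + S * rng.standard_normal(n)   # sample data examples z
    t = rng.uniform(0.0, 1.0, n)         # sample random times t ~ Unif[0, 1]
    eps = rng.standard_normal(n)         # sample noise eps ~ N(0, 1)
    x = t * z + (1.0 - t) * eps          # x ~ p_t(. | z) for the CondOT path
    return x, t, z - eps                 # regression target u_t(x | z) = z - eps

def cfm_loss(theta, n=50000):
    """Monte Carlo estimate of the conditional flow matching loss."""
    x, t, target = sample_batch(n)
    return np.mean((features(x, t) @ theta - target) ** 2)

theta = np.zeros(4)
loss_before = cfm_loss(theta)

lr = 0.02
for step in range(3000):
    x, t, target = sample_batch(256)
    phi = features(x, t)
    residual = phi @ theta - target          # batch loss is mean(residual ** 2)
    grad = 2.0 * phi.T @ residual / len(x)   # gradient of the batch loss
    theta -= lr * grad                       # gradient descent update

loss_after = cfm_loss(theta)
print(f"CFM loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The loss does not go to zero even for a perfect model, because the target $z - \epsilon$ is random given $(x, t)$; this irreducible part corresponds to the constant $C$ that separates the conditional from the marginal flow matching loss.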
Algorithm 4 Score Matching Training Procedure for Gaussian probability path
Require: A dataset of samples $z \sim p_{\text{data}}$, score network $s_t^\theta$ or noise predictor $\epsilon_t^\theta$
1: for each mini-batch of data do
2: Sample a data example $z$ from the dataset.
3: Sample a random time $t \sim \text{Unif}_{[0,1]}$.
4: Sample noise $\epsilon \sim \mathcal{N}(0, I_d)$
5: Set $x_t = \alpha_t z + \beta_t \epsilon$ (General case: $x_t \sim p_t(\cdot \mid z)$)
6: Compute loss
$$ \mathcal{L}(\theta) = \Big\lVert s_t^\theta(x_t) + \frac{\epsilon}{\beta_t} \Big\rVert^2 $$
Alternatively:
$$ \mathcal{L}(\theta) = \lVert \epsilon_t^\theta(x_t) - \epsilon \rVert^2 $$
7: Update the model parameters $\theta$ via gradient descent on $\mathcal{L}(\theta)$.
8: end for
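For the Gaussian path, evaluating the conditional score (39) at $x_t = \alpha_t z + \beta_t \epsilon$ gives $\nabla \log p_t(x_t \mid z) = -\epsilon / \beta_t$, so regressing a score network onto the conditional score and regressing a noise predictor onto $\epsilon$ differ only by the factor $1/\beta_t^2$ when the two are related by $s_t^\theta = -\epsilon_t^\theta / \beta_t$. The sketch below checks this numerically; the schedulers $\alpha_t = t$, $\beta_t = 1 - t$ and the "imperfect prediction" `eps_pred` are assumptions made for illustration.

```python
import numpy as np

alpha = lambda t: t            # assumed noise schedulers
beta = lambda t: 1.0 - t

def score_matching_loss(s_pred, eps, t):
    """|| s_pred - grad log p_t(x_t | z) ||^2 with grad log p_t(x_t | z) = -eps / beta_t."""
    return np.sum((s_pred + eps / beta(t)) ** 2, axis=-1)

def noise_prediction_loss(eps_pred, eps):
    """|| eps_pred - eps ||^2, the alternative loss in the algorithm."""
    return np.sum((eps_pred - eps) ** 2, axis=-1)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 2))            # stand-in batch of data points
eps = rng.standard_normal((4, 2))          # noise
t = 0.6
x_t = alpha(t) * z + beta(t) * eps         # x_t = alpha_t z + beta_t eps

# The conditional score at x_t is exactly -eps / beta_t.
cond_score = -(x_t - alpha(t) * z) / beta(t) ** 2
score_identity = np.allclose(cond_score, -eps / beta(t))

# A hypothetical, slightly wrong noise prediction and the matching score prediction.
eps_pred = eps + 0.1
s_pred = -eps_pred / beta(t)

lhs = score_matching_loss(s_pred, eps, t)
rhs = noise_prediction_loss(eps_pred, eps) / beta(t) ** 2
losses_agree = np.allclose(lhs, rhs)
print(score_identity, losses_agree)   # True True
```

The factor $1/\beta_t^2$ grows without bound as $\beta_t \to 0$, which is one way to see why the noise-prediction form of the loss is numerically more convenient near the data end of the path.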
Building an Image Generator
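With a guiding variable $y$ (e.g. a class label) and a guidance scale $w$, the guided vector field is a weighted combination of the unguided and guided vector fields. For Gaussian probability paths, the vector field can be written in terms of the score as $u_t^{\text{target}}(x|y) = a_t x + b_t \nabla \log p_t(x|y)$ (and likewise without $y$), where $a_t, b_t$ are scalar coefficients determined by the noise schedulers $\alpha_t, \beta_t$. Scaling the classifier term $\nabla \log p_t(y|x)$ by $w$ and applying Bayes' rule, $\nabla \log p_t(y|x) = \nabla \log p_t(x|y) - \nabla \log p_t(x)$, gives
$$ \begin{aligned} \tilde{u}_t(x|y) &= u_t^{\text{target}}(x) + w b_t \nabla \log p_t(y|x) \\ &= u_t^{\text{target}}(x) + w b_t \big(\nabla \log p_t(x|y) - \nabla \log p_t(x)\big) \\ &= u_t^{\text{target}}(x) - \big(w a_t x + w b_t \nabla \log p_t(x)\big) + \big(w a_t x + w b_t \nabla \log p_t(x|y)\big) \\ &= (1 - w)\, u_t^{\text{target}}(x) + w\, u_t^{\text{target}}(x|y) \end{aligned} $$
Summary 27 (Classifier-Free Guidance for Flow Models)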
$$ \tilde{u}_t(x|y) = (1-w)u_t^{\text{target}}(x|\varnothing) + w u_t^{\text{target}}(x|y). \tag{70} $$$$ \mathcal{L}^{\text{CFG}}_{\text{CFM}}(\theta) = \mathbb{E}_{\square}\big\lVert u_t^{\theta}(x|y) - u_t^{\text{target}}(x|z) \big\rVert^2 \tag{71} $$$$ \square = (z,y) \sim p_{\text{data}}(z,y),\quad t \sim \text{Unif}[0,1],\quad x \sim p_t(\cdot|z),\ \text{replace } y = \varnothing \text{ with prob. } \eta. \tag{72} $$In plain English, $\mathcal{L}^{\text{CFG}}_{\text{CFM}}$ might be approximated by
- $(z,y) \sim p_{\text{data}}(z,y)$ — Sample $(z,y)$ from data distribution.
- $t \sim \text{Unif}[0,1]$ — Sample $t$ uniformly on $[0,1]$.
- $x \sim p_t(x|z)$ — Sample $x$ from the conditional probability path $p_t(x|z)$.
- with prob. $\eta$, $y \leftarrow \varnothing$ — Replace $y$ with $\varnothing$ with probability $\eta$.
At inference time, for a fixed choice of $y$, we may sample via
- Initialization: $X_0 \sim p_{\text{init}}(x)$ — Initialize with simple distribution (such as a Gaussian).
- Simulation: $\mathrm{d}X_t = \tilde{u}_t^{\theta}(X_t|y)\,\mathrm{d}t$ — Simulate ODE from $t=0$ to $t=1$.
- Samples: $X_1$ — Goal is for $X_1$ to adhere to the guiding variable $y$.
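The inference procedure can be sketched as follows. In place of a trained network, the sketch uses a closed-form stand-in `u_model` built on strong assumptions: a one-dimensional problem with two classes whose data are Gaussian, $\mathcal{N}(m_y, S^2)$, and a crude Gaussian approximation for the unconditional case $y = \varnothing$ (represented by `None`). Only `guided_field`, which implements (70), and the Euler loop reflect the procedure itself; everything else is illustrative.

```python
import numpy as np

CLASS_MEANS = {0: -2.0, 1: 3.0}   # assumed toy setup: p_data(z | y) = N(mean, S^2)
S = 0.5

def u_model(x, t, y):
    """Stand-in for a trained u_theta(x | y); y=None plays the role of the null label."""
    if y is None:
        m, s = 0.5, 2.5           # crude Gaussian stand-in for the unconditional data
    else:
        m, s = CLASS_MEANS[y], S
    var = (t * s) ** 2 + (1.0 - t) ** 2
    return m + (t * s ** 2 - (1.0 - t)) / var * (x - t * m)

def guided_field(x, t, y, w):
    """Classifier-free guidance, eq. (70): mix unconditional and conditional fields."""
    return (1.0 - w) * u_model(x, t, None) + w * u_model(x, t, y)

def sample(y, w, n_samples=5000, n_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    h = 1.0 / n_steps
    x = rng.standard_normal(n_samples)            # X_0 ~ p_init
    for k in range(n_steps):
        x = x + h * guided_field(x, k * h, y, w)  # Euler step on the guided ODE
    return x

x_plain = sample(y=1, w=1.0)    # w = 1 uses the conditional field only
x_guided = sample(y=1, w=2.0)   # w > 1 extrapolates away from the unconditional field
print(f"w=1: mean={x_plain.mean():.3f}, std={x_plain.std():.3f}")
print(f"w=2: mean={x_guided.mean():.3f}, std={x_guided.std():.3f}")
```

With $w = 1$ the guided field reduces to the conditional one and the samples follow the class-conditional toy distribution; for $w > 1$ the samples no longer follow that distribution exactly, which is the intended trade-off of guidance.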
Back to Basics: Let Denoising Generative Models Denoise
This paper [2] is simple yet inspiring. The toy example in its Fig. 2 clearly shows why it is preferable to predict the original data rather than the noise, and its Table 1 summarizes how the different prediction objectives are related.

References
[1] P. Holderrieth and E. Erives, “An Introduction to Flow Matching and Diffusion Models,” July 12, 2025, arXiv:2506.02070. doi: 10.48550/arXiv.2506.02070.
[2] T. Li and K. He, “Back to Basics: Let Denoising Generative Models Denoise,” Nov. 17, 2025, arXiv:2511.13720. doi: 10.48550/arXiv.2511.13720.