This post provides a technical deep dive into the Wayformer paper [1], a key publication in the field of motion forecasting.

Training Overview

Wayformer training pipeline
An overview of the deep learning training pipeline, illustrating the data flow and key components involved during model training.

Model

Overview of the One-Stage E2E Model

One-stage E2E model
One-stage E2E model.

Overview of the Two-Stage E2E Model

Two-stage E2E model
Two-stage E2E model.

Details of the Two-Stage E2E Model

Overview of the Wayformer model
Overview of the Wayformer model.

Model Structure Overview

Wayformer Figure 1
(a)
Wayformer Figure 2
(b)
The left figure shows the encoder and decoder of the Wayformer model. The right figure shows the details of the encoder [1].

Feature Embedding / Feature Projection

$$\mathbf{f}\in \mathbb{R}^{T \times N\times D} \to \mathbf{x}_{input} \in \mathbb{R}^{(T \cdot N) \times d}$$

Where $T$ is the number of history time steps, $N$ is the number of entities, $D$ is the number of raw features, and $d=256$ is the model width.

  1. Feature projection
$$\mathbf{x}_{in} = \mathbf{f} \mathbf{W}$$

Where $\mathbf{W} \in \mathbb{R}^{D\times d}$ and $\mathbf{x}_{in} \in \mathbb{R}^{T\times N \times d}$

  2. Add time and position embeddings
$$\mathbf{x}_{input} = \mathbf{x}_{in} + \mathbf{p}_t + \mathbf{p}_s$$

Where the time embedding $\mathbf{p}_t \in \mathbb{R}^{T \times 1 \times d}$ is broadcast across the agent dimension, and the position embedding $\mathbf{p}_s \in \mathbb{R}^{1 \times N \times d}$ is broadcast across the time dimension.

  3. Spatio-Temporal Feature Flattening
$$\mathbb{R}^{T\times N \times d} \to \mathbb{R}^{(T\cdot N) \times d}$$
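
As a concrete illustration, here is a minimal PyTorch sketch of the three steps (projection, embedding, flattening). The shapes follow the text above; the zero-initialized embeddings and the bias in the linear layer are implementation assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

T, N, D, d = 11, 32, 28, 256                   # history steps, entities, raw features, model width

proj = nn.Linear(D, d)                         # implements x_in = f W (plus a bias)
p_t = nn.Parameter(torch.zeros(T, 1, d))       # time embedding, broadcast over agents
p_s = nn.Parameter(torch.zeros(1, N, d))       # position/agent embedding, broadcast over time

f = torch.randn(T, N, D)                       # raw input features
x_in = proj(f)                                 # (T, N, d)
x_input = x_in + p_t + p_s                     # broadcasting adds both embeddings
x_input = x_input.reshape(T * N, d)            # spatio-temporal flattening -> (T*N, d)
```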

Agent and ego

$$\mathbb{R}^{11\times 32 \times 28}$$

Features:

  1. Positions (3): x, y, z
  2. Dimensions (3): Length, Width, Height
  3. Object type (3): One-hot encoding of type (e.g., car, pedestrian, cyclist)
  4. is_tracked (1): A flag indicating if the object is tracked or predicted
  5. Ego car indicator (1)
  6. Time embedding (11): e.g., [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] indicates the state is from one step before the last time step.
  7. Exact time (1)
  8. Heading (2): cos(yaw) and sin(yaw)
  9. Velocity (2): $v_x$, $v_y$
  10. Validity (1)

In the implementation, the dimensions are $T = 11$ (10 history + 1 current), $N = 32$ (31 agents + 1 ego), and $D=28$.
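
For illustration, a hypothetical sketch of how one 28-dimensional per-agent feature vector could be assembled from the components listed above (the field names and the dictionary layout are invented for this example, not taken from the paper or any released code):

```python
import numpy as np

def agent_features(state, t_index, T=11):
    """Concatenate the listed components into one D=28 vector (illustrative only)."""
    time_onehot = np.zeros(T)
    time_onehot[t_index] = 1.0                          # e.g. index T-2 = one step before last
    return np.concatenate([
        state["position"],                              # (3) x, y, z
        state["dimensions"],                            # (3) length, width, height
        state["type_onehot"],                           # (3) car / pedestrian / cyclist
        [state["is_tracked"]],                          # (1)
        [state["is_ego"]],                              # (1)
        time_onehot,                                    # (11)
        [state["timestamp"]],                           # (1) exact time
        [np.cos(state["yaw"]), np.sin(state["yaw"])],   # (2) heading
        state["velocity"],                              # (2) vx, vy
        [state["valid"]],                               # (1) validity -> total 28
    ])
```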

Map

The map is represented as rasterized segmentation maps (heatmaps).

$$\mathbf{x}^{mp} \in \mathbb{R}^{H\times W \times C},$$

where $H=560, W=160, C=11$. The channel meanings are: 3 for segmentation type, 6 for lane shape, and 2 for lane color.

There are two ways to process the map features:

Patchify-Based

  1. Divide H and W into a grid of patches of size $N^p \times N^p$. This results in a tensor with shape $H/N^p \times W/N^p \times N^p \times N^p \times C$.
  2. Flatten each patch into a vector, resulting in a tensor with shape: $1 \times (H/N^p \times W/N^p) \times (N^p \times N^p \times C)$.
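
A minimal sketch of the patchify step, assuming a square patch of size $N^p = 8$ (which divides both $H=560$ and $W=160$ and yields $70 \times 20 = 1400$ tokens of dimension $8 \cdot 8 \cdot 11 = 704$); the patch size is an assumption for illustration:

```python
import torch

def patchify(x, patch):
    """x: (H, W, C) segmentation raster -> (1, num_patches, patch*patch*C) tokens."""
    H, W, C = x.shape
    x = x.reshape(H // patch, patch, W // patch, patch, C)
    x = x.permute(0, 2, 1, 3, 4)                        # (H/p, W/p, p, p, C)
    return x.reshape(1, (H // patch) * (W // patch), patch * patch * C)

tokens = patchify(torch.randn(560, 160, 11), patch=8)   # -> (1, 1400, 704)
```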

Image-Based (CNN)

  1. Start with a segmentation map of size: $560 \times 160 \times 11$.
  2. Process it through a series of CNN layers to get a feature map of size: $70 \times 20 \times 256$.
  3. Reshape this feature map into a sequence of tokens: $1 \times 1400 \times 256$.
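
The exact backbone is not specified above, so the following is only a hedged sketch: three stride-2 convolution stages downsample $560 \times 160$ to $70 \times 20$, and the feature map is then flattened into a token sequence.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(                      # three stride-2 stages: 560x160 -> 70x20 (assumed backbone)
    nn.Conv2d(11, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
)

x = torch.randn(1, 11, 560, 160)          # (B, C, H, W) segmentation raster
feat = cnn(x)                             # (1, 256, 70, 20)
tokens = feat.flatten(2).transpose(1, 2)  # (1, 1400, 256) sequence of map tokens
```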

Navigation

$$\mathbb{R}^{1\times 200 \times 2}$$

Raw inputs: a list of waypoints from the navigation application.

Waypoint requests are based on the current position from RTK.

Process: find the first point within the range of interest, then transform the waypoints into the ego coordinate frame:

  1. Resample to a fixed number of points with fixed spacing: if the path is too short, pad with zeros; if the path is too long, retain the first pre-defined number of points

  2. Transform each point into the ego coordinate frame, as sketched below
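
A minimal sketch of the pad/truncate and ego-frame transform, assuming 2D waypoints and an ego pose given as (x, y, yaw); arc-length resampling to fixed spacing is assumed to have been done upstream, and the 200-point length follows the shape above:

```python
import numpy as np

def resample_and_transform(waypoints, ego_xy, ego_yaw, num_points=200):
    """Pad/truncate to a fixed number of points, then express them in the ego frame."""
    pts = np.asarray(waypoints, dtype=np.float64)        # (K, 2) world-frame waypoints
    out = np.zeros((num_points, 2))
    k = min(len(pts), num_points)
    out[:k] = pts[:k]                                    # keep first points if too long, zero-pad if too short
    c, s = np.cos(ego_yaw), np.sin(ego_yaw)
    R = np.array([[c, s], [-s, c]])                      # rotation world -> ego
    out[:k] = (out[:k] - np.asarray(ego_xy)) @ R.T       # translate then rotate into the ego frame
    return out[None]                                     # (1, num_points, 2)
```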

Features:

  1. Positions (2): x, y in the ego coordinate frame

Update:

  1. Request a new navigation path from the current RTK pose, or reuse a previously computed one

  2. Transform it into the ego coordinate frame according to the relative pose

Route

$$\mathbb{R}^{1\times1500\times 2}$$

Contents: similar to the navigation input, but global, longer, and with more points.

For example, it has 50 points with 30 m of arc spacing between consecutive points.

Comparison between route and navigation

|         | Navigation | Route |
| ------- | ---------- | ----- |
| Points  | 40         | 30    |
| Spacing | 5 m        | 50 m  |

Traffic Light

$$\mathbf{x}\in \mathbb{R}^{1\times 10 \times 16}$$

Features:

  1. Positions (3): x, y, z

  2. Types (9): left turn, right turn, straight, U-turn, etc

  3. Color (3): red, yellow, green

  4. Confidence score (1)

Road Sign

$$\mathbb{R}^{1\times 10 \times 14}$$

Contents: stop, yield, speed limit, no entry, pedestrian crossing, turn restriction, etc.

Features:

  1. Position (2): x, y

  2. Bounding box corners (8)

  3. Types (3): crosswalk, bump, no_parking_area

  4. Confidence score (1)

Road Arrow

$$\mathbb{R}^{1\times 10 \times 20}$$

Contents: directional arrows on the road surface that indicate allowed or recommended driving directions (such as turn left, go straight, turn right, etc.)

Features:

  1. Center position (2): x, y

  2. Direction (2) : cos, sin

  3. Bounding box corners (8)

  4. Type (7): left, right, straight, U-turn, etc.

  5. Score (1)

Ground Truth Trajectory

The trajectory for the next 5 or 8 seconds, in the current ego coordinate frame:

$$\mathbb{R}^{50\times 2}$$

The trajectory comes from SLAM, RTK, or cyber pose, so relative pose accuracy is important.

Perceivers

Standard Transformer

Standard Encoder-Decoder Structure

Attention mechanism illustration
Illustration of the attention mechanism in the Transformer architecture [2].

  1. Tokenize/project the original input into a feature vector: $\mathbf{x}_{in}$

  2. Add the position embedding to the input: $\mathbf{x}_{input} = \mathbf{x}_{in} + \mathbf{x}_{pos}$

  3. Map the input into query, key, and value: $\mathbf{x}_{input} \to \mathbf{Q}, \mathbf{K},\mathbf{V}$

  4. Self-attention in the encoder

  5. In the decoder, start from a given token (e.g., <start>), then apply self-attention and cross-attention with the encoder output

Details in Attention

$$ \boxed{ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d_k}} \right) \mathbf{V} } $$

We define:

  • $\mathbf{Q} \in \mathbb{R}^{n \times d_k}$, Query matrix $\mathbf{Q} = \begin{bmatrix} \mathbf{q}_1^\top \\ \mathbf{q}_2^\top \\ \mathbf{q}_3^\top \\ \vdots\\ \mathbf{q}_n^\top \\ \end{bmatrix} $

  • $\mathbf{K} \in \mathbb{R}^{m \times d_k}$, Key matrix $\mathbf{K} = \begin{bmatrix} \mathbf{k}_1^\top \\ \mathbf{k}_2^\top \\ \mathbf{k}_3^\top \\ \vdots\\ \mathbf{k}_m^\top \end{bmatrix} $

  • $\mathbf{V} \in \mathbb{R}^{m \times d_v}$, Value matrix $\mathbf{V}= \begin{bmatrix} \mathbf{v}_1^\top \\ \mathbf{v}_2^\top \\ \mathbf{v}_3^\top \\ \vdots\\ \mathbf{v}_m^\top \\ \end{bmatrix} $

where

  • $\mathbf{q}_i \in \mathbb{R}^{d_k}$, $\mathbf{k}_j \in \mathbb{R}^{d_k}$, $\mathbf{v}_j \in \mathbb{R}^{d_v}$
  • $n$: number of queries
  • $m$: number of keys/values

Step 1: Dot Product Between Queries and Keys

$$\mathbf{S} = \mathbf{Q} \mathbf{K}^\top \in \mathbb{R}^{n \times m}$$

Each element:

$$S_{ij} = \mathbf{q}_i^\top \mathbf{k}_j$$

So:

$$\mathbf{S} = \begin{bmatrix} \mathbf{q}_1^\top \mathbf{k}_1 & \cdots & \mathbf{q}_1^\top \mathbf{k}_m \\ \vdots & \ddots & \vdots \\ \mathbf{q}_n^\top \mathbf{k}_1 & \cdots & \mathbf{q}_n^\top \mathbf{k}_m \end{bmatrix}$$

Step 2: Scale the Scores

$$\mathbf{S}' = \frac{\mathbf{S}}{\sqrt{d_k}} = \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d_k}} \in \mathbb{R}^{n \times m}$$

Step 3: Apply Softmax

$$\mathbf{A} = \text{softmax}(\mathbf{S}') \in \mathbb{R}^{n \times m}$$

Each row of $\mathbf{A}$, denoted $\mathbf{a}_i \in \mathbb{R}^{m}$, contains attention weights for query $\mathbf{q}_i$:

$$ \mathbf{a}_i = \text{softmax}\left( \frac{\mathbf{q}_i^\top \mathbf{K}^\top}{\sqrt{d_k}} \right) $$

That is:

$$a_{ij} = \frac{\exp\left( \frac{\mathbf{q}_i^\top \mathbf{k}_j}{\sqrt{d_k}} \right)}{\sum\limits_{j'=1}^{m} \exp\left( \frac{\mathbf{q}_i^\top \mathbf{k}_{j'}}{\sqrt{d_k}} \right)}$$

Step 4: Multiply Attention Weights by Value Matrix

$$\mathbf{O} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{n \times d_v}$$

Step 5: Compute Each Output Vector

Each output vector $\mathbf{o}_i \in \mathbb{R}^{d_v}$ is a weighted sum of all value vectors $\mathbf{v}_j$, weighted by attention weights $a_{ij}$:

$$\mathbf{o}_i = \sum_{j=1}^{m} a_{ij} \mathbf{v}_j.$$

So the full output is:

$$ \mathbf{O} = \begin{bmatrix} \mathbf{o}_1^\top \\ \mathbf{o}_2^\top \\ \vdots \\ \mathbf{o}_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d_v}. $$
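
The derivation above collapses into a few lines of code; here is a minimal NumPy sketch of scaled dot-product attention:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output (n, d_v)."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])               # Steps 1-2: scaled scores (n, m)
    A = softmax(S, axis=-1)                          # Step 3: each row sums to 1
    return A @ V                                     # Steps 4-5: weighted sum of value vectors
```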

Clarification of Self-Attention and Cross-Attention

  1. Self-Attention
  • Core Mechanism: Operates on a single input sequence. Queries (Q), keys (K), and values (V) are derived from the same source.

  • Purpose: Captures intra-sequence dependencies (e.g., relationships between agents in a scene or words in a sentence).

  2. Cross-Attention
  • Core Mechanism: Queries (Q) come from one sequence, while keys (K) and values (V) come from another, independent sequence.

  • Purpose: Enables cross-modal integration (e.g., fusing agent dynamics with map semantics or routing data).

Key Differences

| Aspect | Self-Attention | Cross-Attention |
| --- | --- | --- |
| Input Sources | Q, K, V from the same sequence | Q from sequence A; K/V from sequence B |
| Primary Role | Intra-sequence relationship modeling | Inter-sequence information fusion |

Perceiver [3]

Perceiver architecture diagram
Illustration of the Perceiver architecture, which uses a cross-attention mechanism to handle large and diverse inputs by mapping them to a smaller latent array [3].

Scene Encoder / Perceiver Encoder

A Perceiver encoder is designed to solve a different problem: handling extremely large inputs (such as images, audio, or the map data above) that are too big for standard self-attention. Its goal is not to enrich the giant input, but to distill it into a small, manageable, fixed-size latent array. Two points motivate this design:

  1. The input is too big for standard self-attention, whose cost grows quadratically with sequence length.

  2. The only way for the small latent array to "read" or "query" information from the large input data is through cross-attention.
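
A hedged PyTorch sketch of this idea: a small set of learned latents cross-attends to the (possibly very long) input, and all further processing happens at the latent size. The layer counts, widths, and initialization are illustrative, not the Wayformer configuration.

```python
import torch
import torch.nn as nn

class PerceiverEncoderSketch(nn.Module):
    def __init__(self, d=256, num_latents=64, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d) * 0.02)  # small learned latent array
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)   # latents read the big input
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):                               # x: (B, L, d), L can be very large
        B = x.shape[0]
        z = self.latents.unsqueeze(0).expand(B, -1, -1)
        z, _ = self.cross(query=z, key=x, value=x)      # cost O(num_latents * L), not O(L^2)
        z, _ = self.self_attn(z, z, z)                  # cheap self-attention on the latents only
        return z                                        # (B, num_latents, d)
```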

Trajectory Decoder / Perceiver Decoder

  1. Start from learnable output queries

  2. Cross-attention with the output of the encoder

  3. Self-attention among the output queries
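
A matching sketch of the decoder side, assuming one learned query per output mode; the head layout and the numbers of modes and future steps are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TrajectoryDecoderSketch(nn.Module):
    def __init__(self, d=256, num_modes=6, horizon=50):
        super().__init__()
        self.horizon = horizon
        self.queries = nn.Parameter(torch.randn(num_modes, d) * 0.02)  # one learnable query per mode
        self.cross = nn.MultiheadAttention(d, 4, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.traj_head = nn.Linear(d, horizon * 5)     # (mu_x, mu_y, s_x, s_y, rho) per future step
        self.score_head = nn.Linear(d, 1)              # mode probability logit

    def forward(self, scene):                          # scene: (B, num_latents, d) from the encoder
        B, M = scene.shape[0], self.queries.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.cross(q, scene, scene)             # read the encoded scene
        q, _ = self.self_attn(q, q, q)                 # let the modes interact
        traj = self.traj_head(q).view(B, M, self.horizon, 5)
        logits = self.score_head(q).squeeze(-1)        # (B, num_modes)
        return logits, traj
```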

Output

$$\{\boldsymbol{\pi}_i, \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i\}$$

where $\boldsymbol{\pi}_i$ is the mixing coefficient, $\boldsymbol{\mu}_i$ is the mean, and $\boldsymbol{\sigma}_i$ is the standard deviation.

Gaussian mixture

$$p(\mathbf{x})=\sum_i \pi_i\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i)$$

Or

Mixture of Laplace Distributions

A Laplace distribution has the probability density function:

$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left( -\frac{|x - \mu|}{b} \right).$$

In either case, the decoder outputs:

  • Predicted mode probabilities: $N_{mode}$

  • Predicted trajectories: $N_{mode}\times N_T \times 5$, with per-step parameters ($\mu_x, \mu_y, s_x, s_y, \rho$), where the covariance matrix is:

$$\mathbf{\Sigma}= \begin{bmatrix} s_x^2 & \rho s_x s_y \\ \rho s_x s_y & s_y^2 \end{bmatrix}$$
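
A short sketch of how the five per-step parameters map to this covariance and to a log-likelihood, using torch.distributions. It assumes $s_x, s_y > 0$ and $|\rho| < 1$ (in practice enforced with, e.g., softplus and tanh); this is an illustration, not the paper's exact parameterization.

```python
import torch
from torch.distributions import MultivariateNormal

def step_log_prob(params, gt_xy):
    """params: (..., 5) = (mu_x, mu_y, s_x, s_y, rho); gt_xy: (..., 2) ground-truth point."""
    mu = params[..., :2]
    s_x, s_y, rho = params[..., 2], params[..., 3], params[..., 4]
    cov = torch.stack([
        torch.stack([s_x**2, rho * s_x * s_y], dim=-1),
        torch.stack([rho * s_x * s_y, s_y**2], dim=-1),
    ], dim=-2)                                          # (..., 2, 2), the Sigma above
    return MultivariateNormal(mu, covariance_matrix=cov).log_prob(gt_xy)
```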

Loss

The total loss is the sum of a classification loss and a regression loss:

  1. Classification loss: select the predicted trajectory closest to the ground truth and compute the cross-entropy of its mode probability.

  2. Regression loss: minimize the negative log-likelihood of the ground-truth trajectory under the selected Gaussian.
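
A hedged sketch of the combined loss, assuming the closest mode is chosen by average displacement and reusing the step_log_prob helper sketched above; weighting, masking, and other details of an actual implementation are omitted:

```python
import torch
import torch.nn.functional as F

def trajectory_loss(logits, traj, gt):
    """logits: (B, M); traj: (B, M, T, 5); gt: (B, T, 2). Illustrative only."""
    # Pick, per sample, the mode whose mean trajectory is closest to the ground truth.
    with torch.no_grad():
        dist = (traj[..., :2] - gt[:, None]).norm(dim=-1).mean(-1)   # (B, M) average displacement
        best = dist.argmin(dim=-1)                                    # (B,) index of the closest mode
    # Classification: cross-entropy against the selected mode.
    cls_loss = F.cross_entropy(logits, best)
    # Regression: negative log-likelihood of the ground truth under the selected mode.
    sel = traj[torch.arange(traj.shape[0]), best]                     # (B, T, 5)
    nll = -step_log_prob(sel, gt).mean()                              # reuses the helper sketched above
    return cls_loss + nll
```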

References

[1] N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion Forecasting via Simple & Efficient Attention Networks,” Jul. 12, 2022, arXiv:2207.05844. doi: 10.48550/arXiv.2207.05844.

[2] A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2017.

[3] A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, O. Hénaff, M. M. Botvinick, A. Zisserman, O. Vinyals, and J. Carreira, “Perceiver IO: A General Architecture for Structured Inputs & Outputs,” CoRR, vol. abs/2107.14795, 2021. [Online]. Available: https://arxiv.org/abs/2107.14795