This post provides a technical deep dive into the Wayformer paper [1], a key publication in the field of motion forecasting.
Training Overview

Model
Overview of the One-Stage E2E model

Overview of the Two-Stage E2E model

Details of the Two-Stage E2E Model

Model Structure Overview


Feature Embedding/Feature Projection
$$\mathbf{f}\in \mathbb{R}^{T \times N\times D} \to \mathbf{x}_{input} \in \mathbb{R}^{(T \cdot N) \times d}$$where $T$ is the number of history time steps, $N$ is the number of entities, $D$ is the number of raw features, and $d=256$ is the model dimension.
- Feature projection: $\mathbf{x}_{in} = \mathbf{f}\,\mathbf{W}$, where $\mathbf{W} \in \mathbb{R}^{D\times d}$ and $\mathbf{x}_{in} \in \mathbb{R}^{T\times N \times d}$.
- Add time and position embeddings: $\mathbf{x}_{input} = \mathbf{x}_{in} + \mathbf{p}_t + \mathbf{p}_s$, where the time embedding $\mathbf{p}_t \in \mathbb{R}^{T \times 1 \times d}$ broadcasts along the agent dimension and the position (agent) embedding $\mathbf{p}_s \in \mathbb{R}^{1 \times N \times d}$ broadcasts along the time dimension.
- Spatio-temporal feature flattening: reshape $\mathbb{R}^{T\times N\times d} \to \mathbb{R}^{(T\cdot N)\times d}$ so that every (time step, entity) pair becomes one token (see the sketch below).
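Putting the three steps together, here is a minimal PyTorch sketch of the embedding stage, assuming the shapes above ($T=11$, $N=32$, $D=28$, $d=256$); the class and parameter names are illustrative, not taken from the Wayformer code.

```python
import torch
import torch.nn as nn

class AgentEmbedding(nn.Module):
    def __init__(self, T: int = 11, N: int = 32, D: int = 28, d: int = 256):
        super().__init__()
        self.proj = nn.Linear(D, d)                          # W in R^{D x d}
        self.time_emb = nn.Parameter(torch.zeros(T, 1, d))   # broadcasts over agents
        self.agent_emb = nn.Parameter(torch.zeros(1, N, d))  # broadcasts over time

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: [T, N, D] raw features
        x = self.proj(f)                         # [T, N, d]
        x = x + self.time_emb + self.agent_emb   # add embeddings by broadcasting
        return x.reshape(-1, x.shape[-1])        # [(T*N), d] flattened tokens

x_input = AgentEmbedding()(torch.randn(11, 32, 28))  # -> [352, 256]
```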
Agent and ego
$$\mathbb{R}^{11\times 32 \times 28}$$Features:
- Positions (3): x, y, z
- Dimensions (3): length, width, height
- Object type (3): one-hot encoding of type (e.g., car, pedestrian, cyclist)
- `is_tracked` (1): a flag indicating whether the object is tracked or predicted
- Ego car indicator (1)
- Time embedding (11): e.g., `[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]` indicates the state is from one step before the last time step
- Exact time (1)
- Heading (2): `cos(yaw)` and `sin(yaw)`
- Velocity (2): $v_x$, $v_y$
- Validity (1)
In the implementation, the dimensions are $T = 11$ (10 history + 1 current), $N = 32$ (31 agents + 1 ego), and $D=28$.
Map
The map is represented as segmentation heatmaps.
$$\mathbf{x}^{mp} \in \mathbb{R}^{H\times W \times C},$$where $H=560, W=160, C=11$. The channel meanings are: 3 for segmentation type, 6 for lane shape, and 2 for lane color.
There are two ways to process the map features:
Patchify-Based
- Divide the H and W dimensions into a grid of $N^p \times N^p$ patches. This results in a tensor with shape $H/N^p \times W/N^p \times N^p \times N^p \times C$.
- Flatten each patch into a token vector, resulting in a tensor with shape $1 \times (H/N^p \times W/N^p) \times (N^p \times N^p \times C)$ (sketched below).
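A minimal PyTorch sketch of the patchify path, assuming a hypothetical patch size $N^p = 8$ that evenly divides $H$ and $W$:

```python
import torch

H, W, C, Np = 560, 160, 11, 8
x = torch.randn(H, W, C)  # rasterized map

# [H, W, C] -> [H/Np, Np, W/Np, Np, C] -> [H/Np, W/Np, Np, Np, C]
patches = x.reshape(H // Np, Np, W // Np, Np, C).permute(0, 2, 1, 3, 4)
# Flatten each patch into a token: [1, (H/Np * W/Np), Np*Np*C]
tokens = patches.reshape(1, (H // Np) * (W // Np), Np * Np * C)
```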
Image-Based (CNN)
- Start with a segmentation map of size: $560 \times 160 \times 11$.
- Process it through a series of CNN layers to get a feature map of size: $70 \times 20 \times 256$.
- Reshape this feature map into a sequence of tokens: $1 \times 1400 \times 256$ (a sketch follows below).
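A minimal PyTorch sketch of the CNN path; the three stride-2 blocks below are an assumed architecture that merely reproduces the $8\times$ downsampling, not the paper's exact layer configuration:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(11, 64, kernel_size=3, stride=2, padding=1),   # 560x160 -> 280x80
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 280x80 -> 140x40
    nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), # 140x40 -> 70x20
)

x = torch.randn(1, 11, 560, 160)          # [B, C, H, W]
feat = cnn(x)                             # [1, 256, 70, 20]
tokens = feat.flatten(2).transpose(1, 2)  # [1, 1400, 256] token sequence
```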
Navigation
$$\mathbb{R}^{1\times 200 \times 2}$$Raw inputs: a list of waypoints from the navigation application, requested based on the current RTK position.
Process:
- Find the first waypoint within the range of interest.
- Transform the waypoints into the ego coordinate frame.
- Resample to a fixed number of points with fixed spacing: if the path is too short, pad with zeros; if the path is too long, retain only the first pre-defined number of points (as sketched below).
Features:
- Positions (2): x, y in the ego frame
Update:
- Request a new navigation path from the current RTK pose, or reuse the previously computed one.
- Transform to the ego frame according to the relative pose.
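A minimal NumPy sketch of the resampling step, assuming a hypothetical fixed spacing and point count (the 5 m / 200-point values below are illustrative):

```python
import numpy as np

def resample_waypoints(pts: np.ndarray, n_points: int = 200, spacing: float = 5.0) -> np.ndarray:
    """pts: [M, 2] waypoints already in the ego frame; returns [n_points, 2]."""
    # Arc length along the raw polyline.
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    # Sample at fixed spacing; targets beyond the path length stay zero-padded.
    targets = np.arange(n_points) * spacing
    out = np.zeros((n_points, 2))
    valid = targets <= s[-1]
    out[valid, 0] = np.interp(targets[valid], s, pts[:, 0])
    out[valid, 1] = np.interp(targets[valid], s, pts[:, 1])
    return out
```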
Route
$$\mathbb{R}^{1\times1500\times 2}$$Contents: similar to navigation, but global, longer, and with more points.
For example, it may have 50 points with 30m arc spacing between consecutive points.
Comparison between route and navigation
| | Navigation | Route |
| --- | --- | --- |
| Points | 40 | 30 |
| Spacing | 5m | 50m |
Traffic Light
$$\mathbf{x}\in \mathbb{R}^{1\times 10 \times 16}$$Features:
- Positions (3): x, y, z
- Types (9): left turn, right turn, straight, U-turn, etc.
- Color (3): red, yellow, green
- Confidence score (1)
Road Sign
$$\mathbb{R}^{1\times 10 \times 14}$$Contents: stop, yield, speed limit, no entry, pedestrian crossing, turn restriction, etc.
Features:
- Position (2): x, y
- Bounding box corners (8)
- Types (3): crosswalk, bump, no_parking_area
- Confidence score (1)
Road Arrow
$$\mathbb{R}^{1\times 10 \times 20}$$Contents: directional arrows on the road surface that indicate allowed or recommended driving directions (such as turn left, go straight, turn right, etc.)
Features:
- Center position (2): x, y
- Direction (2): cos, sin
- Bounding box corners (8)
- Type (7): left, right, straight, U-turn, etc.
- Score (1)
Ground Truth Trajectory
The trajectory for the next 5 or 8 seconds in the current ego coordinate frame:
$$\mathbb{R}^{50\times 2}$$The trajectory comes from SLAM, RTK, or the cyber pose, so relative pose accuracy is important.
Perceivers
Standard Transformer
Standard Encoder Decoder Structure

- Tokenize/project the original input into feature vectors: $\mathbf{x}_{in}$
- Add the position embedding to the input: $\mathbf{x}_{input} = \mathbf{x}_{in} + \mathbf{x}_{pos}$
- Map the input into query, key, and value matrices: $\mathbf{x}_{input} \to \mathbf{Q}, \mathbf{K},\mathbf{V}$
- Self-attention in the encoder
- In the decoder, start from a given token (e.g., <start>), apply self-attention, then cross-attention with the encoder output
Details of the Attention Computation
$$ \boxed{ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d_k}} \right) \mathbf{V} } $$We define:
$\mathbf{Q} \in \mathbb{R}^{n \times d_k}$, Query matrix $\mathbf{Q} = \begin{bmatrix} \mathbf{q}_1^\top \\ \mathbf{q}_2^\top \\ \mathbf{q}_3^\top \\ \vdots\\ \mathbf{q}_n^\top \\ \end{bmatrix} $
$\mathbf{K} \in \mathbb{R}^{m \times d_k}$, Key matrix $\mathbf{K} = \begin{bmatrix} \mathbf{k}_1^\top \\ \mathbf{k}_2^\top \\ \mathbf{k}_3^\top \\ \vdots\\ \mathbf{k}_m^\top \end{bmatrix} $
$\mathbf{V} \in \mathbb{R}^{m \times d_v}$, Value matrix $\mathbf{V}= \begin{bmatrix} \mathbf{v}_1^\top \\ \mathbf{v}_2^\top \\ \mathbf{v}_3^\top \\ \vdots\\ \mathbf{v}_m^\top \\ \end{bmatrix} $
where
- $\mathbf{q}_i \in \mathbb{R}^{d_k}$, $\mathbf{k}_j \in \mathbb{R}^{d_k}$, $\mathbf{v}_j \in \mathbb{R}^{d_v}$
- $n$: number of queries
- $m$: number of keys/values
Step 1: Dot Product Between Queries and Keys
$$\mathbf{S} = \mathbf{Q} \mathbf{K}^\top \in \mathbb{R}^{n \times m}$$Each element:
$$S_{ij} = \mathbf{q}_i^\top \mathbf{k}_j$$So:
$$\mathbf{S} = \begin{bmatrix} \mathbf{q}_1^\top \mathbf{k}_1 & \cdots & \mathbf{q}_1^\top \mathbf{k}_m \\ \vdots & \ddots & \vdots \\ \mathbf{q}_n^\top \mathbf{k}_1 & \cdots & \mathbf{q}_n^\top \mathbf{k}_m \end{bmatrix}$$Step 2: Scale the Scores
$$\mathbf{S}' = \frac{\mathbf{S}}{\sqrt{d_k}} = \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d_k}} \in \mathbb{R}^{n \times m}$$Step 3: Apply Softmax
$$\mathbf{A} = \text{softmax}(\mathbf{S}') \in \mathbb{R}^{n \times m}$$Each row of $\mathbf{A}$, denoted $\mathbf{a}_i \in \mathbb{R}^{m}$, contains attention weights for query $\mathbf{q}_i$:
$$ \mathbf{a}_i = \text{softmax}\left( \frac{\mathbf{q}_i^\top \mathbf{K}^\top}{\sqrt{d_k}} \right) $$That is:
$$a_{ij} = \frac{\exp\left( \frac{\mathbf{q}_i^\top \mathbf{k}_j}{\sqrt{d_k}} \right)}{\sum\limits_{j'=1}^{m} \exp\left( \frac{\mathbf{q}_i^\top \mathbf{k}_{j'}}{\sqrt{d_k}} \right)}$$Step 4: Multiply Attention Weights by Value Matrix
$$\mathbf{O} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{n \times d_v}$$Step 5: Compute Each Output Vector
Each output vector $\mathbf{o}_i \in \mathbb{R}^{d_v}$ is a weighted sum of all value vectors $\mathbf{v}_j$, weighted by attention weights $a_{ij}$:
$$\mathbf{o}_i = \sum_{j=1}^{m} a_{ij} \mathbf{v}_j.$$So the full output is:
$$ \mathbf{O} = \begin{bmatrix} \mathbf{o}_1^\top \\ \mathbf{o}_2^\top \\ \vdots \\ \mathbf{o}_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d_v}. $$Clarification of Self-Attention and Cross-Attention
- Self-Attention
Core Mechanism: Operates on a single input sequence. Queries (Q), keys (K), and values (V) are derived from the same source.
Purpose: Captures intra-sequence dependencies (e.g., relationships between agents in a scene or words in a sentence).
- Cross-Attention
Core Mechanism: Queries (Q) come from one sequence, while keys (K) and values (V) come from another independent sequence.
Purpose: Enables cross-modal integration (e.g., fusing agent dynamics with map semantics or routing data).
Key Differences
| Aspect | Self-Attention | Cross-Attention |
| --- | --- | --- |
| Input Sources | Q, K, V from the same sequence | Q from sequence A; K/V from sequence B |
| Primary Role | Intra-sequence relationship modeling | Inter-sequence information fusion |
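A minimal PyTorch sketch of scaled dot-product attention; the same function serves both cases, the only difference being where Q and K/V come from (the token counts below are illustrative):

```python
import torch
import torch.nn.functional as F

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # Q: [n, d_k], K: [m, d_k], V: [m, d_v]
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # S' in R^{n x m}
    A = F.softmax(scores, dim=-1)                   # attention weights, rows sum to 1
    return A @ V                                    # O in R^{n x d_v}

x = torch.randn(352, 256)                 # e.g., agent tokens
m = torch.randn(1400, 256)                # e.g., map tokens
self_out  = attention(x, x, x)            # self-attention
cross_out = attention(x, m, m)            # cross-attention: agents query the map
```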
Perceiver [3]

Scene Encoder / Perceiver Encoder
A Perceiver encoder is designed to solve a different problem: handling extremely large inputs (such as images, audio, or the map data here) that are too big for standard self-attention. Its goal is not to enrich the giant input, but to distill it into a small, manageable, fixed-size latent array. Two points motivate this design:
- The input is too big for standard self-attention.
- The only way for the small latent array to "read" or "query" information from the large input is through cross-attention.
Trajectory Decoder/Perceiver Decoder
- Start from a set of learnable output queries
- Cross-attention with the output of the encoder
- Self-attention among the queries (see the sketch below)
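A minimal PyTorch sketch of this encoder/decoder structure using nn.MultiheadAttention; the latent count, head count, and single-layer depth are assumptions for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class PerceiverSketch(nn.Module):
    def __init__(self, d: int = 256, n_latents: int = 64, n_modes: int = 6):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d))   # small latent array
        self.queries = nn.Parameter(torch.randn(n_modes, d))     # learnable output queries
        self.enc_cross = nn.MultiheadAttention(d, 8, batch_first=True)
        self.enc_self  = nn.MultiheadAttention(d, 8, batch_first=True)
        self.dec_cross = nn.MultiheadAttention(d, 8, batch_first=True)
        self.dec_self  = nn.MultiheadAttention(d, 8, batch_first=True)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: [B, L, d] -- the large token sequence (agents, map, route, ...)
        B = inputs.shape[0]
        z = self.latents.expand(B, -1, -1)
        z, _ = self.enc_cross(z, inputs, inputs)   # latents read the large input
        z, _ = self.enc_self(z, z, z)              # latents process among themselves
        q = self.queries.expand(B, -1, -1)
        q, _ = self.dec_cross(q, z, z)             # output queries read the latents
        q, _ = self.dec_self(q, q, q)              # queries refine each other
        return q                                   # [B, n_modes, d] per-mode features

out = PerceiverSketch()(torch.randn(2, 1752, 256))
```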
Output
$$\{\boldsymbol{\pi}_i, \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i\}$$where $\boldsymbol{\pi}_i$ is the mixing coefficient, $\boldsymbol{\mu}_i$ is the mean, and $\boldsymbol{\sigma}_i$ is the standard deviation.
Gaussian mixture
$$p(\mathbf{x}\mid\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma})=\sum_i \pi_i\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_i, \boldsymbol{\sigma}_i)$$Or
Mixture of Laplace Distributions
A Laplace distribution has the probability density function:
$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left( -\frac{|x - \mu|}{b} \right).$$Predicted probability (one score per mode): $N_{mode}$
Predicted trajectory: $N_{mode}\times N_T \times 5$, with per-step parameters ($\mu_x, \mu_y, s_x, s_y, \rho$), where the covariance matrix is:
$$\boldsymbol{\Sigma} = \begin{bmatrix} s_x^2 & \rho s_x s_y \\ \rho s_x s_y & s_y^2 \end{bmatrix}.$$
Loss
The total loss is the sum of a classification loss and a regression loss (a sketch follows below).
Classification loss: select the predicted mode closest to the ground truth and compute the cross-entropy on the mode probabilities.
Regression loss: minimize the negative log-likelihood of the ground-truth trajectory under the selected Gaussian.
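A minimal PyTorch sketch of the combined loss, assuming hard assignment of the closest mode and per-step axis-aligned Gaussians (the $\rho$ correlation term is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def mixture_loss(logits, mu, sigma, gt):
    # logits: [B, M] mode scores; mu, sigma: [B, M, T, 2]; gt: [B, T, 2]
    with torch.no_grad():
        # Select the mode whose mean trajectory is closest to the ground truth.
        dist = ((mu - gt[:, None]) ** 2).sum(dim=(-1, -2))   # [B, M]
        best = dist.argmin(dim=-1)                           # [B]
    cls_loss = F.cross_entropy(logits, best)
    idx = torch.arange(gt.shape[0])
    mu_b, sigma_b = mu[idx, best], sigma[idx, best]          # [B, T, 2]
    # Negative log-likelihood of the ground truth under the selected Gaussian
    # (constant terms dropped).
    nll = (torch.log(sigma_b) + 0.5 * ((gt - mu_b) / sigma_b) ** 2).sum(dim=(-1, -2)).mean()
    return cls_loss + nll

loss = mixture_loss(torch.randn(4, 6), torch.randn(4, 6, 50, 2),
                    torch.rand(4, 6, 50, 2) + 0.1, torch.randn(4, 50, 2))
```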
References
[1] N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion Forecasting via Simple & Efficient Attention Networks,” Jul. 12, 2022, arXiv:2207.05844. doi: 10.48550/arXiv.2207.05844.
[2] A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2017.
[3] A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, O. Hénaff, M. M. Botvinick, A. Zisserman, O. Vinyals, and J. Carreira, “Perceiver IO: A General Architecture for Structured Inputs & Outputs,” CoRR, vol. abs/2107.14795, 2021. [Online]. Available: https://arxiv.org/abs/2107.14795