This post provides a technical deep dive into the Wayformer paper [1], a key publication in the field of motion forecasting.

Training Overview

Wayformer training pipeline
An overview of the deep learning training pipeline, illustrating the data flow and key components involved during model training.

Model

Overview of the One-Stage E2E Model

One-stage E2E model
One-stage E2E model.

Overview of the Two-Stage E2E Model

Two-stage E2E model
Two-stage E2E model.

Details of the Two-Stage E2E Model

Overview of the Wayformer model
Overview of the Wayformer model.

Model Structure Overview

Wayformer Figure 1
(a)
Wayformer Figure 2
(b)
The left figure shows the encoder and decoder of the Wayformer model. The right figure shows the details of the encoder [1].

Feature Embedding / Feature Projection

$$\mathbf{f}\in \mathbb{R}^{T \times N\times D} \to \mathbf{x}_{input} \in \mathbb{R}^{(T \cdot N) \times d}$$

Where $T$ is the number of history time steps, $N$ is the number of entities, $D$ is the number of raw features, and $d=256$ is the model width.

  1. Feature projection
$$\mathbf{x}_{in} = \mathbf{f} \mathbf{W}$$

Where $\mathbf{W} \in \mathbb{R}^{D\times d}$ and $\mathbf{x}_{in} \in \mathbb{R}^{T\times N \times d}$

  2. Add time and position embeddings
$$\mathbf{x}_{input} = \mathbf{x}_{in} + \mathbf{p}_t + \mathbf{p}_s$$

Where the time embedding $\mathbf{p}_t \in \mathbb{R}^{T \times 1 \times d}$ is broadcast across the agent dimension, and the position embedding $\mathbf{p}_s \in \mathbb{R}^{1 \times N \times d}$ is broadcast across the time dimension.

  3. Spatio-Temporal Feature Flattening
$$\mathbb{R}^{T\times N \times d} \to \mathbb{R}^{(T\cdot N) \times d}$$
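
As a concrete illustration, here is a minimal PyTorch sketch of the three steps (projection, embedding, flattening). The shapes follow the text above; the zero-initialized embeddings and the bias in the linear layer are implementation assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

T, N, D, d = 11, 32, 28, 256                   # history steps, entities, raw features, model width

proj = nn.Linear(D, d)                         # implements x_in = f W (plus a bias)
p_t = nn.Parameter(torch.zeros(T, 1, d))       # time embedding, broadcast over agents
p_s = nn.Parameter(torch.zeros(1, N, d))       # position/agent embedding, broadcast over time

f = torch.randn(T, N, D)                       # raw input features
x_in = proj(f)                                 # (T, N, d)
x_input = x_in + p_t + p_s                     # broadcasting adds both embeddings
x_input = x_input.reshape(T * N, d)            # spatio-temporal flattening -> (T*N, d)
```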

Agent and ego

$$\mathbb{R}^{11\times 32 \times 28}$$

Features:

  1. Positions (3): x, y, z
  2. Dimensions (3): Length, Width, Height
  3. Object type (3): One-hot encoding of type (e.g., car, pedestrian, cyclist)
  4. is_tracked (1): A flag indicating if the object is tracked or predicted
  5. Ego car indicator (1)
  6. Time embedding (11): e.g., [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] indicates the state is from one step before the last time step.
  7. Exact time (1)
  8. Heading (2): cos(yaw) and sin(yaw)
  9. Velocity (2): $v_x$, $v_y$
  10. Validity (1)

In the implementation, the dimensions are $T = 11$ (10 history + 1 current), $N = 32$ (31 agents + 1 ego), and $D=28$.
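
For illustration, a hypothetical sketch of how one 28-dimensional per-agent feature vector could be assembled from the components listed above (the field names and the dictionary layout are invented for this example, not taken from the paper or any released code):

```python
import numpy as np

def agent_features(state, t_index, T=11):
    """Concatenate the listed components into one D=28 vector (illustrative only)."""
    time_onehot = np.zeros(T)
    time_onehot[t_index] = 1.0                          # e.g. index T-2 = one step before last
    return np.concatenate([
        state["position"],                              # (3) x, y, z
        state["dimensions"],                            # (3) length, width, height
        state["type_onehot"],                           # (3) car / pedestrian / cyclist
        [state["is_tracked"]],                          # (1)
        [state["is_ego"]],                              # (1)
        time_onehot,                                    # (11)
        [state["timestamp"]],                           # (1) exact time
        [np.cos(state["yaw"]), np.sin(state["yaw"])],   # (2) heading
        state["velocity"],                              # (2) vx, vy
        [state["valid"]],                               # (1) validity -> total 28
    ])
```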

Map

The map is represented as rasterized segmentation maps (heatmaps).

$$\mathbf{x}^{mp} \in \mathbb{R}^{H\times W \times C},$$

where $H=560, W=160, C=11$. The channel meanings are: 3 for segmentation type, 6 for lane shape, and 2 for lane color.

There are two ways to process the map features:

Patchify-Based

  1. Divide H and W into a grid of patches of size $N^p \times N^p$. This results in a tensor with shape $H/N^p \times W/N^p \times N^p \times N^p \times C$.
  2. Flatten each patch into a vector, resulting in a tensor with shape: $1 \times (H/N^p \times W/N^p) \times (N^p \times N^p \times C)$.
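
A minimal sketch of the patchify step, assuming a square patch of size $N^p = 8$ (which divides both $H=560$ and $W=160$ and yields $70 \times 20 = 1400$ tokens of dimension $8 \cdot 8 \cdot 11 = 704$); the patch size is an assumption for illustration:

```python
import torch

def patchify(x, patch):
    """x: (H, W, C) segmentation raster -> (1, num_patches, patch*patch*C) tokens."""
    H, W, C = x.shape
    x = x.reshape(H // patch, patch, W // patch, patch, C)
    x = x.permute(0, 2, 1, 3, 4)                        # (H/p, W/p, p, p, C)
    return x.reshape(1, (H // patch) * (W // patch), patch * patch * C)

tokens = patchify(torch.randn(560, 160, 11), patch=8)   # -> (1, 1400, 704)
```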

Image-Based (CNN)

  1. Start with a segmentation map of size: $560 \times 160 \times 11$.
  2. Process it through a series of CNN layers to get a feature map of size: $70 \times 20 \times 256$.
  3. Reshape this feature map into a sequence of tokens: $1 \times 1400 \times 256$.
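
The exact backbone is not specified above, so the following is only a hedged sketch: three stride-2 convolution stages downsample $560 \times 160$ to $70 \times 20$, and the feature map is then flattened into a token sequence.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(                      # three stride-2 stages: 560x160 -> 70x20 (assumed backbone)
    nn.Conv2d(11, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
)

x = torch.randn(1, 11, 560, 160)          # (B, C, H, W) segmentation raster
feat = cnn(x)                             # (1, 256, 70, 20)
tokens = feat.flatten(2).transpose(1, 2)  # (1, 1400, 256) sequence of map tokens
```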

Navigation

$$\mathbb{R}^{1\times 200 \times 2}$$

Raw inputs: a list of waypoints from the navigation application.

Waypoint requests are based on the current position from RTK.

Process: find the first point within the range of interest, then transform the waypoints into the ego coordinate frame:

  1. Resample to a fixed number of points with fixed spacing: if the path is too short, pad with zeros; if the path is too long, retain the first pre-defined number of points

  2. Transform each point into the ego coordinate frame, as sketched below
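
A minimal sketch of the pad/truncate and ego-frame transform, assuming 2D waypoints and an ego pose given as (x, y, yaw); arc-length resampling to fixed spacing is assumed to have been done upstream, and the 200-point length follows the shape above:

```python
import numpy as np

def resample_and_transform(waypoints, ego_xy, ego_yaw, num_points=200):
    """Pad/truncate to a fixed number of points, then express them in the ego frame."""
    pts = np.asarray(waypoints, dtype=np.float64)        # (K, 2) world-frame waypoints
    out = np.zeros((num_points, 2))
    k = min(len(pts), num_points)
    out[:k] = pts[:k]                                    # keep first points if too long, zero-pad if too short
    c, s = np.cos(ego_yaw), np.sin(ego_yaw)
    R = np.array([[c, s], [-s, c]])                      # rotation world -> ego
    out[:k] = (out[:k] - np.asarray(ego_xy)) @ R.T       # translate then rotate into the ego frame
    return out[None]                                     # (1, num_points, 2)
```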

Features:

  1. Positions (2): x, y in the ego coordinate frame

Update:

  1. Request a new navigation path from the current RTK pose, or reuse a previously computed one

  2. Transform it into the ego coordinate frame according to the relative pose

Route

$$\mathbb{R}^{1\times1500\times 2}$$

Contents: similar to the navigation input, but global, longer, and with more points.

For example, it has 50 points with 30 m of arc spacing between consecutive points.

Comparison between route and navigation

|         | Navigation | Route |
| ------- | ---------- | ----- |
| Points  | 40         | 30    |
| Spacing | 5 m        | 50 m  |

Traffic Light

$$\mathbf{x}\in \mathbb{R}^{1\times 10 \times 16}$$

Features:

  1. Positions (3): x, y, z

  2. Types (9): left turn, right turn, straight, U-turn, etc

  3. Color (3): red, yellow, green

  4. Confidence score (1)

Road Sign

$$\mathbb{R}^{1\times 10 \times 14}$$

Contents: stop, yield, speed limit, no entry, pedestrian crossing, turn restriction, etc.

Features:

  1. Position (2): x, y

  2. Bounding box corners (8)

  3. Types (3): crosswalk, bump, no_parking_area

  4. Confidence score (1)

Road Arrow

$$\mathbb{R}^{1\times 10 \times 20}$$

Contents: directional arrows on the road surface that indicate allowed or recommended driving directions (such as turn left, go straight, turn right, etc.)

Features:

  1. Center position (2): x, y

  2. Direction (2) : cos, sin

  3. Bounding box corners (8)

  4. Type (7): left, right, straight, U-turn, etc.

  5. Score (1)

Ground Truth Trajectory

The trajectory for the next 5 or 8 seconds, in the current ego coordinate frame:

$$\mathbb{R}^{50\times 2}$$

The trajectory comes from SLAM, RTK, or cyber pose, so relative pose accuracy is important.

Perceivers

Standard Transformer

Standard Encoder-Decoder Structure

Attention mechanism illustration
Illustration of the attention mechanism in the Transformer architecture [2].

  1. Tokenize/project the original input into a feature vector: $\mathbf{x}_{in}$

  2. Add the position embedding to the input: $\mathbf{x}_{input} = \mathbf{x}_{in} + \mathbf{x}_{pos}$

  3. Map the input into query, key, and value: $\mathbf{x}_{input} \to \mathbf{Q}, \mathbf{K},\mathbf{V}$

  4. Self-attention in the encoder

  5. In the decoder, start from a given token (e.g., <start>), then apply self-attention and cross-attention with the encoder output

Details in Attention

$$ \boxed{ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d_k}} \right) \mathbf{V} } $$

We define:

  • $\mathbf{Q} \in \mathbb{R}^{n \times d_k}$, Query matrix $\mathbf{Q} = \begin{bmatrix} \mathbf{q}_1^\top \\ \mathbf{q}_2^\top \\ \mathbf{q}_3^\top \\ \vdots\\ \mathbf{q}_n^\top \\ \end{bmatrix} $

  • $\mathbf{K} \in \mathbb{R}^{m \times d_k}$, Key matrix $\mathbf{K} = \begin{bmatrix} \mathbf{k}_1^\top \\ \mathbf{k}_2^\top \\ \mathbf{k}_3^\top \\ \vdots\\ \mathbf{k}_m^\top \end{bmatrix} $

  • $\mathbf{V} \in \mathbb{R}^{m \times d_v}$, Value matrix $\mathbf{V}= \begin{bmatrix} \mathbf{v}_1^\top \\ \mathbf{v}_2^\top \\ \mathbf{v}_3^\top \\ \vdots\\ \mathbf{v}_m^\top \\ \end{bmatrix} $

where

  • $\mathbf{q}_i \in \mathbb{R}^{d_k}$, $\mathbf{k}_j \in \mathbb{R}^{d_k}$, $\mathbf{v}_j \in \mathbb{R}^{d_v}$
  • $n$: number of queries
  • $m$: number of keys/values

Step 1: Dot Product Between Queries and Keys

$$\mathbf{S} = \mathbf{Q} \mathbf{K}^\top \in \mathbb{R}^{n \times m}$$

Each element:

$$S_{ij} = \mathbf{q}_i^\top \mathbf{k}_j$$

So:

$$\mathbf{S} = \begin{bmatrix} \mathbf{q}_1^\top \mathbf{k}_1 & \cdots & \mathbf{q}_1^\top \mathbf{k}_m \\ \vdots & \ddots & \vdots \\ \mathbf{q}_n^\top \mathbf{k}_1 & \cdots & \mathbf{q}_n^\top \mathbf{k}_m \end{bmatrix}$$

Step 2: Scale the Scores

$$\mathbf{S}' = \frac{\mathbf{S}}{\sqrt{d_k}} = \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d_k}} \in \mathbb{R}^{n \times m}$$

Step 3: Apply Softmax

$$\mathbf{A} = \text{softmax}(\mathbf{S}') \in \mathbb{R}^{n \times m}$$

Each row of $\mathbf{A}$, denoted $\mathbf{a}_i \in \mathbb{R}^{m}$, contains attention weights for query $\mathbf{q}_i$:

$$ \mathbf{a}_i = \text{softmax}\left( \frac{\mathbf{q}_i^\top \mathbf{K}^\top}{\sqrt{d_k}} \right) $$

That is:

$$a_{ij} = \frac{\exp\left( \frac{\mathbf{q}_i^\top \mathbf{k}_j}{\sqrt{d_k}} \right)}{\sum\limits_{j'=1}^{m} \exp\left( \frac{\mathbf{q}_i^\top \mathbf{k}_{j'}}{\sqrt{d_k}} \right)}$$

Step 4: Multiply Attention Weights by Value Matrix

$$\mathbf{O} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{n \times d_v}$$

Step 5: Compute Each Output Vector

Each output vector $\mathbf{o}_i \in \mathbb{R}^{d_v}$ is a weighted sum of all value vectors $\mathbf{v}_j$, weighted by attention weights $a_{ij}$:

$$\mathbf{o}_i = \sum_{j=1}^{m} a_{ij} \mathbf{v}_j.$$

So the full output is:

$$ \mathbf{O} = \begin{bmatrix} \mathbf{o}_1^\top \\ \mathbf{o}_2^\top \\ \vdots \\ \mathbf{o}_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d_v}. $$
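
The derivation above collapses into a few lines of code; here is a minimal NumPy sketch of scaled dot-product attention:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output (n, d_v)."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])               # Steps 1-2: scaled scores (n, m)
    A = softmax(S, axis=-1)                          # Step 3: each row sums to 1
    return A @ V                                     # Steps 4-5: weighted sum of value vectors
```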

Clarification of Self-Attention and Cross-Attention

  1. Self-Attention
  • Core Mechanism: Operates on a single input sequence. Queries (Q), keys (K), and values (V) are derived from the same source.

  • Purpose: Captures intra-sequence dependencies (e.g., relationships between agents in a scene or words in a sentence).

  2. Cross-Attention
  • Core Mechanism: Queries (Q) come from one sequence, while keys (K) and values (V) come from another, independent sequence.

  • Purpose: Enables cross-modal integration (e.g., fusing agent dynamics with map semantics or routing data).

Key Differences

| Aspect | Self-Attention | Cross-Attention |
| --- | --- | --- |
| Input Sources | Q, K, V from the same sequence | Q from sequence A; K/V from sequence B |
| Primary Role | Intra-sequence relationship modeling | Inter-sequence information fusion |

Perceiver [3]

Perceiver architecture diagram
Illustration of the Perceiver architecture, which uses a cross-attention mechanism to handle large and diverse inputs by mapping them to a smaller latent array [3].

Scene Encoder / Perceiver Encoder

A Perceiver encoder is designed to solve a different problem: handling extremely large inputs (such as images, audio, or the map data above) that are too big for standard self-attention. Its goal is not to enrich the giant input, but to distill it into a small, manageable, fixed-size latent array. Two points motivate this design:

  1. The input is too big for standard self-attention, whose cost grows quadratically with sequence length.

  2. The only way for the small latent array to "read" or "query" information from the large input data is through cross-attention.
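
A hedged PyTorch sketch of this idea: a small set of learned latents cross-attends to the (possibly very long) input, and all further processing happens at the latent size. The layer counts, widths, and initialization are illustrative, not the Wayformer configuration.

```python
import torch
import torch.nn as nn

class PerceiverEncoderSketch(nn.Module):
    def __init__(self, d=256, num_latents=64, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d) * 0.02)  # small learned latent array
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)   # latents read the big input
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):                               # x: (B, L, d), L can be very large
        B = x.shape[0]
        z = self.latents.unsqueeze(0).expand(B, -1, -1)
        z, _ = self.cross(query=z, key=x, value=x)      # cost O(num_latents * L), not O(L^2)
        z, _ = self.self_attn(z, z, z)                  # cheap self-attention on the latents only
        return z                                        # (B, num_latents, d)
```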

Trajectory Decoder / Perceiver Decoder

  1. Start from learnable output queries

  2. Cross-attention with the output of the encoder

  3. Self-attention among the output queries
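
A matching sketch of the decoder side, assuming one learned query per output mode; the head layout and the numbers of modes and future steps are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TrajectoryDecoderSketch(nn.Module):
    def __init__(self, d=256, num_modes=6, horizon=50):
        super().__init__()
        self.horizon = horizon
        self.queries = nn.Parameter(torch.randn(num_modes, d) * 0.02)  # one learnable query per mode
        self.cross = nn.MultiheadAttention(d, 4, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.traj_head = nn.Linear(d, horizon * 5)     # (mu_x, mu_y, s_x, s_y, rho) per future step
        self.score_head = nn.Linear(d, 1)              # mode probability logit

    def forward(self, scene):                          # scene: (B, num_latents, d) from the encoder
        B, M = scene.shape[0], self.queries.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.cross(q, scene, scene)             # read the encoded scene
        q, _ = self.self_attn(q, q, q)                 # let the modes interact
        traj = self.traj_head(q).view(B, M, self.horizon, 5)
        logits = self.score_head(q).squeeze(-1)        # (B, num_modes)
        return logits, traj
```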

Output

$$\{\boldsymbol{\pi}_i, \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i\}$$

where $\boldsymbol{\pi}_i$ is the mixing coefficient, $\boldsymbol{\mu}_i$ is the mean, and $\boldsymbol{\sigma}_i$ is the standard deviation.

Gaussian mixture

$$p(\mathbf{x})=\sum_i \pi_i\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i)$$

Or

Mixture of Laplace Distributions

A Laplace distribution has the probability density function:

$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left( -\frac{|x - \mu|}{b} \right).$$

In either case, the decoder outputs:

  • Predicted mode probabilities: $N_{mode}$

  • Predicted trajectories: $N_{mode}\times N_T \times 5$, with per-step parameters ($\mu_x, \mu_y, s_x, s_y, \rho$), where the covariance matrix is:

$$\mathbf{\Sigma}= \begin{bmatrix} s_x^2 & \rho s_x s_y \\ \rho s_x s_y & s_y^2 \end{bmatrix}$$
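
A short sketch of how the five per-step parameters map to this covariance and to a log-likelihood, using torch.distributions. It assumes $s_x, s_y > 0$ and $|\rho| < 1$ (in practice enforced with, e.g., softplus and tanh); this is an illustration, not the paper's exact parameterization.

```python
import torch
from torch.distributions import MultivariateNormal

def step_log_prob(params, gt_xy):
    """params: (..., 5) = (mu_x, mu_y, s_x, s_y, rho); gt_xy: (..., 2) ground-truth point."""
    mu = params[..., :2]
    s_x, s_y, rho = params[..., 2], params[..., 3], params[..., 4]
    cov = torch.stack([
        torch.stack([s_x**2, rho * s_x * s_y], dim=-1),
        torch.stack([rho * s_x * s_y, s_y**2], dim=-1),
    ], dim=-2)                                          # (..., 2, 2), the Sigma above
    return MultivariateNormal(mu, covariance_matrix=cov).log_prob(gt_xy)
```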

Loss

The total loss is the sum of a classification loss and a regression loss:

  1. Classification loss: select the predicted trajectory closest to the ground truth and compute the cross-entropy of its mode probability.

  2. Regression loss: minimize the negative log-likelihood of the ground-truth trajectory under the selected Gaussian.
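
A hedged sketch of the combined loss, assuming the closest mode is chosen by average displacement and reusing the step_log_prob helper sketched above; weighting, masking, and other details of an actual implementation are omitted:

```python
import torch
import torch.nn.functional as F

def trajectory_loss(logits, traj, gt):
    """logits: (B, M); traj: (B, M, T, 5); gt: (B, T, 2). Illustrative only."""
    # Pick, per sample, the mode whose mean trajectory is closest to the ground truth.
    with torch.no_grad():
        dist = (traj[..., :2] - gt[:, None]).norm(dim=-1).mean(-1)   # (B, M) average displacement
        best = dist.argmin(dim=-1)                                    # (B,) index of the closest mode
    # Classification: cross-entropy against the selected mode.
    cls_loss = F.cross_entropy(logits, best)
    # Regression: negative log-likelihood of the ground truth under the selected mode.
    sel = traj[torch.arange(traj.shape[0]), best]                     # (B, T, 5)
    nll = -step_log_prob(sel, gt).mean()                              # reuses the helper sketched above
    return cls_loss + nll
```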

References

[1] N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion Forecasting via Simple & Efficient Attention Networks,” Jul. 12, 2022, arXiv:2207.05844. doi: 10.48550/arXiv.2207.05844.

[2] A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2017.

[3] A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, O. Hénaff, M. M. Botvinick, A. Zisserman, O. Vinyals, and J. Carreira, “Perceiver IO: A General Architecture for Structured Inputs & Outputs,” CoRR, vol. abs/2107.14795, 2021. [Online]. Available: https://arxiv.org/abs/2107.14795