Recurrent Networks and Test Time Training (TTT)

Notes on Interesting Papers on Recurrent Networks and their connection to Test Time Training (TTT)

Songlin has a great slide deck on most of the papers I will go through here. I am just writing this for my own learning purposes.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

This paper presents an interesting perspective on how to update the hidden state of a recurrent model. Let's start by reviewing the traditional recurrent network structure:

$$W_t = g(x_t, W_{t-1}), \quad z_t = f(x_t; W_t)$$

The idea is to give the hidden state its own optimization process. At test time, when a new token $x_t$ comes in, instead of just transforming the hidden state with a traditional fixed function, we still have some function to update it, but that function is a gradient step on some loss function $\ell$.

$$W_t = g(x_t, W_{t-1}) = W_{t-1} - \eta \nabla \ell(W_{t-1}; x_t)$$

The question then is how we can construct such a loss function. The one the authors investigate takes the following form:

$$\begin{aligned} \ell(W; x_t) &= \lVert f(\theta_K x_t; W) - \theta_V x_t \rVert^2 \\ z_t &= f(\theta_Q x_t; W_t) \end{aligned}$$

Here, the $\theta$ parameters project $x_t$ and are trainable. This formulation necessitates two optimization loops:

  1. Inner loop: optimizes the weights $W$ (which can be viewed as the weights of $f$)
  2. Outer loop: optimizes all remaining parameters, including the $\theta$ projections

This dual-loop structure elegantly avoids the need to backpropagate through the same variable twice, thus circumventing the computation of Hessians.
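To make this concrete, here is a minimal numpy sketch of the inner loop, assuming $f$ is linear, $f(u; W) = Wu$ (the paper also considers an MLP for $f$), and using the $\frac{1}{2}\lVert\cdot\rVert^2$ convention so the gradient is clean. The dimensions, initialization, and learning rate here are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # token / hidden dimension (made up for this sketch)
theta_K, theta_V, theta_Q = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W = np.zeros((d, d))        # hidden state = weights of the inner model f
eta = 0.1                   # inner-loop learning rate

def ttt_step(W, x):
    """One recurrent step: inner-loop gradient update on W, then read out z_t."""
    k, v, q = theta_K @ x, theta_V @ x, theta_Q @ x
    # grad of 0.5 * ||W k - v||^2 w.r.t. W is (W k - v) k^T
    W = W - eta * np.outer(W @ k - v, k)
    z = W @ q               # z_t = f(theta_Q x_t; W_t)
    return W, z

for x in rng.normal(size=(16, d)):   # a toy sequence of 16 tokens
    W, z = ttt_step(W, x)
```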


Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Starting with the preliminaries, single-head softmax attention is:

$$q_t, k_t, v_t = W_Q x_t, W_K x_t, W_V x_t, \quad o_t = \sum_{i=1}^t \frac{\exp(k_i^\top q_t)}{\sum_{j=1}^t \exp(k_j^\top q_t)} v_i$$

We can view $\exp(k_i^\top q_t)$ as a kernel and replace it with $\phi(k_i)^\top \phi(q_t)$, where $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^n$. As $n \rightarrow \infty$, we can create a feature map based on the Taylor series expansion of $\exp(k_i^\top q_t)$, where:

$$\exp(x) = \sum_{n=0}^{\infty} \frac{x^n}{n!}$$

This results in:

$$o_t = \sum_{i=1}^{t} \frac{\phi(k_i)^\top \phi(q_t)}{\sum_{j=1}^{t} \phi(k_j)^\top \phi(q_t)} v_i = \frac{\left(\sum_{i=1}^t v_i \phi(k_i)^\top\right) \phi(q_t)}{\left(\sum_{j=1}^t \phi(k_j)^\top\right) \phi(q_t)}$$
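As an aside that connects to the last section of these notes, here is a quick sketch of a finite feature map from truncating the Taylor series at degree 2, so that $\phi(k)^\top \phi(q) = 1 + k^\top q + (k^\top q)^2 / 2$. The vectors and scaling are made up, and the approximation is only good when $k^\top q$ is small:

```python
import numpy as np

def phi(u):
    # Degree-2 Taylor feature map: phi(k) . phi(q) = 1 + k.q + (k.q)^2 / 2
    return np.concatenate(([1.0], u, np.outer(u, u).ravel() / np.sqrt(2)))

rng = np.random.default_rng(0)
k, q = rng.normal(size=4) / 2, rng.normal(size=4) / 2
print(np.exp(k @ q))      # the softmax kernel value
print(phi(k) @ phi(q))    # its second-order Taylor approximation
```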

For each time $t$, if we ignore the denominator $\left(\sum_{j=1}^t \phi(k_j)^\top\right) \phi(q_t)$ and assume $\phi$ is the identity mapping, we can derive a recurrent formulation:

$$S_t = S_{t-1} + v_t k_t^\top, \quad o_t = S_t q_t$$

Here, $S$ is updated with a rank-1 matrix $v_t k_t^\top$, and $q_t$ is then transformed to yield $o_t$.
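Here is a minimal numpy sketch (shapes made up) checking that this recurrence computes exactly unnormalized causal linear attention:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 8
K, V, Q = (rng.normal(size=(T, d)) for _ in range(3))

# Recurrent form: rank-1 state updates.
S = np.zeros((d, d))
outs = []
for k, v, q in zip(K, V, Q):
    S = S + np.outer(v, k)     # S_t = S_{t-1} + v_t k_t^T
    outs.append(S @ q)         # o_t = S_t q_t

# Parallel form: causal (unnormalized) linear attention.
A = np.tril(Q @ K.T)           # A[t, i] = k_i . q_t for i <= t
print(np.allclose(outs, A @ V))    # True: the two forms match
```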


Optimizing S with the Delta Rule

To ensure $S_t q_t$ is close to $v_t$ when $q_t$ is close to $k_t$, we can define the optimization problem:

$$\mathcal{L}_t(S) = \frac{1}{2} \left\lVert S k_t - v_t \right\rVert^2$$

The update can then be modified with a gradient step:

$$\begin{aligned} S_t &= S_{t-1} - \beta_t \nabla_{S_{t-1}} \mathcal{L}_t(S_{t-1}) = S_{t-1} - \beta_t (S_{t-1} k_t - v_t) k_t^\top \\ &= S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top \end{aligned}$$

This Delta Rule allows $o_t = S_t q_t$ to approximate $v_t$ when $q_t$ is close to $k_t$, while ensuring past key-value pairs are preserved.
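A small sketch of one delta-rule step (dimensions made up). Note the nice special case: with $\beta_t = 1$ and a unit-norm key, the association $k_t \mapsto v_t$ is written exactly:

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """S_t = S_{t-1} (I - beta k k^T) + beta v k^T, i.e. one gradient step
    on L(S) = 0.5 * ||S k - v||^2 with step size beta."""
    return S - beta * np.outer(S @ k - v, k)

rng = np.random.default_rng(0)
d = 8
S = np.zeros((d, d))
k = rng.normal(size=d); k /= np.linalg.norm(k)   # unit-norm key
v = rng.normal(size=d)
S = delta_rule_step(S, k, v, beta=1.0)
print(np.allclose(S @ k, v))   # True: with beta = 1 and ||k|| = 1, k -> v exactly
```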


Titans: Learning to Memorize at Test Time

By adding a momentum term to the gradient-descent view, we obtain the following modified update:

$$\begin{aligned} G_t &= \eta_t G_{t-1} - \theta_t \nabla \mathcal{L}_t(S_{t-1} \mid x_t) \\ S_t &= S_{t-1} + G_t \end{aligned}$$

In the paper they note that $\eta_t$ and $\theta_t$ are data-dependent, controlling how responsive the updates are.
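A minimal sketch of this update, reusing the quadratic loss from the delta rule; in the paper $\eta_t$ and $\theta_t$ come from learned, data-dependent gates, while here they are just fixed scalars:

```python
import numpy as np

def titans_step(S, G, k, v, eta, theta):
    """Momentum-accumulated update (sketch):
    G_t = eta * G_{t-1} - theta * grad L_t(S_{t-1}), S_t = S_{t-1} + G_t."""
    grad = np.outer(S @ k - v, k)   # grad of 0.5 * ||S k - v||^2 w.r.t. S
    G = eta * G - theta * grad
    return S + G, G

rng = np.random.default_rng(0)
d = 8
S, G = np.zeros((d, d)), np.zeros((d, d))
for k, v in zip(rng.normal(size=(16, d)), rng.normal(size=(16, d))):
    S, G = titans_step(S, G, k, v, eta=0.9, theta=0.1)
```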


Gated Delta Networks: Improving Mamba2 with Delta Rule

By introducing a data-dependent decay $\alpha_t \in (0,1)$, the model gains finer control over how much of the state is retained at each step:

$$S_t = S_{t-1}\left(\alpha_t (I - \beta_t k_t k_t^\top)\right) + \beta_t v_t k_t^\top$$

But the question I had is: why does $\alpha_t$ multiply the term $I - \beta_t k_t k_t^\top$? An interpretation of the DeltaNet formulation is:

$$\begin{aligned} S_t &= S_{t-1} - v_t^{\text{old}} k_t^\top + v_t^{\text{new}} k_t^\top \\ &= S_{t-1} - (S_{t-1} k_t) k_t^\top + v_t k_t^\top \end{aligned}$$

You can see that $\alpha_t$ is modulating the state after the prior key-value association has been removed: the retained memory is decayed, while the newly written association $\beta_t v_t k_t^\top$ is left untouched.
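A sketch of one gated step under this interpretation: erase the value currently stored at $k_t$, decay what remains, then write the new association (shapes and gate values made up):

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    # Erase the prior association for k, then decay the retained memory
    # and write the new one: S_t = alpha * (S (I - beta k k^T)) + beta v k^T
    S = S - beta * np.outer(S @ k, k)
    return alpha * S + beta * np.outer(v, k)

rng = np.random.default_rng(0)
d = 8
k, v = rng.normal(size=d), rng.normal(size=d)
S = gated_delta_step(np.zeros((d, d)), k, v, alpha=0.95, beta=0.5)
```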


Test-Time Regression: A Unifying Framework for Designing Sequence Models with Associative Memory

This paper provides a general perspective on associative memory. The general optimization objective is:

$$\min_{m \in \mathcal{M}} \sum_{i=1}^T \frac{1}{2} \gamma_i \left\lVert v_i - m(k_i) \right\rVert_2^2$$

In this framework, we generalize the function that processes keys: instead of $S k_i$, we use $m(k_i)$. For an analytical solution, restrict $m$ to a linear map $M$:

$$M_t = \argmin_{M} \frac{1}{2} \sum_{i=1}^t \left\lVert M k_i - v_i \right\rVert_2^2$$

Assume $v_i \in \mathbb{R}^{d_v}$ and $k_i \in \mathbb{R}^{d_k}$, so $M_t \in \mathbb{R}^{d_v \times d_k}$. Stacking the keys and values as columns of $K \in \mathbb{R}^{d_k \times t}$ and $V \in \mathbb{R}^{d_v \times t}$, the gradient becomes zero when:

$$\begin{aligned} \nabla_M &= \sum_{i=1}^t (M k_i - v_i) k_i^\top = (M K - V) K^\top = 0 \\ \Rightarrow \; M &= V K^\top (K K^\top)^{-1} \end{aligned}$$

If $K K^\top$ isn't invertible, use the pseudo-inverse. In addition, if $K^\top K = I$, we recover vanilla linear attention:

$$o_t = V K^\top q_t$$
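A short sketch (shapes made up) of the closed-form solution; with more keys than dimensions, $K K^\top$ is invertible, and you can see how the regression read-out differs from vanilla linear attention, which drops the $(K K^\top)^{-1}$ correction:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, t = 8, 4, 32
K = rng.normal(size=(d_k, t))             # keys as columns
V = rng.normal(size=(d_v, t))             # values as columns

# Closed-form least-squares memory (K has full row rank here, so K K^T is invertible)
M = V @ K.T @ np.linalg.inv(K @ K.T)
print(np.allclose(M, V @ np.linalg.pinv(K)))   # same as the pseudo-inverse solution

q = rng.normal(size=d_k)
print(M @ q)           # regression read-out
print(V @ K.T @ q)     # vanilla linear attention read-out, no (K K^T)^{-1} term
```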

There is more to this paper that I highly recommend checking out, such as what happens when we introduce softmax and approximate it as non-parametric regression. I think a fun investigation would be to do a Taylor expansion of the $\exp$ function and see how good an approximation we can get with a few orders, converting softmax attention into a linear one (the degree-2 feature map sketched above is a first step in this direction).