Understanding PaTH Attention

This is a post that walks through my understanding of an awesome paper by my friend Songlin, PaTH Attention: Position Encoding via Accumulating Householder Transformations. She presented it during our seminar's summer bootcamp; make sure to check that out as well (YouTube).

Many of you will know RoPE, which encodes relative positions by rotating queries and keys:

$$\ell_{ij} \;=\; q_i^\top\, T^{\mathrm{RoPE}}_{i\leftarrow j}\, k_j, \qquad T^{\mathrm{RoPE}}_{i\leftarrow j} \;=\; R^{\,i-j}, \qquad \tilde q_i \;=\; R^{-i} q_i, \quad \tilde k_j \;=\; R^{-j} k_j$$
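
To make this concrete, here is a minimal numpy sketch of the 2-D case (mine, not from any library): each position's query and key is rotated independently by the appropriate power of $R$, and the resulting logits depend only on the offset $i-j$.

```python
import numpy as np

def rotate_per_position(x, theta):
    """Apply the 2-D rotation R^i (angle i*theta) to x[i], for every i at once.

    Each position is handled independently: a constant-depth,
    fully parallel map over the sequence.
    """
    angles = theta * np.arange(x.shape[0])
    cos, sin = np.cos(angles), np.sin(angles)
    return np.stack([cos * x[:, 0] - sin * x[:, 1],
                     sin * x[:, 0] + cos * x[:, 1]], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.standard_normal((6, 2)), rng.standard_normal((6, 2))
theta = 0.1

q_tilde = rotate_per_position(q, -theta)   # \tilde q_i = R^{-i} q_i
k_tilde = rotate_per_position(k, -theta)   # \tilde k_j = R^{-j} k_j
logits = q_tilde @ k_tilde.T               # logits[i, j] = q_i^T R^{i-j} k_j

# sanity check against an explicit R^{i-j} for one pair (i, j)
i, j = 4, 1
a = (i - j) * theta
R_ij = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
assert np.allclose(logits[i, j], q[i] @ R_ij @ k[j])
```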

It's interesting to think about complexity classes now. The equations above are fully parallelizable: no matter how many tokens are in the sequence, the depth of the computation graph is constant, putting this computation in $\text{TC}^0$. Now why is this important?


Well, consider a state-tracking problem in $S_4$, where we need to output the group element after a series of permutations. For example, let $x_1 = (1,2)(3,4)$ and $x_2 = (1,3)(2,4)$, both elements of $S_4$. The desired output after the composition is $x_1 \cdot x_2 = (1,4)(2,3)$. For hard enough permutation groups (the paper's analysis uses $S_5$, whose word problem is $\text{NC}^1$-complete), this kind of state tracking cannot be done by a model with a fixed-depth computational graph; it needs a computational graph whose depth grows logarithmically with the input length.
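
As a quick sanity check of that composition (my own snippet), representing a permutation as a 0-indexed tuple `p` where `p[i]` is the image of `i`:

```python
x1 = (1, 0, 3, 2)          # (1,2)(3,4) in 1-indexed cycle notation
x2 = (2, 3, 0, 1)          # (1,3)(2,4)

def compose(f, g):
    """(f . g)(i) = f(g(i)): apply g first, then f."""
    return tuple(f[g[i]] for i in range(len(f)))

print(compose(x1, x2))     # (3, 2, 1, 0), i.e. (1,4)(2,3)
```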

So what models can do this? Well, certain recurrent models can; check out my previous blog post. So how can we modify transformers, which are currently in $\text{TC}^0$, to do state tracking? We need to modify the computational graph somewhere so that its depth grows as we see more inputs. Revisiting the idea of position embeddings: instead of RoPE, which uses the fixed matrix $R^{i-j}$, what if we use a transformation whose depth grows as the distance $i-j$ grows? Naturally, the idea of having not just a single rotation matrix $R$ but a composition seems reasonable, but there is actually a cool math fact:

Householder transformations are strictly more expressive than a single rotation matrix. While a rotation matrix belongs to the special orthogonal group $SO(n)$ (determinant +1), Householders generate the entire orthogonal group $O(n)$, which includes both rotations and reflections. In fact, any rotation matrix can be written as a product of Householders, but not vice versa, so working with them gives us a strictly larger space of transformations.
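
A quick numerical illustration of this fact (my own snippet): a single Householder has determinant $-1$, so it is a reflection rather than a rotation, while a product of two Householders lands back in $SO(n)$:

```python
import numpy as np

def householder(w):
    """H = I - 2 w w^T for a unit vector w (an orthogonal reflection)."""
    w = w / np.linalg.norm(w)
    return np.eye(len(w)) - 2.0 * np.outer(w, w)

rng = np.random.default_rng(0)
H1 = householder(rng.standard_normal(3))
H2 = householder(rng.standard_normal(3))

print(np.linalg.det(H1))                    # ~ -1.0: a reflection, not in SO(3)
print(np.linalg.det(H1 @ H2))               # ~ +1.0: two reflections make a rotation
assert np.allclose(H1 @ H1.T, np.eye(3))    # still orthogonal
```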


Following this we arrive at the PaTH formulation:

$$o_t = \frac{1}{Z_t}\sum_{j=1}^t v_j \exp\!\Big( k_j^\top \Big( \prod_{s=j+1}^t \mathbf{H}_s \Big) q_t \Big)$$

$Z_t$ is the softmax normalization, $\mathbf{H}_t = \mathbf{I} - \beta_t w_t w_t^\top$, and $w_t$ is some function of the token representation $x_t$.

The PaTH formulation still retains the associative-recall capabilities of quadratic transformers because, unlike linear transformers, we keep the non-linear softmax, which prevents the attention from collapsing into a single state matrix. Note: I am skipping a large chunk of the paper that covers efficient training; my main interest is the capabilities and the formulation of the model, but Section 3 is also critical.
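
To make the formulation concrete, here is a deliberately naive, quadratic-time reference implementation of the equation above for the last position (my own sketch in numpy, not the paper's efficient algorithm from Section 3; the function name and shapes are mine):

```python
import numpy as np

def path_attention_step(q, k, v, w, beta):
    """Naive PaTH attention output o_t for the final position t = len(k) - 1.

    q    : (d,)      query of the final token, q_t
    k, v : (t+1, d)  keys and values for positions 0..t
    w    : (t+1, d)  unit vectors defining H_s = I - beta_s * w_s w_s^T
    beta : (t+1,)    per-position scalars
    """
    t = k.shape[0] - 1
    logits = np.empty(t + 1)
    for j in range(t + 1):
        # apply the accumulated product (H_{j+1} H_{j+2} ... H_t) to q_t,
        # multiplying right-to-left so that H_t acts first
        x = q.copy()
        for s in range(t, j, -1):
            x = x - beta[s] * w[s] * (w[s] @ x)
        logits[j] = k[j] @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax normalization, 1/Z_t
    return probs @ v                          # sum_j softmax_j * v_j

# tiny usage example with random inputs
rng = np.random.default_rng(0)
T, d = 4, 8
k = rng.standard_normal((T, d))
v = rng.standard_normal((T, d))
q = rng.standard_normal(d)
w = rng.standard_normal((T, d))
w /= np.linalg.norm(w, axis=1, keepdims=True)
beta = np.full(T, 2.0)                        # beta = 2 gives exact Householders
print(path_attention_step(q, k, v, w, beta))
```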

Now we will show how PaTH can solve a state-tracking ($\text{NC}^1$) problem. In Appendix A, Representation Power of Transformers with PaTH Attention, we see how the PaTH attention block can solve a swapping task: given the permutation group on 5 elements, $S_5$, consider a sequence of tokens denoting the swaps. The sequence is constructed in the following form:

$$\# \, [a_1 \leftrightarrow b_1] \, [a_2 \leftrightarrow b_2] \, \ldots \, [a_n \leftrightarrow b_n]$$

The $\#$ is the start token, and each bracketed term is another token denoting a particular swap action; thus, including the $\#$, there are 21 tokens in total, each with a unique one-hot vector $u$. I am now going to go a bit out of order in how I explain things (just to write down my understanding). First we define the Householder weight $W_w$.

$$W_w u = w[u] = \frac{e_x - e_y}{\sqrt{2}} \quad \text{for } u = [x \leftrightarrow y], \quad \text{and } 0 \text{ if } u = \#$$

$e_x, e_y$ are standard basis vectors, and when we now define the Householder transformation, we can see how it performs a swap operation.

$$H = I - 2ww^\top, \qquad w = \tfrac{e_x - e_y}{\sqrt{2}}.$$

$$He_x = e_x - 2w(w^\top e_x) = e_x - 2w\!\left(\tfrac{1}{\sqrt{2}}\right) = e_x - (e_x - e_y) = e_y.$$

$$He_y = e_y - 2w(w^\top e_y) = e_y - 2w\!\left(-\tfrac{1}{\sqrt{2}}\right) = e_y + (e_x - e_y) = e_x.$$

For $j \notin \{x,y\}$:

$$He_j = e_j - 2w(w^\top e_j) = e_j.$$

Thus, it just swaps $e_x$ and $e_y$ while leaving everything else in place. Cool!
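
A quick numpy check of this derivation (mine), taking $x = 1$, $y = 3$ in five dimensions:

```python
import numpy as np

e = np.eye(5)
w = (e[0] - e[2]) / np.sqrt(2)          # x = 1, y = 3 (0-indexed: coords 0 and 2)
H = np.eye(5) - 2 * np.outer(w, w)

assert np.allclose(H @ e[0], e[2])      # H e_x = e_y
assert np.allclose(H @ e[2], e[0])      # H e_y = e_x
assert np.allclose(H @ e[4], e[4])      # untouched coordinates stay put
```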

$$
\begin{aligned}
W_k u &= k[u] = \mathbf{1}\{u = \#\}\,(e_1 + 2e_2 + 3e_3 + 4e_4 + 5e_5 - e_6), \\
W_q u &= q[u] = n\,(e_1 + 2e_2 + 3e_3 + 4e_4 + 5e_5 + 54.5\,e_6), \\
W_w u &= w[u] = (e_x - e_y)/\sqrt{2} \ \text{for } u = [x \leftrightarrow y], \ \text{and } 0 \text{ if } u = \#, \\
W_v u &= v[u] = \mathbf{1}\{u = \#\}\, e_1, \\
\beta &= 2.
\end{aligned}
$$

These are all the parameters we need. The only non-zero key and value are those of the first token $\#$; if we then use the PaTH formulation to compute the attention logit between $k_0$ and the last query, we obtain:

$$s_0 \;=\; k_0^{\top}\!\Bigg(\prod_{s=1}^{n} H_s\Bigg) q_n \;=\; n\left(\sum_{i=1}^{5} i\,\pi(i) - 54.5\right).$$

where $\pi(i)$ is the $i$-th element of the vector obtained after all the permutations/swaps are applied to the original vector $[1,2,3,4,5]$. So when $\pi$ is the identity after all the swaps, $s_0 > 0$, and otherwise $s_0 < 0$. There is a bit more to turning this into a final prediction of $1$ or $-1$, but this is the essence of it, and I will end here.
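
To convince myself of the whole construction, here is a small numpy check (my own helper `s0_for_swaps`, not from the paper) that builds $k_0$, $q_n$, and the Householders for a sequence of swaps, and confirms that the sign of $s_0$ detects whether the composition is the identity:

```python
import numpy as np

e = np.eye(6)                                   # basis e_1..e_6 (0-indexed rows)

def s0_for_swaps(swaps):
    """s_0 = k_0^T (H_1 H_2 ... H_n) q_n for a list of swaps (a, b), 1-indexed."""
    n = len(swaps)
    k0 = e[0] + 2*e[1] + 3*e[2] + 4*e[3] + 5*e[4] - e[5]
    qn = n * (e[0] + 2*e[1] + 3*e[2] + 4*e[3] + 5*e[4] + 54.5*e[5])
    P = np.eye(6)
    for a, b in swaps:                          # accumulate H_1 H_2 ... H_n
        w = (e[a-1] - e[b-1]) / np.sqrt(2)
        P = P @ (np.eye(6) - 2*np.outer(w, w))
    return k0 @ P @ qn

# a composition that is NOT the identity (a 3-cycle) ...
print(s0_for_swaps([(1, 2), (2, 3)]))                    # -5.0: negative
# ... and one that is the identity (each swap undone in reverse order)
print(s0_for_swaps([(1, 2), (2, 3), (2, 3), (1, 2)]))    #  2.0: positive
```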

Final thoughts. I think this paper is very interesting because it adds the extra computation in the positional encoding, showing how we can do more complex transformations of the tokens when comparing them for attention. It is also interesting to think about what other axes we can use to efficiently inject more computation into the transformer.