Meditations

Self-Attention 1, Raschka

In this post, I will be documenting my understanding of self-attention as described in Raschka's (2025) textbook Build a Large Language Model (From Scratch).

Computing self-attention without trainable weights

First, we'll define the simple self-attention mechanism without trainable weights. This corresponds to section 3.3.1 in Raschka (2025). The process comprises three steps: (1) compute attention scores, (2) normalize them into attention weights, and (3) compute the context vectors. These steps will make sense as we build them up and define them explicitly.

We'll begin with the input vectors:

import torch
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

Here, we have six tokens embedded as three-dimensional vectors. Words with similar meanings will have similar values in this space, so we can compute similarity using a dot product. This brings us to our first step.

1. Compute self-attention scores for each token

For each token, we can compute a similarity score with every vector in the input (including itself) as a dot product.

For example, for the second token 'journey' (index 1, since indexing starts at 0), we can compute its similarity to the third token 'starts' (index 2). These two tokens are the most similar in this input.

score = torch.dot(inputs[1], inputs[2]) # returns tensor(1.4754)
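As a quick sanity check (my own, not from the book), we can confirm that the dot product here is just the sum of elementwise products:

```python
import torch

# Same toy embeddings as above
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

# The dot product is the sum of elementwise products
manual = (inputs[1] * inputs[2]).sum()
builtin = torch.dot(inputs[1], inputs[2])
print(manual, builtin)  # both ~= 1.4754
```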

I'm uncertain, at the moment, what the scale of this number means, but this is likely addressed when we enter the normalization stage. Before that, let's generalize and compute this score for every input token against every other token. We can do this with a single matrix multiplication:

# Generalize: compute raw attention scores for all input pairs
attn_scores = inputs @ inputs.T

This gives us the tensor below. As we can see, we've created a symmetric similarity matrix. The entries at (1, 2) and (2, 1) match the journey-to-starts score we computed earlier.

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
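To convince myself of the symmetry claim (my own check, not from the book), we can verify that the matrix equals its transpose and that the matmul reproduces the pairwise dot product from earlier:

```python
import torch

# Same toy embeddings as above
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

attn_scores = inputs @ inputs.T

# Symmetric because dot products commute: x_i . x_j == x_j . x_i
assert torch.allclose(attn_scores, attn_scores.T)

# Row 1, column 2 is exactly the journey-to-starts dot product
assert torch.isclose(attn_scores[1, 2], torch.dot(inputs[1], inputs[2]))
```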

2. Normalize the attention scores

Next, we will normalize these scores using a softmax. Normalizing keeps training stable and makes the weights interpretable as proportions.

normalized_attn_weights = torch.softmax(attn_scores, dim=-1)
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])

Our aim is that each row will sum to 1. Passing dim=-1 applies the softmax across the last dimension, i.e., along each row. We could instead divide each row by its sum, but the softmax handles values close to zero or very large values in a more stable manner than plain division, and it guarantees the weights are positive.
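A small check of that claim (my own sketch): every row of the softmax output sums to 1, and plain division by the row sum gives different weights because the softmax exponentiates first.

```python
import torch

# Same toy embeddings as above
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

attn_scores = inputs @ inputs.T
attn_weights = torch.softmax(attn_scores, dim=-1)

# Every row sums to 1
print(attn_weights.sum(dim=-1))  # tensor([1., 1., 1., 1., 1., 1.])

# Naive alternative: divide each row by its sum. Also sums to 1 here,
# but with negative scores it could produce negative "weights";
# softmax always yields positive weights.
naive = attn_scores / attn_scores.sum(dim=-1, keepdim=True)
print(torch.allclose(naive, attn_weights))  # False: the weights differ
```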

3. Compute the context vectors

Finally, we compute the context vectors as a weighted sum of the input vectors, using the normalized attention weights.

context_vectors = normalized_attn_weights @ inputs

Each final vector is a mixture of all the input vectors, weighted by how strongly the corresponding token attends to each of the others.

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
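To make the "weighted sum" concrete (my own check, not from the book), we can rebuild the context vector for 'journey' with an explicit loop and confirm it matches row 1 of the matmul result:

```python
import torch

# Same toy embeddings as above
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

attn_weights = torch.softmax(inputs @ inputs.T, dim=-1)
context_vectors = attn_weights @ inputs

# Context vector for 'journey' (row 1) as an explicit weighted sum
manual = torch.zeros(3)
for j in range(inputs.shape[0]):
    manual += attn_weights[1, j] * inputs[j]

assert torch.allclose(manual, context_vectors[1], atol=1e-6)
```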

Closing

This step-by-step procedure allows us to understand the simple self-attention mechanism. By breaking down how the input vectors are used, we gain an intuitive sense of how similarity is encoded in the dot product and how the normalized attention weights blend the inputs into context vectors.
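Putting the three steps together, here is my own sketch of the whole mechanism as a single function (function name is mine, not from the book):

```python
import torch

def simple_self_attention(inputs: torch.Tensor) -> torch.Tensor:
    """Simple self-attention without trainable weights (Raschka, sec. 3.3.1)."""
    attn_scores = inputs @ inputs.T                    # step 1: pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)  # step 2: normalize each row
    return attn_weights @ inputs                       # step 3: weighted sum of inputs

# Same toy embeddings as above
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

context_vectors = simple_self_attention(inputs)
print(context_vectors.shape)  # torch.Size([6, 3])
```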