Chapter 4 — Why this site exists

Matrices in Artificial Intelligence

Strip away the mystique and modern AI is one operation, repeated trillions of times: matrix multiplication. Here is exactly where, and why.

~98% of the compute in training a large language model is matrix multiplication
Wx + b: the equation inside every layer of every neural network
QKᵀ: the transposed matrix product at the heart of the transformer

1. A neural network layer is a matrix

A "layer" of a neural network with \(n\) inputs and \(m\) outputs is nothing more than an \(m \times n\) weight matrix \(W\), a bias vector \(b\) of length \(m\), and a nonlinear activation function \(f\) applied to each entry:

$$y = f(Wx + b)$$

The "knowledge" of the network, everything it learned from data, lives in the numbers inside \(W\). When you read that a model has 70 billion parameters, that means its weight matrices (plus a comparatively small number of bias and normalization values) contain 70 billion entries. Training a network means nudging those matrix entries, step by step, until the outputs match the training data. Running a network (inference) means doing the multiplications in the equation above, layer after layer.

Every concept here came from Chapter 1: \(Wx\) is matrix–vector multiplication, the shape rule dictates the architecture (a layer mapping 4,096 inputs to 11,008 outputs is an 11,008 × 4,096 matrix), and stacking layers composes these maps, with the nonlinearity \(f\) applied between one matrix and the next.
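The layer equation can be sketched in a few lines of NumPy. This is a minimal illustration, not any framework's implementation: the toy sizes, the random weights, the helper `layer_param_count`, and the choice of ReLU for \(f\) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, m_outputs = 4, 3  # toy sizes; the text's example uses 4,096 and 11,008

# The weight matrix has shape (m, n) so that W @ x is defined for x of length n.
W = rng.standard_normal((m_outputs, n_inputs))
b = np.zeros(m_outputs)
x = rng.standard_normal(n_inputs)


def relu(z):
    """One common choice for the nonlinearity f (an assumption here)."""
    return np.maximum(z, 0.0)


y = relu(W @ x + b)  # y = f(Wx + b)


def layer_param_count(n, m):
    """Hypothetical helper: entries of an m x n weight matrix plus m biases."""
    return m * n + m


big_layer_params = layer_param_count(4096, 11008)
print(W.shape, y.shape, big_layer_params)
```

The shape rule is the whole story: the matrix must have as many columns as there are inputs and as many rows as there are outputs.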

2. Words become vectors, vocabularies become matrices

Language models cannot read text; they read numbers. Every token (a word or a piece of a word) is assigned a vector, a list of, say, 4,096 numbers, and the full vocabulary forms an embedding matrix: 100,000 tokens × 4,096 dimensions. Meaning lives in geometry: words used in similar contexts end up as nearby vectors, and famously, king − man + woman ≈ queen is literal vector arithmetic on rows of this matrix.

A sentence becomes a matrix too — one row per token. That matrix is what flows through the network.
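As a rough sketch of that lookup, assuming a five-word toy vocabulary and random (untrained) embeddings, a sentence turns into a matrix by selecting rows of the embedding matrix. The names `vocab`, `token_id`, and `E` are illustrative, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (hypothetical)
token_id = {tok: i for i, tok in enumerate(vocab)}

d_model = 8  # toy embedding width; the text uses 4,096
E = rng.standard_normal((len(vocab), d_model))  # one row per token

sentence = ["the", "cat", "sat", "on", "the", "mat"]
ids = np.array([token_id[t] for t in sentence])

# Row lookup: the sentence becomes a (num_tokens, d_model) matrix.
X = E[ids]

# The lookup is the same as multiplying a one-hot matrix by E.
one_hot = np.eye(len(vocab))[ids]
assert np.allclose(one_hot @ E, X)

print(X.shape)
```

Both occurrences of "the" map to the same row of `E`, so rows 0 and 4 of `X` are identical until later layers mix in context.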

3. Attention: the transpose earns its keep

The transformer architecture behind ChatGPT, Claude, and Gemini is built on one formula — and it is pure Chapter 1 material:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Read it as matrix operations:

| Piece | What it is |
| --- | --- |
| Q, K, V | Query, Key, Value matrices, each produced by multiplying the input by a learned weight matrix |
| QKᵀ | A matrix times a transpose: every token's query dotted against every token's key, scoring how much each word should attend to every other word |
| softmax(·) | Normalizes each row of scores into percentages |
| (·)V | One final matrix multiplication blends the value vectors according to those percentages |

That is the mechanism that lets a model work out that "it" refers to "the cat" three sentences back: three matrix multiplications to produce Q, K, and V, then two more and a transpose inside the formula itself.
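The formula translates almost symbol for symbol into NumPy. The sketch below assumes a single attention head with no masking, toy dimensions, and random matrices `W_q`, `W_k`, `W_v` standing in for learned weights.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by each row's max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (tokens, tokens) score matrix
    weights = softmax(scores)        # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8

X = rng.standard_normal((n_tokens, d_model))  # one row per token
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # three matrix multiplications
out, weights = attention(Q, K, V)    # two more, plus the transpose

print(out.shape, weights.shape)
```

Row `i` of `weights` says how strongly token `i` attends to every token in the sequence, and row `i` of `out` is the corresponding blend of value vectors.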

4. Images are matrices, literally

A grayscale photo is a matrix of brightness values: a 1080×1920 image is a 1080×1920 matrix. A color image is three stacked matrices (red, green, blue channels). Convolutional neural networks slide small filter matrices across the image, and under the hood frameworks often convert even that into one giant matrix multiplication (the im2col trick) because GPUs do matmul so well.
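To make the im2col idea concrete, here is a small sketch assuming a single-channel image, one filter, stride 1, and no padding. The functions `im2col` and `conv2d_direct` are written for this example and are far simpler than what real frameworks do.

```python
import numpy as np


def im2col(image, kh, kw):
    """Stack every kh x kw patch of a 2-D image as one row (stride 1, no padding)."""
    H, W = image.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    rows = [
        image[i:i + kh, j:j + kw].ravel()
        for i in range(out_h)
        for j in range(out_w)
    ]
    return np.array(rows), (out_h, out_w)


def conv2d_direct(image, kernel):
    """Reference sliding-window version (cross-correlation, the deep learning convention)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.standard_normal((6, 7))
kernel = rng.standard_normal((3, 3))

patches, out_shape = im2col(image, 3, 3)  # (20, 9): one row per output pixel
via_matmul = (patches @ kernel.ravel()).reshape(out_shape)

assert np.allclose(via_matmul, conv2d_direct(image, kernel))
print(patches.shape, via_matmul.shape)
```

With many filters, `kernel.ravel()` becomes a matrix with one column per filter, and the whole convolution is a single matrix product.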

5. PCA: eigenvectors compress reality

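Real datasets often have hundreds of correlated features. Principal Component Analysis finds the directions that actually matter, and it is a direct application of Chapter 3. With \(X\) the \(n \times d\) data matrix (one row per sample, each column shifted to have zero mean):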

$$C = \frac{1}{n}X^TX \qquad \text{then find} \qquad Cv = \lambda v$$

The eigenvectors of the data's covariance matrix are the directions of maximum variance; the eigenvalues say how much variance each direction carries. Keep the top few eigenvectors and you compress a 1,000-dimensional dataset to 50 dimensions while preserving most of its structure. The same eigendecomposition idea powers recommendation systems (matrix factorization of the user × item ratings matrix) and Google's original PageRank (the dominant eigenvector of the web's link matrix).
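A minimal PCA sketch follows, assuming synthetic data generated from two hidden directions plus a little noise. It centers the data before forming the covariance matrix and uses the \(1/n\) scaling from the formula above (\(1/(n-1)\) is also common).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 5 features, driven mostly by 2 latent directions.
latent = rng.standard_normal((200, 2))
mixing = rng.standard_normal((2, 5))
X = latent @ mixing + 0.05 * rng.standard_normal((200, 5))

Xc = X - X.mean(axis=0)      # center each feature (column)
C = Xc.T @ Xc / Xc.shape[0]  # covariance matrix, C = (1/n) X^T X

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # re-sort: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
Z = Xc @ eigvecs[:, :k]  # project onto the top-k directions
explained = eigvals[:k].sum() / eigvals.sum()

print(Z.shape, round(float(explained), 4))
```

Here `explained` is the fraction of total variance kept by the top two directions; for data built this way it should be close to 1.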

6. Training is linear algebra in reverse

Gradient descent — the algorithm that trains every modern model — computes how the loss changes with respect to every weight matrix and steps downhill:

$$W \leftarrow W - \eta \, \frac{\partial L}{\partial W}$$

The gradient \(\partial L / \partial W\) is itself a matrix the same shape as \(W\), and backpropagation computes it by chaining matrix multiplications backward through the network — transposes everywhere, courtesy of the identity \((AB)^T = B^TA^T\). And the simplest learning model of all, linear regression, is solved exactly by the normal equation from Chapter 2: \(\theta = (X^TX)^{-1}X^Ty\).
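Both ideas in this section fit in one short sketch: solving linear regression with the normal equation, then reaching the same answer with the gradient descent update. The data is synthetic, and the loss \(L = \frac{1}{2n}\lVert X\theta - y\rVert^2\), the learning rate, and the step count are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3

X = rng.standard_normal((n, d))
theta_true = np.array([2.0, -1.0, 0.5])  # made-up ground truth
y = X @ theta_true + 0.01 * rng.standard_normal(n)

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system avoids forming the inverse explicitly.
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on L = (1/2n) * ||X theta - y||^2.
# The gradient is (1/n) X^T (X theta - y): note the transpose.
theta = np.zeros(d)
eta = 0.1  # learning rate (assumed)
for _ in range(500):
    grad = X.T @ (X @ theta - y) / n
    theta = theta - eta * grad  # theta <- theta - eta * dL/dtheta

print(theta_exact, theta)
```

The gradient has the same shape as the parameters it updates, and the transpose \(X^T\) appears because the error is being pushed backward through the multiplication by \(X\).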

7. SVD and LoRA: the low-rank revolution

The Singular Value Decomposition generalizes eigendecomposition to any matrix, square or not:

$$A = U \Sigma V^T$$

\(U\) and \(V\) are orthogonal matrices (rotations, possibly combined with a reflection) and \(\Sigma\) is diagonal with non-negative entries, so every matrix, no matter how messy, is a rotation, then a stretch, then another rotation. The diagonal entries of \(\Sigma\) (the singular values) rank the matrix's "directions" by importance. Keep only the top \(r\) of them and you get the best possible rank-\(r\) approximation of \(A\), which is the math behind image compression and the Netflix-style recommender systems that factorize a sparse ratings matrix.
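A truncated SVD is a few lines with `np.linalg.svd`. The sketch below uses a small random matrix as a stand-in for real data and checks a known property: the Frobenius-norm error of the rank-\(r\) approximation equals the root of the sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))  # any matrix, square or not

# s holds the singular values, sorted from largest to smallest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, A)  # A = U Sigma V^T

r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # keep only the top-r directions

err = np.linalg.norm(A - A_r)          # Frobenius norm of the residual
discarded = np.sqrt(np.sum(s[r:] ** 2))

print(np.linalg.matrix_rank(A_r), np.isclose(err, discarded))
```

Storing `U[:, :r]`, `s[:r]`, and `Vt[:r, :]` takes \(r(m + n + 1)\) numbers instead of \(mn\), which is where the compression comes from.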

This idea now drives how LLMs are customized. Fine-tuning a 70-billion-parameter model by updating every weight matrix is prohibitively expensive. LoRA (Low-Rank Adaptation) observes that the change needed during fine-tuning is approximately low-rank, so instead of updating \(W\) directly it learns two thin matrices:

$$W' = W + BA \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d},\; r \ll d$$

With \(r = 8\) and \(d = 4{,}096\), the trainable parameters for one \(d \times d\) matrix drop from \(d^2 \approx 16.8\) million to \(2dr = 65{,}536\), a factor of 256, while quality often barely moves. Many of the custom fine-tunes you encounter today are pairs of skinny matrices riding on top of a frozen giant one. Pure shape-rule arithmetic from Chapter 1.
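The parameter arithmetic and the forward pass can both be sketched briefly. This is an illustration under assumptions, not a LoRA library: the demo uses a small `d_small` to stay light, random values stand in for real weights, and starting `B` at zero (so that \(W' = W\) before any training) follows a commonly used initialization.

```python
import numpy as np

# Parameter counts for the sizes quoted in the text.
d, r = 4096, 8
full_params = d * d              # updating W directly
lora_params = d * r + r * d      # B is d x r, A is r x d
ratio = full_params // lora_params

# Forward pass on a small example.
rng = np.random.default_rng(0)
d_small = 64
W = rng.standard_normal((d_small, d_small))   # frozen pretrained weights (stand-in)
A = rng.standard_normal((r, d_small)) * 0.01  # trainable, r x d
B = np.zeros((d_small, r))                    # trainable, d x r, zero at the start

x = rng.standard_normal(d_small)

# (W + BA) x computed without ever forming the d x d product BA.
y = W @ x + B @ (A @ x)

print(full_params, lora_params, ratio)
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the extra work proportional to \(dr\) instead of \(d^2\).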

8. Why GPUs rule the world

Matrix multiplication has a rare property: every output entry can be computed independently and simultaneously. A CPU with 16 cores computes 16 things at once; a modern GPU has tens of thousands of small cores and dedicated tensor units that do nothing but multiply-accumulate. The chip shortage, the data center buildout, the trillion-dollar valuations of chipmakers — all of it traces back to the demand for one mathematical operation you learned on the operations page.
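The independence claim is easy to see in code. In this sketch, written with plain Python loops for clarity rather than speed, each output entry is a dot product of one row and one column and never reads any other output entry.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

C = np.empty((4, 5))
# Entry (i, j) depends only on row i of A and column j of B,
# so these 20 iterations could run in any order, or all at once.
for i in range(4):
    for j in range(5):
        C[i, j] = A[i, :] @ B[:, j]

assert np.allclose(C, A @ B)
print(C.shape)
```

A GPU effectively hands each of those independent dot products, or a tile of them, to a different group of hardware threads.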

The map, in one table

| AI concept | Matrix machinery underneath |
| --- | --- |
| Neural network layer | y = f(Wx + b): matrix multiplication plus a bias |
| Model "parameters" | The entries of the weight matrices |
| Word embeddings | Rows of a vocabulary × dimension matrix |
| Transformer attention | softmax(QKᵀ/√d)V: multiplication and transpose |
| Images / convolutions | Pixel matrices; convolution recast as matmul |
| PCA / compression | Eigendecomposition of the covariance matrix |
| Recommendations | Factorizing the user × item matrix |
| Fine-tuning (LoRA) | Low-rank update W + BA via two thin matrices |
| Compression / recommenders | Truncated SVD: A ≈ UΣVᵀ keeping top singular values |
| Training (backprop) | Chained matrix products of gradients |
| GPUs / AI hardware | Silicon specialized for parallel matrix multiply |

If any row of that table felt shaky, the chapter that explains it is one click away: Operations, Linear Equations, Eigenvalues & Eigenvectors. Ready to go deeper? The papers, books, and courses page maps the canonical sources.
