The landmark papers
Each of these changed the field, and each is readable once you know the operations on this site. Listed in the order the story unfolded.
| Paper | Year | Why it matters |
|---|---|---|
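| ImageNet Classification with Deep Convolutional Neural Networks (AlexNet) — Krizhevsky, Sutskever, Hinton | 2012 | The result that started the deep learning era: a deep convolutional network trained on GPUs won the ImageNet 2012 challenge by a wide margin, with image matrices and GPU matrix multiplication doing the work |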
| Efficient Estimation of Word Representations in Vector Space (word2vec) — Mikolov et al. | 2013 | Words as vectors; meaning as geometry. The origin of "king − man + woman ≈ queen" |
| Deep Learning — LeCun, Bengio, Hinton (Nature review) | 2015 | The field's three pioneers explain the whole stack in one survey |
| Attention Is All You Need — Vaswani et al. | 2017 | Introduced the transformer and its scaled dot-product attention, softmax(QKᵀ/√d_k)V. Arguably the most consequential AI paper of the century so far |
| Language Models are Few-Shot Learners (GPT-3) — Brown et al. | 2020 | Demonstrated that scaling the weight matrices to 175B parameters lets a model pick up new tasks from a few examples in the prompt, with no fine-tuning |
| LoRA: Low-Rank Adaptation of Large Language Models — Hu et al. | 2021 | Fine-tuning via two thin matrices, W + BA. Now the default way to customize LLMs |
| Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al. | 2022 | Compute-optimal scaling laws: for a fixed compute budget, how big should the weight matrices be, and how much data should they see? The answer: scale parameters and training tokens in roughly equal proportion |
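| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao et al. | 2022 | Made exact attention fast by tiling the QKᵀ multiplication to fit the GPU memory hierarchy, so the full attention matrix is never written to slow memory. Pure matrix-blocking engineering; now widely used in training and inference stacks |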
| QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al. | 2023 | Quantizes the frozen base weights to 4-bit entries, then fine-tunes LoRA adapters on top: enough to fine-tune a 65B-parameter model on a single 48 GB GPU |
| Language Modeling Is Compression — Delétang et al. (Google DeepMind) | 2023 | Shows that prediction and compression are two views of the same thing: Chinchilla 70B, a model trained mainly on text, compresses image and audio data better than PNG or FLAC. Reframes what those weight matrices fundamentally are |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Gu & Dao | 2023 | A leading challenger to attention: a selective state-space model whose cost scales linearly with sequence length instead of quadratically |
| DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek-AI | 2024 | Introduced Multi-Head Latent Attention, which compresses the attention KV cache through a learned low-rank latent matrix: the low-rank idea behind SVD applied to inference cost. A key efficiency step behind DeepSeek's cheap frontier models |
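| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI | 2025 | Reasoning ability trained largely with reinforcement learning; the R1-Zero variant uses RL alone, with no supervised fine-tuning. The paper that set off the 2025 reasoning-model race |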
Honorable mentions: The Matrix Calculus You Need For Deep Learning by Parr and Howard — exactly what the title says, written for practitioners — and The Era of 1-bit LLMs (BitNet b1.58), which shows weight matrices restricted to entries of just −1, 0, and 1 can nearly match full-precision models, pointing at a future of radically cheaper hardware.
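Two formulas named in the table, the QKᵀ attention product and the LoRA update W + BA, are short enough to try before opening the papers. The NumPy sketch below is a minimal illustration, not either paper's reference implementation: it covers a single attention head with no masking, and the function names, shapes, and the `scale` parameter (standing in for LoRA's α/r factor) are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention, softmax(Q K^T / sqrt(d_k)) V, without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights


def lora_forward(x, W, B, A, scale=1.0):
    """Apply (W + scale * B A) to rows of x without forming the full update.

    W: (d_out, d_in) frozen weights; B: (d_out, r) and A: (r, d_in) are the
    two thin trainable matrices. `scale` is a hypothetical stand-in for alpha / r.
    """
    return x @ W.T + scale * (x @ A.T) @ B.T


# Illustrative shapes only; real models are far larger.
rng = np.random.default_rng(0)
tokens, d_in, d_out, rank = 4, 8, 6, 2

Q = rng.normal(size=(tokens, d_in))
K = rng.normal(size=(tokens, d_in))
V = rng.normal(size=(tokens, d_in))
out, weights = scaled_dot_product_attention(Q, K, V)

W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(rank, d_in))
B = np.zeros((d_out, rank))  # zero-initialised B: the adapter starts as a no-op
x = rng.normal(size=(tokens, d_in))
y = lora_forward(x, W, B, A)
```

With `B` at zero the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pretrained behaviour. The adapter here stores rank × (d_in + d_out) = 28 numbers against 48 in `W`, and the gap widens quickly as the layer grows.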
The books
For the linear algebra itself
Introduction to Linear Algebra — Gilbert Strang. The standard. Strang teaches matrices as actions on space, the same lens this site uses. Pairs with his free MIT lectures below.
Linear Algebra Done Right — Sheldon Axler. The elegant, proof-first treatment. Read this second, when you want to know why rather than how. Free online edition available from the author.
Numerical Linear Algebra — Trefethen & Bau. How matrix computation actually behaves in floating point — the bridge between textbook math and what runs on a GPU.
For the AI connection
Mathematics for Machine Learning — Deisenroth, Faisal, Ong. Free PDF from the authors. Builds exactly the linear algebra, calculus, and probability that ML requires and nothing else. The single best next step after this site.
Deep Learning — Goodfellow, Bengio, Courville. Free online. Chapter 2 is a complete linear algebra refresher; the rest shows the matrices at work in every architecture.
The courses
| Course | Format | Best for |
|---|---|---|
| Essence of Linear Algebra — 3Blue1Brown | ~16 short videos | Geometric intuition. Watch this first; it is the animated version of everything on this site |
| MIT 18.06 Linear Algebra — Gilbert Strang | Full lecture course, free | The complete university treatment, taught by the master |
| Computational Linear Algebra — fast.ai | Code-first notebooks | Doing it in Python: SVD, PCA, and matrix decompositions on real data |
A suggested path
If you are starting from this site: watch the 3Blue1Brown series (a weekend), work through Mathematics for Machine Learning chapters 2–4 and 10 (PCA), then read Attention Is All You Need with the Matrices in AI page open beside it. At that point the modern AI literature is open to you.