Go deeper

Papers, Books & Courses

A short, opinionated list. Everything here is either free or a recognized standard — no filler.

The landmark papers

Each of these changed the field, and each is readable once you know the operations on this site. Listed in the order the story unfolded.

Paper | Year | Why it matters
ImageNet Classification with Deep CNNs (AlexNet), by Krizhevsky, Sutskever, Hinton | 2012 | The result that started the deep learning era: image matrices + GPU matrix multiplication beat everything else
Efficient Estimation of Word Representations (word2vec), by Mikolov et al. | 2013 | Words as vectors; meaning as geometry. The origin of "king − man + woman ≈ queen"
Deep Learning, by LeCun, Bengio, Hinton (Nature review) | 2015 | The field's three pioneers explain the whole stack in one survey
Attention Is All You Need, by Vaswani et al. | 2017 | Introduced the transformer and the QKᵀ attention formula. The most consequential AI paper of the century so far
Language Models are Few-Shot Learners (GPT-3), by Brown et al. | 2020 | Demonstrated that scaling weight matrices to 175B parameters produces qualitatively new abilities
LoRA: Low-Rank Adaptation of Large Language Models, by Hu et al. | 2021 | Fine-tuning via two thin matrices, W + BA. Now the default way to customize LLMs
Training Compute-Optimal Large Language Models (Chinchilla), by Hoffmann et al. | 2022 | The scaling laws: for a fixed compute budget, how big should the weight matrices be versus how much data should they see
FlashAttention, by Dao et al. | 2022 | Made attention fast by tiling the QKᵀ multiplication to fit the GPU memory hierarchy. Pure matrix-blocking engineering; now in most training and inference stacks
QLoRA: Efficient Finetuning of Quantized LLMs, by Dettmers et al. | 2023 | Squeezes weight matrices to 4-bit entries, then fine-tunes with LoRA on top: enough to fine-tune a 65B-parameter model on a single 48 GB GPU
Language Modeling Is Compression, by Delétang et al. (Google DeepMind) | 2023 | Shows that prediction and compression are two views of the same thing: a 70B-parameter language model (Chinchilla), used as a compressor, beats PNG on image patches and FLAC on audio. Reframes what those weight matrices fundamentally are
Mamba: Linear-Time Sequence Modeling, by Gu & Dao | 2023 | The leading challenger to attention: structured state-space matrices that scale linearly with sequence length instead of quadratically
DeepSeek-V2: Multi-Head Latent Attention, by DeepSeek-AI | 2024 | Compresses the attention KV cache through a low-rank latent matrix: the SVD idea applied to inference cost. The efficiency breakthrough behind DeepSeek's cheap frontier models
DeepSeek-R1: Incentivizing Reasoning via RL, by DeepSeek-AI | 2025 | Reasoning ability trained chiefly with reinforcement learning (the R1-Zero variant uses RL alone). The paper that set off the 2025 reasoning-model race
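
Two of the formulas named in the table above, the QKᵀ attention formula and LoRA's W + BA update, fit in a few lines of NumPy. This is a minimal sketch for orientation, not code from either paper: the function names, shapes, and random inputs are illustrative assumptions, and it covers a single attention head with no masking or batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def lora_forward(x, W, B, A):
    # LoRA: frozen weight W plus a low-rank update B @ A.
    # Computing (x @ A.T) @ B.T never forms the full d_out x d_in update.
    return x @ W.T + (x @ A.T) @ B.T

rng = np.random.default_rng(0)

# Attention on 4 tokens with 8-dimensional queries, keys, and values.
n, d = 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)

# LoRA with rank r = 2 on a 6 x 8 weight matrix (illustrative sizes).
d_in, d_out, r = 8, 6, 2
W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
B = np.zeros((d_out, r))                 # B starts at zero, so the update starts at zero
A = rng.standard_normal((r, d_in))
x = rng.standard_normal((n, d_in))
y = lora_forward(x, W, B, A)
```

With B at zero the adapted layer reproduces the original one exactly, which is why fine-tuning can start from the pretrained behavior. Only B and A, with r(d_in + d_out) entries between them, would be trained.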

Honorable mentions: The Matrix Calculus You Need For Deep Learning by Parr and Howard, which is exactly what the title says, written for practitioners. Also The Era of 1-bit LLMs (BitNet b1.58), which shows that weight matrices restricted to entries of just −1, 0, and 1 can nearly match full-precision models, pointing at a future of radically cheaper hardware.
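
To make the −1, 0, 1 restriction concrete, here is a hedged sketch of ternary rounding: scale a matrix by its mean absolute value, then round and clip each entry. It illustrates the rounding step only. The function name and the random test matrix are assumptions for illustration, and rounding an existing matrix after the fact, as done here, is not the same as training a model under that constraint.

```python
import numpy as np

def ternary_quantize(W, eps=1e-8):
    # Scale by the mean absolute value, then round and clip to {-1, 0, 1}.
    gamma = np.abs(W).mean()
    W_q = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_q, gamma

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))      # stand-in for a real weight matrix

W_q, gamma = ternary_quantize(W)
W_approx = gamma * W_q               # rescaled ternary approximation of W
```

Multiplying by a matrix like W_q needs only additions and subtractions, which is where the hardware savings would come from.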

The books

For the linear algebra itself

Introduction to Linear Algebra — Gilbert Strang. The standard. Strang teaches matrices as actions on space, the same lens this site uses. Pairs with his free MIT lectures below.

Linear Algebra Done Right — Sheldon Axler. The elegant, proof-first treatment. Read this second, when you want to know why rather than how. Free online edition available from the author.

Numerical Linear Algebra — Trefethen & Bau. How matrix computation actually behaves in floating point — the bridge between textbook math and what runs on a GPU.

For the AI connection

Mathematics for Machine Learning — Deisenroth, Faisal, Ong. Free PDF from the authors. Builds exactly the linear algebra, calculus, and probability that ML requires and nothing else. The single best next step after this site.

Deep Learning — Goodfellow, Bengio, Courville. Free online. Chapter 2 is a complete linear algebra refresher; the rest shows the matrices at work in every architecture.

The courses

Course | Format | Best for
Essence of Linear Algebra, by 3Blue1Brown | ~16 short videos | Geometric intuition. Watch this first; it is the animated version of everything on this site
MIT 18.06 Linear Algebra, by Gilbert Strang | Full lecture course, free | The complete university treatment, taught by the master
Computational Linear Algebra, by fast.ai | Code-first notebooks | Doing it in Python: SVD, PCA, and matrix decompositions on real data
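
As a taste of the code-first style, PCA can be written directly on top of the SVD in NumPy. This is a small sketch under stated assumptions, not material from any of the courses above: the function name pca_svd and the synthetic data are illustrative.

```python
import numpy as np

def pca_svd(X, k):
    # Center each feature (column) so the SVD captures variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Thin SVD: X_centered = U @ diag(S) @ Vt; rows of Vt are principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    # Coordinates of each sample along the top-k principal directions.
    projected = X_centered @ components.T
    # Variance along each direction, from the singular values.
    explained_variance = S[:k] ** 2 / (X.shape[0] - 1)
    return projected, components, explained_variance

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 5 correlated features.
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))

Z, components, var = pca_svd(X, k=2)
```

The rows of components are orthonormal, and the sample variance of each column of Z equals the corresponding entry of var, which is a quick way to check the decomposition.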

A suggested path

If you are starting from this site: watch the 3Blue1Brown series (a weekend), work through Mathematics for Machine Learning chapters 2–4 and 10 (PCA), then read Attention Is All You Need with the Matrices in AI page open beside it. At that point the modern AI literature is open to you.
