Teaching a small model to copy a big one

orig. “Distilling the Knowledge in a Neural Network” · Geoffrey Hinton, Oriol Vinyals, Jeff Dean

Efficient AI · Intermediate · 3 min read · Written and reviewed by Marginalia Editorial

In the margin

Area

Efficient AI: making models smaller, faster, and cheaper to run so AI can work on phones and modest hardware.

Big models are accurate but slow. This paper shows how to pour most of that skill into a small, fast model.

What's going on

The trick is to train a small student model to copy the outputs of a large teacher model: not just the right answers, but how confident the teacher is across all the options. Those soft signals show which wrong answers the teacher thinks are nearly right, and that gives the student more to learn from than the labels alone. The result is a smaller model that runs faster and costs less while keeping much of the accuracy. This is called distillation.
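To make the idea concrete, here is a minimal sketch of a distillation loss in plain Python. It is an illustration under stated assumptions, not the authors' code: the function names, the example scores, the temperature of 2.0, and the 50/50 weighting `alpha` are all made up for this example. The temperature softens both sets of probabilities so the teacher's second and third choices become visible, and the soft term is multiplied by the temperature squared, which the paper recommends when soft and hard targets are combined.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer spread."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy of predicted_probs measured against target_probs."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted mix of 'copy the teacher' and 'get the label right'.

    temperature and alpha are illustrative values, not settings from the paper.
    """
    # Soft part: match the teacher's softened confidence across all classes.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = cross_entropy(soft_teacher, soft_student)

    # Hard part: ordinary cross-entropy against the correct label (temperature 1).
    hard_loss = -math.log(softmax(student_logits)[true_label])

    # Scale the soft term by T^2 so both terms pull with comparable strength.
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss

# Hypothetical scores for three classes; class 0 is the correct answer.
teacher = [5.0, 2.0, -1.0]
good_student = [4.0, 2.5, -0.5]   # roughly agrees with the teacher
bad_student = [-1.0, 2.0, 5.0]    # ranks the classes the other way round

print(distillation_loss(good_student, teacher, true_label=0))
print(distillation_loss(bad_student, teacher, true_label=0))
```

A student whose scores line up with the teacher's gets a lower loss than one that disagrees, so training against this loss nudges the student toward the teacher's full pattern of confidence rather than only the top answer.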

Why it matters

Distillation is a big reason capable AI can run on phones and in browsers. As models get larger, shrinking them down without losing much skill becomes more valuable. It is one of the core tools for making AI practical and affordable.

Source

Geoffrey Hinton, Oriol Vinyals, Jeff Dean, Google
