The model that taught computers to read both directions at once

orig. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” · Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Large Language Models · Intermediate · 4 min read · Written and reviewed by Marginalia Editorial

In the margin

Area

Large Language Models: models trained on huge amounts of text that can read, write, summarise, and reason in natural language.

For a few years, this one model quietly powered a huge chunk of Google Search and a wave of language tools.

What's going on

Most earlier language models read text in one direction, usually left to right. BERT reads the whole sentence at once, using the words on both sides of each position, which helps it understand how words depend on each other. It learns mainly by playing fill-in-the-blank on huge amounts of text: hide some of the words (about 15% in the paper) and guess each one from the context around it. After that general training, it can be quickly adapted to specific jobs like answering questions or judging sentiment.
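To make the fill-in-the-blank idea concrete, here is a minimal sketch of how training examples for that game might be built. It is an illustration, not the paper's code: the function name `mask_tokens`, the whitespace tokenization, and the fallback that guarantees at least one blank are all assumptions made for this example. It also simplifies the paper's procedure, which sometimes swaps a chosen word for a random word or leaves it unchanged instead of always hiding it.

```python
import random

MASK = "[MASK]"  # placeholder symbol that stands in for a hidden word

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns the masked sequence and a dict mapping each hidden
    position to the original word the model would have to guess.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            masked[i] = MASK
    if not targets:
        # Assumption for this sketch: always hide at least one word
        # so a short sentence still produces a training example.
        i = rng.randrange(len(tokens))
        targets[i] = tokens[i]
        masked[i] = MASK
    return masked, targets

tokens = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(tokens, seed=1)
print(" ".join(masked))  # the sentence with blanks
print(targets)           # position -> hidden word
```

The model sees only the masked sentence and is scored on how well it recovers the words in `targets`. Because a blank can sit anywhere in the sentence, the model has to use the words before and after it, which is where the "both directions" idea comes from.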

Why it matters

BERT made the now-standard recipe popular: train one big model on lots of text, then fine-tune it for each task. That recipe is behind most modern language tools. It also went straight into real products, including search, soon after release.

Source

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Google AI Language


