Datasets · Open data, built by the community

Everything we make is openly licensed and documented. These datasets are a by-product of reading and annotating papers together, and they are free to download, study, and build on.

Text / NLP · CC BY 4.0

Plain-language paper explainers

2 entries

Original arXiv titles paired with a plain-language explainer at three depths (ELI15 · Student · Researcher), plus difficulty and AI-for-good labels.

How it's made: Drafted by Workers AI from the abstract, then reviewed and corrected by a human editor before publishing.

Browse the explainers · Hugging Face mirror (in progress)
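The record described above (an arXiv title, explainers at three depths, and difficulty and AI-for-good labels) could be modelled roughly as follows. This is a minimal sketch, not the published schema: the field names, the depth keys, the label types, and the JSON Lines output are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict

# Assumed keys for the three depths named on this page (ELI15, Student, Researcher).
DEPTHS = ("eli15", "student", "researcher")


@dataclass
class ExplainerRecord:
    """Hypothetical shape of one explainer entry; the real schema may differ."""

    arxiv_title: str
    explainers: Dict[str, str]  # one plain-language text per depth
    difficulty: str             # assumed to be a free-form label
    ai_for_good: bool           # assumed to be a yes/no label
    license: str = "CC BY 4.0"

    def __post_init__(self) -> None:
        # Reject records that lack any of the three depths.
        missing = [d for d in DEPTHS if d not in self.explainers]
        if missing:
            raise ValueError(f"missing explainer depths: {missing}")


def to_jsonl_line(record: ExplainerRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

A record built with placeholder text, such as `ExplainerRecord("<title>", {"eli15": "...", "student": "...", "researcher": "..."}, "<difficulty>", False)`, passes validation and serializes to one line. Checking the depths at construction time keeps incomplete drafts out of an export.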

Annotations · CC BY 4.0

Community margin notes

7 approved notes

Plain-language annotations readers left on specific sentences, anchored to the paper they explain. The raw material of a 'how people explain research' dataset.

How it's made: Written by members, each one read and approved by an editor before it goes live.

See annotated papers · Hugging Face mirror (in progress)

Taxonomy · CC BY 4.0

AI research direction map

19 directions

A curated, hierarchical taxonomy of AI research directions with plain-language descriptions and momentum (recent-activity) scores.

How it's made: Hand-curated by editors and kept fresh by the nightly ingest job.

Open the map · Hugging Face mirror (in progress)
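A hierarchical taxonomy with per-direction momentum scores, as described above, can be pictured as a small tree. The sketch below is illustrative only: the `Direction` class, its field names, the numeric momentum scale, and both helper functions are assumptions, not the format the map is actually stored in.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Direction:
    """Hypothetical taxonomy node; the real map may use different fields."""

    name: str
    description: str   # plain-language description
    momentum: float    # recent-activity score; the scale is an assumption
    children: List["Direction"] = field(default_factory=list)


def walk(
    node: Direction, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], Direction]]:
    """Yield (path, node) pairs depth-first, parents before children."""
    here = path + (node.name,)
    yield here, node
    for child in node.children:
        yield from walk(child, here)


def top_by_momentum(root: Direction, n: int = 3) -> List[Direction]:
    """Return the n nodes with the highest momentum anywhere in the tree."""
    nodes = [node for _, node in walk(root)]
    return sorted(nodes, key=lambda d: d.momentum, reverse=True)[:n]
```

Keeping the full path alongside each node makes it easy to flatten the tree into rows (for example, one row per direction with a "parent / child" label) without losing the hierarchy.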

Text / NLP · CC BY 4.0

Plain-language AI glossary

8 terms

AI terms with short, jargon-free definitions: a human-approved glossary that grows as new papers are explained.

How it's made: Suggested by the pipeline and by members, approved by an editor.

Read the glossary · Hugging Face mirror (in progress)

Datasheet · how we document our data

We follow the "Datasheets for Datasets" standard so anyone can judge whether a dataset is right for their work.

Why was it collected? To make AI research legible to newcomers and to study the gap between academic and plain language.
How was it collected? Papers are sourced from arXiv and Hugging Face Daily Papers; explainers are AI-drafted and human-reviewed; notes are member-written and editor-approved.
How is quality assured? Every published item passes human editorial review. We never present unreviewed AI output as verified.
Who are the contributors? Marginalia members and editors. Contributors are credited; no personal data beyond a chosen display name is included.
What are the limits? The collection skews toward beginner-friendly and social-good work, and toward recent papers. It is not a representative sample of all AI research.
How can it be used? Freely, under CC BY 4.0, with attribution to Marginalia. Read more in how we work.

Want to help build these? Contribute a note or a label →