Theme signal over time

Taxonomy updated May 2026

Convergence Signals Themes spiking simultaneously across research, capital, hiring, and code

Convergence History Full history heatmap — blue = active · bright blue (sqrt scale) = spike · hatched = pipeline gap · faded = not yet tracked

Researcher Profiles Pre-company radar · funding recipients · conference presenters

How it works

Methodology

Everything displayed in this dashboard is computed deterministically from raw signals. This page explains exactly what each number means, how it is calculated, and what the known limitations are — so you can make informed judgements about what the charts are and are not telling you.

1 · Data Sources

📄 arXiv

The primary signal. Papers are ingested via the arXiv API, filtered to cs.* and stat.ML categories. Each paper's title and abstract are sent to Claude Sonnet, which returns a list of theme tags. Tags are mapped to the preset taxonomy (section 2) via OVERRIDE_MAPPING (manual) or Claude classification (automatic). Paper count = distinct papers tagged with a theme that week.

Cadence: daily ingestion, weekly rollup. Holiday gaps (arXiv pauses announcements on some US holidays) appear as hatched columns in the heatmap.
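
As an illustration of the paper-count definition above, here is a minimal sketch of a weekly rollup. The row shape `(paper_id, published, theme)` and the use of ISO weeks as the weekly bucket are assumptions made for this example; the pipeline's actual schema is not described here.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

Week = Tuple[int, int]  # (ISO year, ISO week number); an assumed bucket


def weekly_paper_counts(
    rows: Iterable[Tuple[str, date, str]],
) -> Dict[Tuple[Week, str], int]:
    """Count distinct papers per (week, theme).

    Each row is a hypothetical (paper_id, published, theme) record.
    """
    seen = defaultdict(set)
    for paper_id, published, theme in rows:
        iso = published.isocalendar()
        week = (iso[0], iso[1])
        # A set makes repeated (paper, theme) rows count only once.
        seen[(week, theme)].add(paper_id)
    return {key: len(ids) for key, ids in seen.items()}


rows = [
    ("p1", date(2026, 5, 4), "tool-use"),
    ("p1", date(2026, 5, 4), "tool-use"),  # duplicate tag row, same paper
    ("p2", date(2026, 5, 6), "tool-use"),
    ("p3", date(2026, 5, 11), "tool-use"),  # following week
]
counts = weekly_paper_counts(rows)
```

Keying on the paper ID rather than the tag row is what makes the count "distinct papers": a paper that receives several raw tags under the same theme still contributes one.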

🐙 GitHub

Tracked via a curated list of AI-lab orgs and researcher accounts in config/github_targets.py. Repo signals are counted per theme based on repo topics and description keywords. GitHub count = repos with meaningful activity touching a theme in the last 30 days.

🐦 Tweets

Keyword search across tracked AI accounts. Tweet count = mentions of theme-related keywords in a given week. Noisiest signal — prone to event spikes (conference announcements, viral papers).

💼 Hiring

Job postings scraped from Greenhouse, Ashby, and Workable boards of ~40 AI labs. Claude assigns each role a theme tag. Job count = open roles mentioning a theme. Slowest-moving signal — hiring lags research trends by 6–18 months.

2 · Theme Taxonomy

The taxonomy is a closed, manually-curated three-level hierarchy: 10 parent categories → 39 level-2 themes → 38 level-3 specialist themes (77 child themes total). Nothing is added automatically.

Parent Category	Child Themes
Foundation Models	Long Context · SSMs / Mamba · Mixture of Experts
Reinforcement Learning	RLHF · RLAIF · Online RL Fine-tuning · Process Reward Models
Reasoning & Planning	Chain-of-Thought · Test-Time Compute · Formal Verification · Symbolic / Neurosymbolic
Agentic Systems	Tool Use · Multi-Agent Coordination · Memory & Retrieval · World Models · Computer Use
Generative Models	Vision-Language · Video Understanding · Audio & Speech · Text-to-Image
Efficient ML	Quantization & Compression · Speculative Decoding · Custom Silicon · LLM Infrastructure · Efficient Attention
Learning Methods	Synthetic Data · Continual Learning
Interpretability & Safety	Interpretability · Red-Teaming & Jailbreak · Alignment Theory · Watermarking & Detection
Scientific & Applied ML	Code Generation · Scientific AI · Robotics & Embodied
Evaluation & Benchmarking	(no child themes yet — parent only)

Raw tags that do not map to these 77 themes land in child_tag_classifications as child tags and are visible in the theme table's expanded child rows. Promoting a child tag to a full taxonomy theme requires a human decision: edit config/__init__.py, add an OVERRIDE_MAPPING entry, append to taxonomy_changelog.md, and re-run classify_tags.py --all --force.

Topic selection rationale

Every theme in the taxonomy was included because it represents a distinct, active research programme with its own benchmarks, venues, and hiring signals — not simply because it is popular. The bar for inclusion is that a theme must be specific enough to be trackable (i.e. a paper either is or is not about it) but broad enough to accumulate meaningful weekly counts. Themes that are too narrow become noise; themes that are too broad become meaningless aggregates. Below is the reasoning for each category and its child themes.

Foundation Models

These themes track architectural and capability choices at the base model level — the decisions that propagate downstream to every application. A shift here (dense → MoE, fixed → long-context) changes what every downstream product can do.

Long Context — Context window length is the single most trackable proxy for capability expansion. The race from 4K → 128K → 1M tokens has been one of the clearest measurable trends in the field and directly determines which tasks (document analysis, multi-file coding, long conversations) are feasible at all.
SSMs / Mamba — The first credible architectural alternative to transformers with sub-quadratic scaling. Tracked because if transformers are eventually displaced, state-space models are the leading candidate; a trend signal here has outsized strategic relevance.
Mixture of Experts — The architecture behind Mixtral and, reportedly, GPT-4, and a major source of frontier models' efficiency gains. MoE lets labs scale parameter count without a proportional increase in per-token compute; understanding MoE adoption explains how the scaling laws are being extended.
In-Context Learning — The emergent ability to adapt from examples in the prompt without weight updates. Foundational to understanding how models generalise and to evaluating few-shot versus fine-tuning tradeoffs.

Reinforcement Learning

RL is the primary mechanism for post-training — taking a capable base model and aligning it to be useful, safe, and correct. Almost every deployed frontier model has gone through some form of RL-based fine-tuning. The child themes track distinct stages of that pipeline.

RLHF — Reinforcement Learning from Human Feedback is the technique that made ChatGPT viable: human preference labels train a reward model, and the policy is then fine-tuned against that reward model, typically with PPO. Still the baseline alignment method and a prerequisite for understanding every variant that followed.
RLAIF — Replaces expensive human labellers with AI-generated feedback (Constitutional AI, self-critique). Tracked separately from RLHF because it fundamentally changes the cost structure of alignment — enabling labs to scale feedback without proportional human labour.
Online RL Fine-tuning — Running RL with live rollouts during training; widely understood to be the mechanism behind o1/o3-style reasoning models that learn to allocate thinking time before answering. This is the current frontier of post-training research.
Process Reward Models — Reward models that score intermediate reasoning steps rather than just final outputs. Essential for making chain-of-thought training tractable at scale; directly coupled to the test-time compute trend.
Model-Based RL — Learning an internal world model and planning within it rather than learning purely from environment interaction. Increasingly relevant as agents need to reason about multi-step consequences before acting.

Reasoning & Planning

Moving from pattern-matching to structured, multi-step reasoning is the defining capability challenge of the current generation. Benchmark performance on math, coding, and science correlates almost entirely with reasoning quality, making this category a proxy for overall progress.

Chain-of-Thought — The prompting technique that first demonstrated that multi-step reasoning could be elicited from large models. Prompts such as "Let's think step by step" produced large accuracy gains on arithmetic and logic benchmarks with no change to model weights, and the technique remains the baseline for everything in this category.
Test-Time Compute — Spending more compute at inference (repeated sampling, beam search, MCTS) to improve output quality. The o1/o3 paradigm; represents a fundamentally different scaling axis from training-time compute.
Neuro-Symbolic — Combining learned representations with formal rules or logical constraints. Relevant for domains where reliability and verifiability are non-negotiable (formal proofs, legal reasoning, safety-critical decisions).
Formal Verification — Proving correctness of model outputs using formal methods. Growing field at the intersection of programming languages research and AI safety; tracked because regulatory pressure will likely push high-stakes deployments toward verified outputs.

Agentic Systems

The shift from single-turn Q&A to multi-step autonomous agents is arguably the most commercially significant trend in applied AI. Every major lab is building toward agentic products. The child themes decompose the key sub-problems of agent architecture.

Tool Use — Calling external APIs, running code, searching the web. What transforms a language model from a text generator into an agent capable of real-world actions. Without tool use, agents are bounded by their training data.
Multi-Agent Coordination — Multiple specialised agents working in parallel or sequence. Enables division of labour and specialisation beyond what a single context window can handle; the architecture behind most enterprise agent frameworks.
Memory & Retrieval — How agents maintain state and access relevant information beyond their context window limit. The core infrastructure challenge for long-running agents; without it, every session starts from scratch.
World Models — Learned simulators of environment dynamics. Critical for planning agents that need to reason about consequences before acting, rather than reacting to each step in isolation.
Computer Use — GUI and browser agents that operate software as a human would. The commercial frontier of agent deployment and the basis for most enterprise automation products; tracked because it is where agent research meets immediate commercial revenue.

Generative Models

Generative capability across modalities is both a research frontier and the basis for most consumer AI products. Tracked separately from foundation models because the research communities, benchmarks, and commercial dynamics are distinct.

Vision-Language Models — Multimodal models that combine image understanding with language generation; now a standard capability in frontier models (GPT-4V, Gemini, Claude). Tracked because multimodal capability unlocks qualitatively new use cases.
Video Understanding — Temporal reasoning over video frames. Relevant for content creation, surveillance, scientific analysis, and robotics; Sora demonstrated that video generation is within reach of current architectures.
Audio & Speech — Recognition, synthesis, and generation. The interface layer for voice-first applications and a growing area of capability competition between labs and startups.
Text-to-Image / Diffusion — Stable Diffusion, DALL-E, Midjourney. The consumer AI wave that demonstrated generative AI to the public; diffusion models are also spreading into non-image domains (protein structure, video, audio).

Efficient ML

Running capable models at acceptable cost and latency is the engineering constraint that determines which applications are commercially viable. Efficiency research often lags capability research by 12–18 months but has outsized economic impact once it arrives.

Quantization & Compression — Reducing model precision (FP16 → INT8 → INT4) to fit models on consumer hardware with limited quality loss. The reason local AI deployment is possible and the key enabler of on-device inference.
Speculative Decoding — Using a small draft model to propose tokens that a larger model verifies in parallel; commonly reported to deliver 2–4× inference speedup while leaving the larger model's output distribution unchanged. Increasingly standard in production serving stacks.
Custom Silicon — TPUs, H100/H200, Groq, Cerebras, Tenstorrent. The compute substrate beneath everything else. Tracked because silicon availability and pricing constraints determine who can build what, and the competitive dynamics here are moving fast.
LLM Infrastructure — Serving frameworks, batching strategies, distributed inference, KV-cache management. The operational layer that bridges research models and production deployments at scale.
Efficient Attention — FlashAttention, linear attention, sparse attention variants. Reducing the O(n²) memory and compute bottleneck that limits context length and throughput; a prerequisite for long-context models to be practically deployable.

Learning Methods

How models are trained — not just what data they see — determines how well capabilities generalise. These themes cover training-time innovations distinct from the RL post-training category.

Synthetic Data — Using AI to generate training data at scale. Increasingly the only path to acquiring training signal in domains where human annotation is expensive, slow, or impossible. The strategic implication: labs that can generate high-quality synthetic data have a compounding data advantage.
Continual Learning — Training models on new information without catastrophic forgetting of prior knowledge. Critical for deployed models that need to stay current without full retraining from scratch.

Interpretability & Safety

As models become more capable and more widely deployed, understanding what they are doing and ensuring they remain safe becomes both a research priority and a regulatory requirement. This category tracks the technical work on both fronts.

Interpretability — Understanding which internal model features correspond to which concepts; mechanistic interpretability (circuits, sparse autoencoders, feature attribution) is the leading methodology. Tracked because interpretability is the prerequisite for principled safety work.
Red-Teaming & Jailbreaks — Finding failure modes before deployment. The attack-defence dynamic drives both safety research and model hardening; each new jailbreak technique advances the field's understanding of where models are brittle.
Alignment Theory — Constitutional AI, preference learning, value alignment — the formal frameworks for specifying what we want models to do and verifying that they do it. Includes DPO, reward modelling, and hallucination mitigation.
Watermarking & AI Detection — Techniques for marking AI-generated content and detecting it at scale. Tracked because of growing regulatory relevance (EU AI Act, US executive orders) and because detection capability shapes how AI content is treated by platforms and institutions.

Scientific & Applied ML

AI's impact outside the AI field itself — in medicine, materials science, software engineering, and physical systems — is where much of the societal value will be realised. These themes track application domains with enough dedicated research activity to warrant their own signal.

Code Generation — GitHub Copilot, Cursor, Devin, SWE-bench. Software engineering is the highest-value near-term application of AI; tracked as a leading indicator of AI's economic impact on knowledge work and as a proxy for model reasoning quality.
AI for Science — AlphaFold, drug discovery, materials science, climate modelling. High-impact applications where AI is compressing decades of experimental research into months of compute. Distinct research community with its own venues (NeurIPS AI for Science workshop, Nature Machine Intelligence).
Robotics & Embodied AI — Physical AI systems that perceive and act in the real world; Tesla Optimus, Figure, Boston Dynamics, and a growing set of academic labs. Tracked because embodied AI closes the loop from language to physical action, and commercial investment here is accelerating rapidly.

Evaluation & Benchmarking

Progress in AI is only meaningful if the measurement infrastructure keeps pace with capability. Benchmark saturation — models acing tests designed to be hard for humans — and benchmark contamination — training data leaking into test sets — make evaluation a first-class research problem rather than just a reporting exercise. This category tracks the work that determines whether any of the other trends in this dashboard are real.

3 · Labels & Tag Classification

Every paper ingested from arXiv is sent to Claude Sonnet, which reads the title and abstract and returns a list of free-form tags (e.g. lora-fine-tuning, sparse-autoencoders, pde-solving). These raw tags are the atomic unit of the pipeline — everything else (theme velocity, convergence scores, trends) is an aggregate of tag counts.

From raw tag → child tag → theme

Tag classification happens in four layers, applied in priority order:

Layer	Mechanism	Example
1 · Force exclude	Tags in `FORCE_EXCLUDE` (config) are silently dropped — never stored, never shown. Used for tags that are too generic or off-topic.	`training`, `model`, `method`
2 · Manual override	Tags in `OVERRIDE_MAPPING` are pinned to a specific child theme. Takes precedence over everything else.	`lora-fine-tuning` → `quantization-compression`
3 · Claude classification	Remaining tags are sent to Claude in batches of 20. Claude assigns each tag to one of the 77 child themes (or `other`) and assigns a confidence level.	`sparse-autoencoders` → `interpretability` (high)
4 · Unmapped / other	Tags Claude can't confidently place in any preset theme are mapped to `other` and excluded from trend charts. They remain in `child_tag_classifications` for future review.	`causal-inference`, `game-theory`
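
The priority order in the table above can be sketched as a single function. This is a minimal illustration, not the pipeline's code: the set and dict contents are taken from the table's examples, `llm_classify` is a hypothetical stand-in for the batched Claude call, and the `"manual"` confidence value and the layer names are assumptions.

```python
from typing import Callable, Optional, Tuple

# Example values copied from the table; the real lists live in config.
FORCE_EXCLUDE = {"training", "model", "method"}
OVERRIDE_MAPPING = {"lora-fine-tuning": "quantization-compression"}


def classify_tag(
    tag: str,
    llm_classify: Callable[[str], Tuple[str, str]],
) -> Optional[Tuple[str, str, str]]:
    """Return (theme, confidence, layer) for a raw tag, or None if dropped.

    `llm_classify` is a placeholder assumed to return (theme, confidence),
    with theme == "other" when no preset theme fits.
    """
    # Layer 1: force-excluded tags are dropped before anything is stored.
    if tag in FORCE_EXCLUDE:
        return None
    # Layer 2: manual overrides win over any automatic classification.
    if tag in OVERRIDE_MAPPING:
        return OVERRIDE_MAPPING[tag], "manual", "override"
    # Layer 3: everything else goes to the model.
    theme, confidence = llm_classify(tag)
    # Layer 4: tags without a home are kept as `other` for later review.
    layer = "unmapped" if theme == "other" else "claude"
    return theme, confidence, layer


def fake_llm(tag: str) -> Tuple[str, str]:
    """Toy classifier used only to exercise the example."""
    known = {"sparse-autoencoders": ("interpretability", "high")}
    return known.get(tag, ("other", "low"))
```

Checking the exclude list first means an excluded tag can never be resurrected by an override or by the model, which matches the "never stored, never shown" wording in the table.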

Confidence levels

Each Claude-classified tag is assigned one of three confidence levels, stored in child_tag_classifications.confidence:

Confidence	Meaning	Role in emerging detection
High	Tag clearly belongs to the assigned theme. Classification is stable.	Excluded from emerging candidates — already well-classified
Medium	Tag loosely fits the theme, but could plausibly belong elsewhere or merit its own theme.	Eligible for emerging promotion if other criteria met
Low	Tag was placed somewhere but the fit is weak. May be a genuinely new concept without a home.	Eligible for emerging promotion — highest priority candidates

Flagged tags

During classification, Claude can flag a tag for human review — cases where it is genuinely uncertain between two different parent categories (not just between child themes). Flagged tags are written to child_tag_classifications.flagged = true. Run python pipelines/classify_tags.py --review locally to action them.

Where labels appear in the UI

UI element	What it shows
Theme table — parent row	Aggregate of all child theme counts under that parent category
Theme table — child rows (expanded)	One row per child theme: Tweets · GitHub · Papers separately, plus a sparkline
Series panel	The 77 preset child themes only — no discovered labels
Emerging ✦ panel (sidebar)	Approved discovered labels from `label_registry` that have velocity data but no home in the preset taxonomy yet
Convergence heatmap	The 77 preset child themes only, organised by parent category
Label detail panel (click label name)	The actual papers tagged with that label, filterable by source

Label registry & discovered labels

When classify_tags.py is run with the --approve flag on a tag, it is written to label_registry as an approved discovered label. These labels have a display_name, a parent_category (which may or may not match the preset taxonomy), and optionally a description. In the dashboard they appear in the Emerging ✦ sub-panel in the series sidebar and at the bottom of the theme table; they are never mixed into the preset themes.

Preset taxonomy is closed. Approved discovered labels with parent categories that do not match one of the 10 preset parent keys are displayed under a virtual Discovered grouping — they never auto-register as new parent categories. Promotion to a full preset theme always requires a human edit to config/__init__.py and taxonomy_changelog.md.

4 · Share of Total (chart Y-axis)

The main chart plots share of total, not raw paper count:

share_of_total = paper_count_for_theme ÷ total_arXiv_submissions_that_week

This normalises for the overall growth of arXiv. Without it, every theme would trend upward simply because submission volume has grown substantially since 2022. A theme at 8% means 8 out of every 100 papers that week touched that theme.
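
A minimal sketch of the normalisation, assuming per-theme counts for one week arrive as a plain dict (the function and argument names are illustrative, not the pipeline's):

```python
from typing import Dict


def share_of_total(
    theme_counts: Dict[str, int], total_submissions: int
) -> Dict[str, float]:
    """Divide each theme's weekly paper count by that week's total."""
    if total_submissions <= 0:
        # Assumption: a week with no submissions yields zero shares
        # rather than raising a division error.
        return {theme: 0.0 for theme in theme_counts}
    return {
        theme: count / total_submissions
        for theme, count in theme_counts.items()
    }


shares = share_of_total({"tool-use": 8, "rlhf": 2}, total_submissions=100)
```

Because a paper can carry more than one theme, the shares across themes are not expected to sum to 1.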

LOESS smoothing (bandwidth 0.30, adjustable in Chart Settings) is applied before drawing the line. LOESS (locally estimated scatterplot smoothing) fits a local polynomial at each point using a weighted window of neighbouring weeks. It removes week-to-week noise while preserving genuine trend direction.
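
To make the smoothing step concrete, here is a small dependency-free sketch of a LOESS-style smoother: a tricube-weighted local linear fit over equally spaced weeks. It is an illustration under stated assumptions, not the dashboard's implementation. In particular, treating the bandwidth as the fraction of weeks in each window, using a degree-1 polynomial, and the `+ 1` in the distance scale are choices made for this example.

```python
import math
from typing import List


def loess(values: List[float], bandwidth: float = 0.30) -> List[float]:
    """Smooth equally spaced weekly values with a weighted local line."""
    n = len(values)
    if n < 3:
        return list(values)
    # Each local fit uses this many neighbouring weeks (at least 3).
    k = min(n, max(3, math.ceil(bandwidth * n)))
    smoothed = []
    for i in range(n):
        # The k nearest weeks to week i; near the edges this window
        # becomes one-sided.
        window = sorted(range(n), key=lambda j: abs(j - i))[:k]
        # Assumption: scale by max distance + 1 so every window point
        # keeps a positive weight (textbook LOESS gives the farthest
        # point zero weight).
        h = max(abs(j - i) for j in window) + 1
        sw = sx = sy = sxx = sxy = 0.0
        for j in window:
            x = j - i  # centre on week i so the intercept is the fit
            w = (1 - (abs(x) / h) ** 3) ** 3  # tricube weight
            sw += w
            sx += w * x
            sy += w * values[j]
            sxx += w * x * x
            sxy += w * x * values[j]
        # Closed-form weighted least squares for a straight line.
        slope = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
        smoothed.append((sy - slope * sx) / sw)
    return smoothed
```

A local line reproduces a perfectly linear series exactly, while an alternating series is pulled toward its local average. A smaller bandwidth shrinks the window and lets more week-to-week variation through.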

Known limitation — right-edge bias. LOESS tends to flatten or dip at the rightmost weeks of the data window because the smoothing window is one-sided at the edge, so each fit there leans on older weeks. For a rising theme, the most recent 4–6 weeks will usually look lower than they actually are. This is a smoothing artefact, not a signal of decline. Reduce the smoothing bandwidth or disable LOESS in Chart Settings to see raw values.

Known limitation — partial-week data. The pipeline runs on a schedule. The current week's counts are always incomplete until the end-of-week run completes. The most recent week will always appear lower than it will end up.

5 · Convergence Score

A theme "converges" when multiple independent data sources spike simultaneously — suggesting a genuine breakthrough rather than a single noisy signal. The convergence score (0–4 scale) is pre-computed weekly in pipelines/convergence.py and stored in convergence_signals.

Spike detection

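Each source maintains a 12-week rolling baseline (mean and standard deviation) per theme. A spike is declared when the current value exceeds the baseline mean by the source's threshold, provided the minimum number of baseline weeks is available: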

Source	Weight	Spike threshold	Min baseline weeks
📄 arXiv papers	2.0	≥ 1.0 σ above baseline	6
🐙 GitHub repos	1.0	≥ 1.0 σ above baseline	6
🐦 Tweets	0.75	≥ 0.75 σ above baseline	4
💼 Hiring	0.5	≥ 0.5 σ above baseline	4
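
The spike rule in the table can be sketched as follows. The weights, thresholds, and minimum baseline weeks are copied from the table; the source keys, the function name, the use of the sample standard deviation, and the handling of a zero-variance baseline are assumptions for this example.

```python
from statistics import mean, stdev
from typing import List, Optional

SOURCES = {
    "arxiv": {"weight": 2.0, "threshold": 1.0, "min_weeks": 6},
    "github": {"weight": 1.0, "threshold": 1.0, "min_weeks": 6},
    "tweets": {"weight": 0.75, "threshold": 0.75, "min_weeks": 4},
    "hiring": {"weight": 0.5, "threshold": 0.5, "min_weeks": 4},
}
BASELINE_WEEKS = 12


def spike_magnitude(
    source: str, history: List[float], current: float
) -> Optional[float]:
    """Return the spike size in sigmas, or None if there is no spike.

    `history` holds prior weekly values, oldest first, excluding `current`.
    """
    cfg = SOURCES[source]
    baseline = history[-BASELINE_WEEKS:]
    if len(baseline) < cfg["min_weeks"]:
        return None  # not enough history to trust the baseline
    sigma = stdev(baseline)  # assumption: sample standard deviation
    if sigma == 0:
        return None  # assumption: a flat baseline cannot define a spike
    z = (current - mean(baseline)) / sigma
    return z if z >= cfg["threshold"] else None


history = [10, 12, 9, 11, 10, 12, 9, 11]
big = spike_magnitude("arxiv", history, 15)
small = spike_magnitude("arxiv", history, 11)
```

Returning the magnitude rather than a boolean keeps the value available for the weighted score described in the next subsection.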

Score formula

raw_score = Σ (weight × spike_magnitude) for each spiking source
convergence_score = min(raw_score, 4.0)

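Where spike_magnitude = (current − baseline_mean) / baseline_std. The score is capped at 4.0, not rescaled: any week whose weighted sum reaches 4.0 or more scores 4.0. That can mean all four sources spiking together, but a single arXiv spike 2σ above baseline (2.0 × 2.0) also reaches the cap. The cap keeps scores comparable across all historical weeks.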
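
A short sketch of the score formula, assuming spike magnitudes have already been computed per source and that non-spiking sources are simply absent from the input. The `WEIGHTS` dict and its keys are illustrative names; the numbers come from the spike-detection table.

```python
from typing import Dict

WEIGHTS = {"arxiv": 2.0, "github": 1.0, "tweets": 0.75, "hiring": 0.5}
MAX_SCORE = 4.0


def convergence_score(spikes: Dict[str, float]) -> float:
    """Weighted sum of spike magnitudes for spiking sources, capped at 4.

    `spikes` maps a source name to its magnitude in sigmas.
    """
    raw_score = sum(WEIGHTS[source] * mag for source, mag in spikes.items())
    return min(raw_score, MAX_SCORE)


one_source = convergence_score({"arxiv": 2.0})
two_small = convergence_score({"tweets": 1.0, "hiring": 1.0})
capped = convergence_score({"arxiv": 3.0, "github": 2.0})
```

Because each weight is multiplied by a magnitude, the raw sum has no fixed upper bound, which is why the `min` is needed to keep the result on the 0–4 scale.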

Signal strength labels

Score range	Label	Meaning
≥ 3.0	Strong	3–4 sources spiking at high magnitude
≥ 2.0	Moderate	2–3 sources spiking, or 1 source at very high magnitude
≥ 0.5	Weak	Single-source spike or early signal
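
The label thresholds above map onto a simple lookup. In this sketch, returning `None` for scores below 0.5 is an assumption, since the table does not name that range.

```python
from typing import Optional


def signal_label(score: float) -> Optional[str]:
    """Map a convergence score to its signal-strength label."""
    if score >= 3.0:
        return "Strong"
    if score >= 2.0:
        return "Moderate"
    if score >= 0.5:
        return "Weak"
    return None  # assumption: scores under 0.5 carry no label
```

Checking the highest threshold first keeps the ranges non-overlapping without writing explicit upper bounds.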

Thresholds are intentionally low. arxiv=1.0σ and tweets=0.75σ were calibrated to produce ~6–8 signals/week on average over 181 weeks of history. Raising them would suppress early signals; lowering them further would create noise.

6 · Convergence Heatmap

The heatmap visualises the full history of convergence signals for all preset themes simultaneously. Each cell is one theme × one week.

Cell state	Meaning
Dark / empty	No papers for this theme that week (theme not active)
Faint blue	Papers exist but no convergence spike this week
Bright blue	Convergence spike — opacity = √(score ÷ 4), so brighter = higher score
Hatched	Pipeline gap — arXiv holiday or missed pipeline run that week
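
The square-root opacity rule for spike cells can be written as a one-line function. Clamping the input to the 0–4 range is an assumption added so the example never returns an opacity outside 0–1.

```python
import math


def cell_opacity(score: float) -> float:
    """Opacity for a spike cell: sqrt(score / 4), in the range 0 to 1."""
    clamped = max(0.0, min(score, 4.0))
    return math.sqrt(clamped / 4.0)
```

The square root lifts low scores so they stay visible: a score of 1.0 renders at 0.5 opacity rather than the 0.25 a linear scale would give.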

Rows are ordered by all-time paper count (highest-volume themes first within each category). The Fit all mode auto-sizes cells to your viewport width so the entire history fits without scrolling. Zoom in shows the top 20 themes at a larger cell size.

7 · Emerging Category Detection

The pipeline continuously scans child_tag_velocity for tags that may warrant promotion to the taxonomy. A tag is flagged as an emergence candidate only if it meets all five of the following criteria:

Volume ≥ 30 papers total — filters out one-off tags
≥ 4 consecutive weeks of growth — sustained, not a single spike
Confidence is medium or low — not already confidently classified into an existing theme
First seen within 18 months — genuinely new, not a slowly-growing legacy tag
Not already top-3 child of its parent — must be truly breakout, not already dominant
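
The five criteria above can be combined into one predicate. This is a sketch under assumptions: the `TagStats` record and its field names are hypothetical, "4 consecutive weeks of growth" is read as four week-over-week increases in a row anywhere in the history, and the 18-month window is measured in whole calendar months.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class TagStats:
    """Hypothetical per-tag summary; not the pipeline's actual schema."""
    weekly_counts: List[int]  # papers per week, oldest first
    confidence: str           # "high", "medium" or "low"
    first_seen: date
    rank_in_parent: int       # 1 = largest child tag under its parent


def longest_growth_streak(counts: List[int]) -> int:
    """Length of the longest run of week-over-week increases."""
    best = run = 0
    for prev, cur in zip(counts, counts[1:]):
        run = run + 1 if cur > prev else 0
        best = max(best, run)
    return best


def is_emergence_candidate(stats: TagStats, today: date) -> bool:
    """True only if all five criteria hold."""
    months = (today.year - stats.first_seen.year) * 12 + (
        today.month - stats.first_seen.month
    )
    return (
        sum(stats.weekly_counts) >= 30                        # volume
        and longest_growth_streak(stats.weekly_counts) >= 4   # sustained
        and stats.confidence in {"medium", "low"}             # not settled
        and months <= 18                                      # recent
        and stats.rank_in_parent > 3                          # not top-3
    )
```

The predicate only reports candidates and changes no state, in keeping with the read-only behaviour described for this detection step.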

This detection is read-only and never auto-promotes. Run python pipelines/classify_tags.py --emerging locally to see current candidates. Promotion to the taxonomy is a human decision requiring edits to config/__init__.py, OVERRIDE_MAPPING, and taxonomy_changelog.md.

8 · Why the chart may look different after a taxonomy change

Three things affect how the chart looks after the taxonomy enforcement was tightened (May 2026). Only the first is a result of that change; the other two are always present:

Active series reset. Previously, approved discovered labels (e.g. Sparse Autoencoders, Federated Learning) could appear in the series panel and be selected as active chart lines. Some of those had rising trends. After enforcement, only the 77 preset themes are charted — different set of lines, different trajectories.
LOESS right-edge bias. The most recent 4–6 weeks of a time series tend to flatten or dip under LOESS smoothing. This is always present and is not a result of any data change. Disable LOESS in Chart Settings to see raw weekly values.
Partial current-week data. The most recent week's counts are incomplete until the pipeline's end-of-week run. The last data point is always understated.

The underlying share_of_total values stored in the database are not modified by taxonomy enforcement — only which themes are displayed changes.
