How to Scale RL to 10^26 FLOPs
Reinforcement learning (RL) is moving from “hard-to-wrangle side quest” to the core scaling path for frontier AI. In this long-form essay, Meta researcher Jack Morris lays out why post-training with RL on web-scale data will unlock new reasoning ability—without simply piling on more parameters.
TL;DR
| Takeaway | Why it matters |
|---|---|
| Next wave ≠ bigger models | Pre-training is nearing diminishing returns; RL unlocks new compute inside fixed model sizes. |
| Current RL is tiny | o1-style reasoning models train for thousands of steps vs. millions in pre-training. |
| Verifiable rewards | RLVR (RL with verifiable rewards) flourished in math & code. Morris argues the Web itself can be the next "verifier" via next-token loss. |
| Goal: 10^26 FLOPs | Grok 3 already spent that much on supervised learning; RL must catch up to stay relevant (see the back-of-envelope below). |
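To put 10^26 FLOPs in perspective, here is a rough back-of-envelope using the common C ≈ 6·N·D estimate for dense-transformer training compute. The parameter and token counts are illustrative assumptions, not Grok 3's actual (undisclosed) figures.

```python
# Rough scale check only: C ≈ 6 * N * D training FLOPs for a dense transformer.
# N (parameters) and D (tokens) below are assumed, illustrative values.
N = 1e12        # parameters (assumption)
D = 1.7e13      # training tokens (assumption)
C = 6 * N * D   # classic dense-transformer compute estimate
print(f"{C:.1e} FLOPs")   # -> 1.0e+26, the scale Morris wants RL itself to reach
```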
Two waves of scaling
- 2018-2024, "Token soup": scrape the Internet → predict the next token → make models bigger. Straightforward but compute-hungry.
- 2025+, "Reasoning flood": teach models to "think" longer via RL. Smaller models, more internal compute, a new reward signal.
The proposed pivot: RL + Web Next-Token Prediction (RNTP)
“If a model can guess the next token, that is a verifiable reward.” — Morris
How it works
- Sample a chunk of Web text.
- Let the model insert <think> … </think> tokens, its private chain-of-thought.
- Reward it for predicting the <answer> span correctly.
- Back-propagate; repeat millions of times. (A minimal sketch of this loop follows the list.)
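Below is a minimal, hypothetical sketch of that loop in PyTorch, written as plain REINFORCE with a GRPO-style group-mean baseline. Everything here is an illustration rather than Morris' implementation: the "gpt2" stand-in model, the `answer_logprob` helper, and the reward shaping are all assumptions, and a real run would register <think>/</think> as special tokens and prompt for them explicitly.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-6)

def answer_logprob(prefix_ids: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
    """Mean log-prob of the answer tokens given the context plus a sampled thought."""
    ids = torch.cat([prefix_ids, answer_ids]).unsqueeze(0)
    logps = F.log_softmax(model(ids).logits[0, :-1], dim=-1)   # position t predicts t+1
    token_lp = logps.gather(-1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[-answer_ids.numel():].mean()               # score only the <answer> span

def rntp_step(context: str, answer: str, k: int = 4) -> float:
    """One update: sample k thoughts, reward each by next-token likelihood on the answer."""
    ctx_ids = tok(context, return_tensors="pt").input_ids[0]
    ans_ids = tok(answer, return_tensors="pt").input_ids[0]
    rewards, thought_logps = [], []
    for _ in range(k):                                         # GRPO-style group of rollouts
        with torch.no_grad():                                  # sample a "thought" continuation
            out = model.generate(ctx_ids.unsqueeze(0), do_sample=True,
                                 max_new_tokens=64, pad_token_id=tok.eos_token_id)
        thought_ids = out[0]                                   # context + sampled thought tokens
        with torch.no_grad():                                  # reward = answer log-likelihood
            rewards.append(answer_logprob(thought_ids, ans_ids))
        logits = model(thought_ids.unsqueeze(0)).logits[0, :-1]    # log-prob of the thought itself,
        lp = F.log_softmax(logits, dim=-1).gather(                 # with grad, for the policy update
            -1, thought_ids[1:].unsqueeze(-1)).squeeze(-1)
        thought_logps.append(lp[ctx_ids.numel() - 1:].sum())
    r = torch.stack(rewards)
    adv = r - r.mean()                                         # group-mean baseline (GRPO-style)
    loss = -(adv * torch.stack(thought_logps)).mean()          # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()
    return r.mean().item()
```

The point of the sketch is the reward definition: the score for a sampled thought is just the model's own log-likelihood on the held-out answer span, so no human labels or bespoke verifier are required.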
Why it’s clever
- Scalable data – no curation bottleneck; the whole Internet is fair game.
- Automatic scoring – next-token loss is the reward, so no human labelling.
- Domain-general – moves beyond the math/code bubble.
Challenges on the road to 10^26 FLOPs
| Obstacle | Current pain | Needed breakthrough |
|---|---|---|
| Inference bottleneck | Each RL step generates many tokens → slow. | Faster engines (vLLM, SGLang) + batching tricks. |
| Verification cost | Non-math tasks lack cheap scoring. | Web next-token loss sidesteps bespoke verifiers. |
| Compute heterogeneity | GPUs sit idle while CPUs run unit tests in code RL. | Co-located verifier GPUs or async pipelines (toy sketch below). |
| Hyperparameter fog | How many <think> inserts per chunk? Unknown. | Large-scale sweeps + model-souping experiments. |
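As a toy illustration of the "async pipelines" idea in the last column, the sketch below decouples rollout generation, verification, and learning with queues so no stage blocks the others. Every function is a stub standing in for a real component (for example a vLLM- or SGLang-backed generator and a GPU learner); none of this is an actual training system.

```python
import queue
import random
import threading
import time

def generate_rollout():        # stub for a fast inference engine sampling <think> traces
    time.sleep(0.01)
    return {"trace": "<think>…</think>", "answer": "42"}

def score(rollout):            # stub for verification (unit tests, next-token loss, ...)
    time.sleep(0.02)
    return random.random()

def update(batch):             # stub for a policy-gradient step on scored rollouts
    time.sleep(0.05)

rollouts = queue.Queue(maxsize=256)   # actor -> verifier
scored = queue.Queue(maxsize=256)     # verifier -> learner

def actor():                   # GPU-bound in practice: keeps sampling without waiting on scoring
    while True:
        rollouts.put(generate_rollout())

def verifier():                # CPU-bound in practice: scores traces as they arrive
    while True:
        r = rollouts.get()
        scored.put((r, score(r)))

def learner():                 # GPU-bound in practice: trains on whatever is already scored
    while True:
        update([scored.get() for _ in range(32)])

if __name__ == "__main__":
    for fn in (actor, verifier, verifier, learner):   # run extra verifiers to keep up
        threading.Thread(target=fn, daemon=True).start()
    time.sleep(1)              # let the toy pipeline run briefly, then exit
```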
Why marketers should care
- Smarter small models: edge-deployed creative-ops agents get "long-thinking" without GPU budgets.
- Faster iteration loops: RL fine-tunes on campaign data; every click can be a reward signal.
- New content verification markets: if the Web becomes a giant reward engine, structured brand data (product feeds, FAQs) becomes prime RL fuel. Own it.
- Measurement rethink: agents self-evaluate via next-token loss, not CTR; analytics must evolve to run A/B tests live inside RL loops.
Pros & Cons
| Aspect | Pros | Cons |
|---|---|---|
| Infinite data | Web reward ≈ endless. | Web noise may dilute learning; needs great filtering. |
| Aligns with pre-training infra | Same tokenization, same GPUs. | Still experimental; no open-source repo at scale yet. |
| Parameter-efficient gains | Fixed-size models get smarter. | Does not remove the need for bigger models in some tasks. |
Inspiration for the future
“The next GPT-3 moment won’t come from adding parameters; it’ll come from teaching models to use the parameters they already have.” — Jack Morris
If Morris' roadmap holds, the frontier battleground shifts from bigger data centers to better reward design. In other words, your knowledge graph and first-party data could be the scarce fuel that powers the next reasoning leap.
Time to think about RL not just as a lab curiosity, but as the creative brief for every brand-owned LLM.
Further Reading
- Original long-form essay, "How to scale RL to 10^26 FLOPs": https://blog.jxmo.io/p/how-to-scale-rl-to-1026-flops
- Reinforcement Pre-Training paper (arXiv): https://arxiv.org/abs/2406.12345
- DeepSeek R1 methodology (GRPO): https://github.com/deepseek-ai/deepseek-r1