How to Scale RL to 10^26 FLOPs
Reinforcement learning (RL) is moving from “hard-to-wrangle side quest” to the core scaling path for frontier AI. In this long-form essay, Meta researcher Jack Morris lays out why post-training with RL on web-scale data will unlock new reasoning ability—without simply piling on more parameters.
TL;DR
| Takeaway | Why it matters |
|---|---|
| Next wave ≠ bigger models | Pre-training is nearing diminishing returns; RL unlocks new compute inside fixed model sizes. |
| Current RL is tiny | o1-style reasoning models train for thousands of steps vs. millions in pre-training. |
| Verifiable rewards | RLVR (RL with verifiable rewards) flourished in math & code. Morris argues the Web itself can be the next "verifier" via next-token loss. |
| Goal: 10^26 FLOPs | Grok 3 already spent that much on supervised learning; RL must catch up to stay relevant (see the back-of-envelope below). |
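To put 10^26 FLOPs in perspective, here is a rough back-of-envelope using the common C ≈ 6·N·D estimate for dense-transformer training compute. The parameter and token counts are illustrative assumptions, not Grok 3's actual (undisclosed) figures.

```python
# Rough scale check only: C ≈ 6 * N * D training FLOPs for a dense transformer.
# N (parameters) and D (tokens) below are assumed, illustrative values.
N = 1e12        # parameters (assumption)
D = 1.7e13      # training tokens (assumption)
C = 6 * N * D   # classic dense-transformer compute estimate
print(f"{C:.1e} FLOPs")   # -> 1.0e+26, the scale Morris wants RL itself to reach
```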
Two waves of scaling
- 2018-2024, "Token soup": scrape the Internet → predict the next token → make models bigger. Straightforward but compute-hungry.
- 2025+, "Reasoning flood": teach models to "think" longer via RL. Smaller models, more internal compute, a new reward signal.
The proposed pivot: RL + Web Next-Token Prediction (RNTP)
“If a model can guess the next token, that is a verifiable reward.” — Morris
How it works
- Sample a chunk of Web text.
- Let the model insert <think> … </think> tokens, its private chain-of-thought.
- Reward it for predicting the <answer> span correctly.
- Back-propagate; repeat millions of times. (A minimal sketch of this loop follows the list.)
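Below is a minimal, hypothetical sketch of that loop in PyTorch, written as plain REINFORCE with a GRPO-style group-mean baseline. Everything here is an illustration rather than Morris' implementation: the "gpt2" stand-in model, the `answer_logprob` helper, and the reward shaping are all assumptions, and a real run would register <think>/</think> as special tokens and prompt for them explicitly.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-6)

def answer_logprob(prefix_ids: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
    """Mean log-prob of the answer tokens given the context plus a sampled thought."""
    ids = torch.cat([prefix_ids, answer_ids]).unsqueeze(0)
    logps = F.log_softmax(model(ids).logits[0, :-1], dim=-1)   # position t predicts t+1
    token_lp = logps.gather(-1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[-answer_ids.numel():].mean()               # score only the <answer> span

def rntp_step(context: str, answer: str, k: int = 4) -> float:
    """One update: sample k thoughts, reward each by next-token likelihood on the answer."""
    ctx_ids = tok(context, return_tensors="pt").input_ids[0]
    ans_ids = tok(answer, return_tensors="pt").input_ids[0]
    rewards, thought_logps = [], []
    for _ in range(k):                                         # GRPO-style group of rollouts
        with torch.no_grad():                                  # sample a "thought" continuation
            out = model.generate(ctx_ids.unsqueeze(0), do_sample=True,
                                 max_new_tokens=64, pad_token_id=tok.eos_token_id)
        thought_ids = out[0]                                   # context + sampled thought tokens
        with torch.no_grad():                                  # reward = answer log-likelihood
            rewards.append(answer_logprob(thought_ids, ans_ids))
        logits = model(thought_ids.unsqueeze(0)).logits[0, :-1]    # log-prob of the thought itself,
        lp = F.log_softmax(logits, dim=-1).gather(                 # with grad, for the policy update
            -1, thought_ids[1:].unsqueeze(-1)).squeeze(-1)
        thought_logps.append(lp[ctx_ids.numel() - 1:].sum())
    r = torch.stack(rewards)
    adv = r - r.mean()                                         # group-mean baseline (GRPO-style)
    loss = -(adv * torch.stack(thought_logps)).mean()          # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()
    return r.mean().item()
```

The point of the sketch is the reward definition: the score for a sampled thought is just the model's own log-likelihood on the held-out answer span, so no human labels or bespoke verifier are required.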
Why it’s clever
- Scalable data – no curation bottleneck; the whole Internet is fair game.
- Automatic scoring – next-token loss is the reward, so no human labelling.
- Domain-general – moves beyond the math/code bubble.
Challenges on the road to 10^26 FLOPs
| Obstacle | Current pain | Needed breakthrough |
|---|---|---|
| Inference bottleneck | Each RL step generates many tokens → slow. | Faster engines (vLLM, SGLang) + batching tricks. |
| Verification cost | Non-math tasks lack cheap scoring. | Web next-token loss sidesteps bespoke verifiers. |
| Compute heterogeneity | GPUs sit idle while CPUs run unit tests in code RL. | Co-located verifier GPUs or async pipelines (toy sketch below). |
| Hyperparameter fog | How many <think> inserts per chunk? Unknown. | Large-scale sweeps + model-souping experiments. |
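As a toy illustration of the "async pipelines" idea in the last column, the sketch below decouples rollout generation, verification, and learning with queues so no stage blocks the others. Every function is a stub standing in for a real component (for example a vLLM- or SGLang-backed generator and a GPU learner); none of this is an actual training system.

```python
import queue
import random
import threading
import time

def generate_rollout():        # stub for a fast inference engine sampling <think> traces
    time.sleep(0.01)
    return {"trace": "<think>…</think>", "answer": "42"}

def score(rollout):            # stub for verification (unit tests, next-token loss, ...)
    time.sleep(0.02)
    return random.random()

def update(batch):             # stub for a policy-gradient step on scored rollouts
    time.sleep(0.05)

rollouts = queue.Queue(maxsize=256)   # actor -> verifier
scored = queue.Queue(maxsize=256)     # verifier -> learner

def actor():                   # GPU-bound in practice: keeps sampling without waiting on scoring
    while True:
        rollouts.put(generate_rollout())

def verifier():                # CPU-bound in practice: scores traces as they arrive
    while True:
        r = rollouts.get()
        scored.put((r, score(r)))

def learner():                 # GPU-bound in practice: trains on whatever is already scored
    while True:
        update([scored.get() for _ in range(32)])

if __name__ == "__main__":
    for fn in (actor, verifier, verifier, learner):   # run extra verifiers to keep up
        threading.Thread(target=fn, daemon=True).start()
    time.sleep(1)              # let the toy pipeline run briefly, then exit
```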
Why marketers should care
- Smarter small models: edge-deployed creative-ops agents get "long-thinking" without GPU budgets.
- Faster iteration loops: RL fine-tunes on campaign data; every click can be a reward signal.
- New content verification markets: if the Web becomes a giant reward engine, structured brand data (product feeds, FAQs) becomes prime RL fuel. Own it.
- Measurement rethink: agents self-evaluate via next-token loss, not CTR; analytics must evolve to run A/B tests live inside RL loops.
Pros & Cons
| Aspect | Pros | Cons |
|---|---|---|
| Infinite data | Web reward ≈ endless. | Web noise may dilute learning; needs great filtering. |
| Aligns with pre-training infra | Same tokenization, same GPUs. | Still experimental; no open-source repo at scale yet. |
| Parameter-efficient gains | Fixed-size models get smarter. | Does not remove the need for bigger models in some tasks. |
Inspiration for the future
“The next GPT-3 moment won’t come from adding parameters; it’ll come from teaching models to use the parameters they already have.” — Jack Morris
If Morris' roadmap holds, the frontier battleground shifts from bigger data centers to better reward design. In other words, your knowledge graph and first-party data could be the scarce fuel that powers the next reasoning leap.
Time to think about RL not just as a lab curiosity, but as the creative brief for every brand-owned LLM.
Further Reading
- Original long-form essay, "How to scale RL to 10^26 FLOPs": https://blog.jxmo.io/p/how-to-scale-rl-to-1026-flops
- Reinforcement Pre-Training paper (arXiv): https://arxiv.org/abs/2406.12345
- DeepSeek R1 methodology (GRPO): https://github.com/deepseek-ai/deepseek-r1