
How to Scale RL to 10^26 FLOPs

Reinforcement learning (RL) is moving from “hard-to-wrangle side quest” to the core scaling path for frontier AI. In this long-form essay, Meta researcher Jack Morris lays out why post-training with RL on web-scale data will unlock new reasoning ability—without simply piling on more parameters.


TL;DR

  • Next wave ≠ bigger models – Pre-training is nearing diminishing returns; RL unlocks new compute inside fixed model sizes.
  • Current RL is tiny – o1-style reasoning models train for thousands of steps vs. millions in pre-training.
  • Verifiable rewards – RLVR (RL with verifiable rewards) flourished in math & code. Morris argues the Web itself can be the next “verifier” via next-token loss.
  • Goal: 10^26 FLOPs – Grok 3 already spent that much on supervised learning; RL must catch up to stay relevant.

Two waves of scaling

  1. 2018-2024 — Token soup
    Scrape the Internet → predict next token → make models bigger.
    Straightforward but compute-hungry.

  2. 2025+ — Reasoning flood
    Teach models to “think” longer via RL.
    Smaller models, more internal compute, new reward signal.


The proposed pivot: RL + Web Next-Token Prediction (RNTP)

“If a model can guess the next token, that is a verifiable reward.” — Morris

How it works

  1. Sample a chunk of Web text.
  2. Let the model insert <think> … </think> tokens: its private chain-of-thought.
  3. Reward it for predicting the <answer> span correctly.
  4. Back-propagate; repeat millions of times.
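Below is a minimal, single-GPU sketch of steps 1-4, assuming Hugging Face transformers, a plain REINFORCE update, and an exact-match reward on the <answer> span. The model name, prompt template, and reward function are illustrative assumptions, not details from Morris' essay.

```python
# Minimal single-GPU sketch of steps 1-4 above: sample a rollout with private
# <think> tokens, reward an exact match on the <answer> span, apply REINFORCE.
# Model name, prompt template, and reward function are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"   # assumption: any small causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
policy = AutoModelForCausalLM.from_pretrained(MODEL)
opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def rntp_step(web_chunk: str, held_out: str) -> float:
    """One RL step: think, guess the held-out continuation, reinforce."""
    prompt = (
        f"{web_chunk}\n"
        "Reason inside <think>...</think>, then write your guess for the "
        "next sentence inside <answer>...</answer>."
    )
    inputs = tok(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # Step 2: sample a rollout containing the private chain-of-thought.
    with torch.no_grad():
        rollout = policy.generate(
            **inputs, max_new_tokens=256, do_sample=True, temperature=1.0,
            pad_token_id=tok.eos_token_id,
        )

    # Step 3: verifiable reward. Did the <answer> span match the held-out text?
    text = tok.decode(rollout[0, prompt_len:], skip_special_tokens=True)
    answer = text.split("<answer>")[-1].split("</answer>")[0].strip()
    reward = 1.0 if answer == held_out.strip() else 0.0

    # Step 4: REINFORCE. Scale the rollout log-probability by (reward - baseline)
    # and back-propagate through the policy.
    logits = policy(rollout).logits[:, :-1, :]
    token_lp = torch.log_softmax(logits, dim=-1).gather(
        -1, rollout[:, 1:, None]
    ).squeeze(-1)
    rollout_lp = token_lp[:, prompt_len - 1:].sum()
    loss = -(reward - 0.5) * rollout_lp    # 0.5 is a crude constant baseline

    opt.zero_grad()
    loss.backward()
    opt.step()
    return reward
```

A frontier-scale run would presumably swap the exact-match reward for a softer token-level score, use a learned or group-mean baseline (PPO/GRPO-style), and batch thousands of rollouts per step; the sketch only pins down the shape of the loop.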

Why it’s clever

  • Scalable data – no curation bottleneck; the whole Internet is fair game.
  • Automatic scoring – next-token loss is the reward, so no human labelling (sketched after this list).
  • Domain-general – moves beyond the math/code bubble.
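One way to read the “automatic scoring” point, sketched below under the same assumptions as before: reward a sampled chain-of-thought by how likely the model finds the true web continuation after reading it, i.e. by minus the next-token loss. The function name and the choice to score the true continuation (rather than an exact answer match) are assumptions for illustration.

```python
# Hedged sketch of next-token loss as the reward signal: score a context plus a
# sampled <think> block by the mean log-likelihood the model assigns to the
# true continuation. Names and the scoring choice are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def continuation_reward(model: AutoModelForCausalLM, tok: AutoTokenizer,
                        context_plus_thought: str, true_continuation: str) -> float:
    """Mean log-prob of the true continuation tokens (minus the next-token loss)."""
    ctx_ids = tok(context_plus_thought, return_tensors="pt").input_ids
    cont_ids = tok(true_continuation, add_special_tokens=False,
                   return_tensors="pt").input_ids
    full = torch.cat([ctx_ids, cont_ids], dim=1)

    with torch.no_grad():
        logits = model(full).logits[:, :-1, :]      # position t predicts token t+1
    token_lp = torch.log_softmax(logits, dim=-1).gather(
        -1, full[:, 1:, None]
    ).squeeze(-1)

    cont_lp = token_lp[:, ctx_ids.shape[1] - 1:]    # keep only continuation slots
    return cont_lp.mean().item()
```

Dropped into the earlier loop in place of the exact-match check, this turns any stretch of web text into a graded, label-free reward.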

Challenges on the road to 10^26 FLOPs

  • Inference bottleneck – Current pain: each RL step generates many tokens → slow. Needed breakthrough: faster engines (vLLM, SGLang) + batching tricks (see the sketch after this list).
  • Verification cost – Current pain: non-math tasks lack cheap scoring. Needed breakthrough: Web next-token loss sidesteps bespoke verifiers.
  • Compute heterogeneity – Current pain: GPUs sit idle while CPUs run unit tests in code RL. Needed breakthrough: co-located verifier GPUs or async pipelines.
  • Hyperparam fog – Current pain: how many <think> inserts per chunk? Unknown. Needed breakthrough: large-scale sweeps + model-souping experiments.
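For the inference bottleneck, here is a hedged sketch of what “faster engines + batching tricks” can look like in practice: push every rollout for an RL step through vLLM's offline engine in one continuously batched call. The model name and prompt template are assumptions carried over from the earlier sketch.

```python
# Batched rollout generation with vLLM: one engine call produces n sampled
# <think>/<answer> rollouts for every web chunk in the batch.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")    # assumption: any small causal LM
sampling = SamplingParams(temperature=1.0, max_tokens=256, n=8)  # 8 rollouts/chunk

def batched_rollouts(web_chunks: list[str]) -> list[list[str]]:
    """Generate all rollouts for one RL step in a single batched pass."""
    prompts = [
        f"{chunk}\nReason inside <think>...</think>, then write your guess for "
        "the next sentence inside <answer>...</answer>."
        for chunk in web_chunks
    ]
    outputs = llm.generate(prompts, sampling)     # continuous batching inside vLLM
    return [[o.text for o in out.outputs] for out in outputs]
```

The sampled texts then feed the reward and update steps sketched earlier, keeping the GPUs busy with one big generation pass instead of many small ones.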

Why marketers should care

  1. Smarter small models

    • Edge-deployed creative-ops agents get “long-thinking” ability without big GPU budgets.
  2. Faster iteration loops

    • RL fine-tunes on campaign data; every click can be a reward signal.
  3. New content verification markets

    • If the Web becomes a giant reward engine, structured brand data (product feeds, FAQs) becomes prime RL fuel—own it.
  4. Measurement rethink

    • Agents self-evaluate via next-token loss, not CTR; analytics must evolve to run live A/B tests inside RL loops.

Pros & Cons

  • Pro: Infinite data – the Web-as-reward supply is effectively endless. Con: Web noise may dilute learning; needs great filtering.
  • Pro: Aligns with pre-training infra – same tokenization, same GPUs. Con: Still experimental; no open-source repo at scale yet.
  • Pro: Parameter-efficient gains – fixed-size models get smarter. Con: Doesn’t remove the need for bigger models in some tasks.

Inspiration for the future

“The next GPT-3 moment won’t come from adding parameters; it’ll come from teaching models to use the parameters they already have.” — Jack Morris

If Morris’ roadmap holds, the frontier battleground shifts from bigger data centers to better reward design. In other words, your knowledge graph and first-party data could be the scarce fuel that powers the next reasoning leap.

Time to think about RL not just as a lab curiosity, but as the creative brief for every brand-owned LLM.


Further Reading