
Past the First Rung

An earlier post on this blog argued that an AI can tell you what usually goes with what, but not what actually causes things to happen — that the current generation of AI lives on the first rung of what the statistician Judea Pearl called the ladder of causation. Rung one is correlation. Rung two is intervention: you change something and see what changes with it. Rung three is counterfactual: you ask what would have happened if the world had gone differently. The AI, no matter how large, does rung-one work.

That post stopped at the ceiling. This one goes past it.

If the ceiling is real, what would it take to break through? What do the current systems actually produce when they work? And what does all of this tell us about where human labor is going — not in the hypothetical future, but right now?

The ceiling is not a bug

A common misunderstanding is that the AI's rung-one limitation is a defect that will get fixed by the next release, or the release after that. Just make the model bigger. Train it on more data. Give it more chain-of-thought steps. Surely at some point it becomes smart enough to reason about causes.

This is wrong, and it is wrong for a reason worth understanding.

An AI language model is built out of a joint probability distribution — a giant statistical summary of which things tend to appear with which other things across everything it has been trained on. When you query it, the system is sampling from that distribution, weighted by your prompt. Making the model bigger gives you a more detailed joint distribution. Training on more data gives you a denser one. Adding reasoning steps lets you chain multiple samples together. But the thing being sampled from is still a joint distribution: a record of what goes with what.
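The operation can be sketched in miniature. The toy below uses a hand-made bigram table (standing in for the vastly larger statistics a real model compresses) and continues a prompt by sampling whatever tended to follow, weighted by frequency; every name and count here is invented for illustration:

```python
import random

random.seed(1)

# A toy joint distribution: bigram counts from an imaginary tiny corpus.
bigrams = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 4},
    "dog": {"ran": 2},
}

def continue_prompt(word: str, steps: int = 2) -> str:
    out = [word]
    for _ in range(steps):
        followers = bigrams.get(out[-1])
        if not followers:
            break
        words, counts = zip(*followers.items())
        # Sample the next word in proportion to how often it co-occurred.
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)
```

Every continuation of `"the"` is a recombination of what was already in the table; nothing in the procedure asks why "cat" follows "the".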

The Causal Hierarchy Theorem, proved in 2020 by Elias Bareinboim, Juan Correa, Duligur Ibeling, and Thomas Icard, establishes that you cannot, even in principle, derive interventional or counterfactual answers from a joint distribution alone. The information simply is not there. You have to add something — a causal model, a real intervention, an assumed causal structure — from outside the distribution to get rung-two answers. No amount of scaling the joint distribution substitutes for that missing ingredient, because what's missing is structural, not quantitative.
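A minimal illustration of why the information is missing: the two toy models below (invented for illustration) produce identical observational data, yet disagree the moment you intervene. No statistics computed on the observational samples can tell them apart.

```python
import random
from collections import Counter

random.seed(0)
N = 100_000

# Model A: X causes Y (Y copies X).
def model_a(do_x=None):
    x = random.randint(0, 1) if do_x is None else do_x
    y = x
    return x, y

# Model B: Y causes X (X copies Y).
def model_b(do_x=None):
    y = random.randint(0, 1)
    x = y if do_x is None else do_x
    return x, y

# Rung one: the observational joints are indistinguishable.
joint_a = Counter(model_a() for _ in range(N))
joint_b = Counter(model_b() for _ in range(N))
# Both put ~half their mass on (0, 0) and ~half on (1, 1).

# Rung two: under do(X=1) the models disagree.
p_y1_a = sum(model_a(do_x=1)[1] for _ in range(N)) / N  # exactly 1.0 in model A
p_y1_b = sum(model_b(do_x=1)[1] for _ in range(N)) / N  # ~0.5 in model B
```

The extra ingredient that separates the two answers is the arrow direction, which lives in the model definitions, not in the samples.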

Think of it like a map. A very detailed map of a city can tell you that the bakery and the coffee shop are on the same block. It cannot tell you whether the bakery opened because of the coffee shop's foot traffic, or whether the coffee shop opened because the bakery had already drawn customers. To answer the causal question you need information the map doesn't contain — historical records, owner interviews, a controlled test where you close one and see what happens to the other. More mapping detail doesn't help. The mapmaker would have to become a historian or an economist, which is a different job.

The AI is a mapmaker. A very, very good one. But it is not a historian, not an economist, not an experimentalist. The ceiling is architectural — it is a property of what the system is, not a limit on how much of it there is.

The shape of what it produces

Knowing that the ceiling is architectural changes how you read what the AI produces.

When an AI writes an essay on a subject you don't know much about, it is drawing on everything it has read about that subject and recombining pieces that have plausibly gone together before. The essay can be correct. It can be coherent. It can cite real sources and make arguments that hold up under scrutiny. But the operation at the bottom is always some version of: what pieces from my training would plausibly fit here, and in what arrangement?

This produces outputs with a specific signature you can learn to see. When an AI synthesizes two disparate fields — when it writes about, say, how Buddhist philosophy connects to quantum mechanics, or how nineteenth-century labor history bears on modern gig work — what it is doing is finding the overlap region in its training data where both fields were discussed together, or where the same concepts were applied to both. The synthesis feels novel because most readers haven't seen those authors or those applications. But the synthesis is not invented at the moment of writing. It is retrieved — or more precisely, recombined — from material the AI has already seen.

This is valuable. A huge amount of human intellectual labor is also recombination. Literature reviews. Cross-disciplinary translations. Plain-language explanations of technical material. Style imitation. When you say an essay is "well-read," you are often saying it successfully combined things its author had been exposed to. The AI is, on this particular dimension, astoundingly well-read.

The signature reveals itself in what the AI cannot do. Ask it to predict what a genuinely novel intervention will produce — what would happen if you ran a specific never-before-tried experiment — and its answer will draw on whatever it has read about similar experiments. If no such experiments exist in its training, the answer will be plausible but uninformative. It is recombining from empty space. The output will still look confident. It will still read well. But the prediction doesn't come from understanding the mechanism; it comes from finding the statistical shadow that best fits the shape of the prompt.

Once you see this, you see it everywhere. The fluent-sounding confidence not backed by anything underneath. The plausibly-worded answer to a question the model has no way to know. This isn't deception. The model isn't deceiving you, because it isn't modeling your epistemic state at all. It is producing the most probable continuation, and the probable continuation to a well-formed question is a well-formed answer — regardless of whether the answer is right.

What would it take to climb higher

If the rung-one ceiling is architectural, what architectures could move past it? Researchers have identified several real paths, none of which are "make the current systems bigger." They are different systems built on different assumptions.

Learning causes, not just correlations. Instead of training on text that describes correlations, train systems to represent causal structure directly. The research program is called causal representation learning, and it is hard. The 2020 theorem says that a joint distribution alone can't supply causal structure, so the research is about which supplementary assumptions are reasonable, and what kinds of data or interventions can provide the missing information.

Intervention in the world. A system that only reads about the world can never verify a cause. A system that can actually change things — a robot picking up objects, an agent running experiments — can, in principle, learn at rung two. This is the domain of embodied agents and world-model-based reinforcement learning. Language-model-based agents that interact with software environments fall somewhere in between: they intervene in code and web APIs, but not in physical reality, and the mapping to real-world causation is weaker than what laboratory experiments provide.

Pairing with symbolic reasoning. Language models are fluent but unreliable; formal logic systems are reliable but brittle. A growing research direction pairs them: the language model generates candidate answers in natural language, and a formal engine checks whether those answers are consistent with known facts or derivable from stated premises. The formal engine provides the discipline the language model lacks. This is called neurosymbolic AI.

Training on simulated experiments. Rather than training on text about experiments, train on the experiments themselves — simulated in a physics engine, a biological-process model, or an economic-agent simulation. The model then learns what happens when interventions are applied, not just what text reports about them. Robotic-control work already does a version of this when it pretrains in simulation before deploying in the physical world.

Systems that design their own experiments. Active learning is a research program in which the system decides which question to answer next based on what would most reduce its uncertainty. A system that can propose, run, and interpret experiments is, functionally, a scientist. Current systems fall far short of general-purpose self-directed science, but prototypes exist in narrow domains — protein folding, materials discovery, drug screening — where the experimental loop has been automated end to end.
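The selection rule at the heart of active learning fits in a few lines. The candidate experiments and probabilities below are hypothetical; in a real system they would come from the model's own predictive uncertainty:

```python
import math

# Hypothetical probabilities the system assigns to each candidate
# experiment succeeding.
candidates = {"exp_A": 0.97, "exp_B": 0.55, "exp_C": 0.10}

def entropy(p: float) -> float:
    # Uncertainty of a yes/no outcome, in bits; maximal at p = 0.5.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Query selection: run the experiment whose outcome is least predictable,
# since that is the one whose answer teaches the system the most.
next_experiment = max(candidates, key=lambda k: entropy(candidates[k]))
# next_experiment == "exp_B"
```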

Explicit counterfactual modeling. Instead of hoping counterfactual reasoning emerges from language-model training, build systems with a dedicated representation of counterfactual scenarios: what would have happened if X had been different. This has long been the province of causal inference in statistics and economics, and contemporary research is attempting to bring those tools inside neural architectures.
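The classical recipe for computing a counterfactual is abduction, action, prediction. A minimal sketch, assuming a one-equation structural causal model invented for illustration:

```python
# A toy structural causal model: Y = 2*X + U, with U an exogenous noise term.
def f_y(x: int, u: int) -> int:
    return 2 * x + u

# Observed world: X = 1, Y = 5.
x_obs, y_obs = 1, 5

# Abduction: infer the noise consistent with what was actually observed.
u = y_obs - 2 * x_obs        # u = 3

# Action: set X to its counterfactual value, overriding its usual causes.
x_cf = 0

# Prediction: recompute Y under the same noise term.
y_cf = f_y(x_cf, u)          # y_cf = 3: "had X been 0, Y would have been 3"
```

Note what the recipe requires: the structural equation itself. A joint distribution over X and Y alone would not have supplied it.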

Interventional training data. The sticking point in causal representation learning is that observational data underdetermines causal structure. One response is to collect data from actual controlled experiments, and train on that. Such data is expensive to collect — interventions cost more than observations — but the information per sample is radically higher than what observational data provides.

Human-in-the-loop coupling. For the foreseeable future, the most reliable systems pairing rung-one language capability with rung-two reasoning are humans holding a language model at arm's length — using the AI for pattern-matching while preserving human judgment for causal claims. This is not a research pathway so much as a deployment pattern, and it is what you are already doing when you use AI responsibly.

None of these paths is anywhere near the maturity that current systems have reached at rung one. All of them are hard. Most are a decade or more of research away from producing something a non-specialist would recognize as general-purpose. But they are the honest paths forward. "Bigger model, more data" is not on the list.

Where the labor is going

There is a useful distinction between two kinds of intellectual work. Pattern-finding is looking for regularities in what already exists. Mechanism-finding is figuring out what process produces those regularities. Most scientific and professional work is a dance between the two — you find a pattern, you propose a mechanism, you test the mechanism, you find a new pattern the mechanism predicts, and so on.

AI has radically cheapened pattern-finding. Reading thousands of papers to find the consensus view on a topic used to be a full-time job for a graduate student. Now it takes an hour. Writing up the literature review — another full-time job — is reduced to editing. The entire first half of the dance is being commoditized.

The second half — mechanism-finding — is exactly the rung-two work the ceiling excludes. Designing an experiment that could actually discriminate between two hypotheses. Running a clinical trial. Observing what happens when you change a manufacturing process. Debugging a system by swapping components. None of this is done by the AI. All of it is human work, and all of it has become proportionally more valuable as the other half cheapened.

This predicts several effects that are starting to be visible in labor markets right now:

  • Jobs centered on literature synthesis (entry-level paralegal, first-pass research, science journalism on settled topics, technical writing on established subjects) are compressing. Entry-level positions are disappearing faster than senior ones, because senior practitioners add the causal judgment the AI can't supply.

  • Jobs centered on intervention — field research, clinical practice, skilled trades, lab work, hands-on teaching — are not compressing. In several of these, the junior-to-senior wage gap is narrowing in favor of juniors, because employers need more people doing the actual intervention work.

  • Education is shifting, slowly, toward experimental design as an early skill rather than a late-graduate-school one. Statistical literacy is increasingly treated as a prerequisite across undergraduate science curricula. The reason is that the skill of asking how would we know? is now a basic competency rather than a specialization.

  • Hiring is slowly decoupling from written-output signals. A five-page case study used to signal domain competence. Now it might be written in thirty seconds by the AI. The credible signal has moved to process — drafts, notebooks, how a candidate thinks in real time, whether they can handle a case they haven't memorized or prepared for.

These are early effects. They will take a decade to fully play out. But they are all downstream of the same architectural fact: the AI is extremely good at rung one and structurally incapable of rung two.

What to do about it

The practical response is the same advice as before, sharpened.

Use AI for pattern work. Drafting. Summary. Translation. First passes at known problems. This is where the cost savings are, and they are substantial.

Do not outsource mechanism work. When the question is why did this happen or what will happen if we do X, the AI can tell you what the literature says about similar situations. It cannot tell you what will happen in your situation. That answer requires either experience (a practitioner who has seen the specific pattern in the specific conditions), an experiment (a test you design yourself), or explicit causal modeling (a tool most non-specialists don't have). Find one of the three. Don't settle for "the AI said so."

Learn to recognize the signature of recombination. When an answer is too smoothly plausible — when it reads well but doesn't name the specific mechanism, doesn't predict a specific outcome, doesn't stake anything — suspect recombination. Ask the AI to make a falsifiable prediction. Watch what happens. Usually the prediction will dissolve into hedges. That's the signature.

Value the people who test things. The world still runs on rung-two work. Engineers, lab scientists, field researchers, nurses, mechanics, carpenters, teachers, clinicians. The rhetoric that "AI will replace them" gets the direction exactly wrong. These are the jobs the ceiling actually protects. They are also, not coincidentally, the jobs that most directly affect whether your life goes well.

The ladder of causation is not infinite. We are on a rung, and that rung is where most of the real work is. Knowing that changes what to learn, what to trust, and what to teach your kids — and whom to pay well.


Keep reading

Four Roads to the Same Ceiling is a graduate-level bridge through the four research literatures whose intersection produces the argument these posts have been making: Pearl's causal inference and the Causal Hierarchy Theorem, Schölkopf's causal representation learning, the Bayesian-manifold reading of language-model inference, and Boden's taxonomy of creativity. It ends by stating the exact finding of the technical paper this series has been unfolding, with proper citations, and is the post to read if you want to engage with the underlying research directly.

Four Roads to the Same Ceiling


Originating prompt:

For the next blog article, carry on the entracement at a under grad level connection between doc 436 and the first blog post. Append this prompt to the artifact, and likewise, leave no authorial trace but this prompt.