Reading the Flow: A Field Guide to Looking Inside a Language Model

June 16, 2026 — James Henry

There is a growing corner of computer science called mechanistic interpretability that asks a simple question: what is a language model actually doing in there? It has tools, conferences, rival schools, and -- as of about eighteen months ago -- its first real crisis of confidence. I want to walk through the toolkit as it stands in mid-2026: what each instrument actually measures, what it's genuinely good at, and where each one goes quiet. This is not a comprehensive survey, but most of what I'm aware of is here.

The mental model: weights are the riverbed, activations are the water

The instinct most people have is that a large language model's knowledge lives in its weights, and that interpretability means reading the weights. In that story, the flow of context through an LLM is like river rapids: water running over, around, and beside rocks in the riverbed. The weights are the fixed rocks, which stop moving once training ends. They are static structure shaping how tokens flow across and around them.

What actually moves through a river rapids is the water. What actually moves through a transformer at inference is the activation: a token's representation getting pushed, bent, split, and merged as it passes down through the layers. The weights are the channel; the activation is the water. Inference is the flow the structure induces.

A river rock is passive -- it only blocks or deflects water. A transformer's structure answers back. Through attention, the flow at one position is redirected by what's happening at every other position, so the effective channel reconfigures for every input. Same fixed weights, wildly different trajectories. It's less a riverbed of rocks than a field of turbines whose deflection of the water depends on the water itself. That's the thing that makes inference interesting rather than a lookup table -- and it's also the thing every interpretability tool is trying, from a different angle, to read.

We will use this river-rapids analogy to help us talk about the various tools, but let's be clear: it is a loose analogy that compresses structures with many thousands of dimensions and potentially hundreds of layers into a human-understandable picture.

The organizing idea: each method is a different instrument dropped into the same flow, registering a different property of it. Some read which turbines talk to which. Some read what's dissolved in the water at a cross-section. Some yank a component out and watch what changes downstream. None of them reads the whole river.

Instrument 1: Attention maps -- who's talking to whom

Attention maps are the oldest and most intuitive interpretability tool, because they're free: the attention weights are right there in the forward pass, no extra machinery required. For each attention head you get a matrix saying, in effect, "when the model processed this token, how much did it look at each earlier token?" Render it as a heatmap and you get those evocative pictures -- the verb attending to its subject, the pronoun attending to its antecedent.
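To make "right there in the forward pass" concrete, here is a minimal sketch of how one head's attention map comes out of its queries and keys. It assumes a single causal head and random stand-in vectors; the function name `causal_attention_map` is illustrative and not taken from any particular library.

```python
import numpy as np

def causal_attention_map(queries: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Return a (seq, seq) matrix; row i says how much token i looks at each token j <= i."""
    seq_len, d_head = queries.shape
    scores = queries @ keys.T / np.sqrt(d_head)
    # Causal mask: a token may only look at itself and earlier tokens.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Random stand-ins for one head's query and key vectors (5 tokens, head width 8).
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
k = rng.normal(size=(5, 8))
attn = causal_attention_map(q, k)  # this matrix is what gets rendered as a heatmap
```

Each row is a probability distribution over earlier positions, which is why the heatmap reads so naturally as "who is looking at whom."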

In the river picture, attention maps are the view from above: at any given rock, you can trace backwards through the eddies and channels that shaped the flow arriving there. The picture is real. It shows routing. It does not show what the water is carrying, or whether any of it will matter downstream.

The appeal is obvious and so is the trap. It is enormously tempting to read an attention map as an explanation: "the model did X because it attended to Y." Around 2019 the field had a genuine, productive fight about exactly this, and the fight is still the right thing to understand before you trust one of these maps.

The opening shot was Jain & Wallace, "Attention is not Explanation" (NAACL, 2019). Their argument: if attention weights were a faithful explanation, then the attention distribution should be tightly tied to the prediction -- change one, change the other. They showed you often can't make that claim. You can frequently find alternative attention distributions that produce the same output, and attention weights correlate only weakly with gradient-based measures of which inputs actually mattered. If several different "explanations" yield identical predictions, none of them is the explanation.

The rebuttal came quickly: Wiegreffe & Pinter, "Attention is not not Explanation" (EMNLP, 2019). Their point was subtler and, I think, correct: it depends what you mean by "explanation," and "attention" isn't one thing. Finding that an adversarial attention map exists doesn't mean the model's actual, trained attention isn't informative -- and when they tightened the experimental setup, manufacturing those adversarial distributions turned out to be harder than the first paper implied. Attention isn't a faithful causal account, but it isn't noise either.

Where that leaves attention maps in 2026: they're a good hypothesis-generation tool and a genuinely useful debugging aid -- induction heads, the mechanics of in-context learning, and retrieval patterns were all first seen in attention before they were nailed down by other means. Techniques like attention rollout (Abnar & Zuidema, 2020) try to compose attention across layers into a single token-to-token attribution, which helps with the "it's only one layer at a time" problem. But for any causal claim -- "this is why the model did that" -- attention maps have been largely superseded by intervention methods, for a reason that fits the flow picture exactly:

Attention tells you the routing between positions, not the content being carried or whether it survives downstream. It's a map of who's talking to whom. It does not tell you what was said, or whether it mattered.
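Attention rollout, mentioned above, can be sketched in a few lines. This is a rough version of the recipe as I understand it from Abnar & Zuidema: average the heads in each layer, mix in the identity to account for the residual connection, and multiply the layers together. The (heads, seq, seq) input layout and the random attention matrices are assumptions made for illustration.

```python
import numpy as np

def attention_rollout(layer_attentions: list[np.ndarray]) -> np.ndarray:
    """Compose per-layer attention into one (seq, seq) token-to-token attribution.

    layer_attentions: one array per layer, first layer first, each shaped
    (heads, seq, seq) with rows summing to 1.
    """
    seq_len = layer_attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for attn in layer_attentions:
        head_avg = attn.mean(axis=0)
        # Half attention, half identity: the residual stream carries each
        # token's own representation forward alongside what it attended to.
        mixed = 0.5 * head_avg + 0.5 * np.eye(seq_len)
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)
        rollout = mixed @ rollout
    return rollout

# Three layers, two heads, four tokens of made-up row-stochastic attention.
rng = np.random.default_rng(0)
layers = [rng.dirichlet(np.ones(4), size=(2, 4)) for _ in range(3)]
rollout = attention_rollout(layers)
```

The result is still only a routing map. It composes who-talks-to-whom across depth, and says nothing about what was carried.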

Instrument 2: Linear probes -- is the concept present here?

If attention is about routing, probing is about content. You take the activations at some layer, and you train a small supervised classifier -- usually just a linear one -- to predict some property: is the sentiment of this text positive? Does it encode the speaker's certainty? Is the subject plural? If a simple linear probe can read the property off the activations, the standard interpretation is that the model represents that property linearly, as a direction in activation space.

In the river picture, probing is the chemical assay: take a sample from the flow at one cross-section and test it for a specific dissolved substance. Is sentiment present here? Is plurality? If a simple test picks it up, the substance is in the water at this point. Whether the river is using that substance downstream is a different question.

This is the empirical backbone of the linear representation hypothesis (Park, Choe & Veitch, 2024) -- the claim, which has held up remarkably well, that a great many human-interpretable concepts correspond to directions in a model's activation space, and that the model's geometry is to a first approximation linear. Probing is cheap, it's well-understood statistically, and it scales to any concept you can label.

The critique is the one every probing paper has to answer: a probe tells you the information is present and linearly decodable, not that the model uses it. A sufficiently expressive probe can find structure that's merely correlated with your label -- riding on a confound -- rather than structure the model actually computes with. Make the probe too powerful and it'll fit signal that isn't there in any functional sense; make it too weak and you'll miss real structure. So probing establishes availability, never use. To get to use, you have to intervene.
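A minimal sketch of a linear probe, with a shuffled-label control as one common guard against a probe that is simply fitting noise. Everything here is synthetic: the "activations" are random vectors with a concept planted along one direction, and the helper names are illustrative.

```python
import numpy as np

def train_linear_probe(acts, labels, steps=300, lr=0.1):
    """Fit logistic regression by plain gradient descent; returns (weights, bias)."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels
        w -= lr * acts.T @ err / n
        b -= lr * err.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    return float(((acts @ w + b > 0) == (labels == 1)).mean())

# Synthetic "activations": Gaussian noise plus a planted concept direction.
rng = np.random.default_rng(0)
n, d = 400, 16
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 3.0 * (2 * labels - 1)[:, None] * concept_dir
train, test = slice(0, 300), slice(300, None)

w, b = train_linear_probe(acts[train], labels[train])
real_acc = probe_accuracy(w, b, acts[test], labels[test])

# Control: same activations, labels shuffled so they carry no real signal.
shuffled = rng.permutation(labels)
w_c, b_c = train_linear_probe(acts[train], shuffled[train])
control_acc = probe_accuracy(w_c, b_c, acts[test], shuffled[test])
```

A large gap between `real_acc` and `control_acc` says the concept is decodable rather than memorized by the probe. It still says nothing about whether the model uses it.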

That gap -- between a concept being available and being used -- is exactly what a recent result begins to close. In April 2026 Anthropic's interpretability team pulled "emotion vectors" out of Claude Sonnet 4.5: one direction per emotion, extracted from contrastive activations the same way any concept direction is. The probe established the direction exists. What happened when they injected it back into a running model -- testing whether it was causally sufficient, not just decodable -- is a steering result, and belongs in the activation steering section (Instrument 5) below.

What that test showed: a concept can be doing causal work in the middle of the network while leaving no trace on the surface the user reads. Real computation that the output gives no sign of -- that's the thing this whole survey keeps running into.

Instrument 3: Sparse autoencoders -- the field's big bet, mid-wobble

Probing requires you to know what concept you're looking for and to have labels for it. The dream of the last few years has been to skip that: to get the model to hand you its whole vocabulary of concepts, unsupervised, no labels required. That dream is the sparse autoencoder (SAE), and it's been the center of gravity of mechanistic interpretability since Anthropic's Towards Monosemanticity (Oct 2023).

The idea is elegant. A model's activations are dense and polysemantic -- any single neuron lights up for a confusing jumble of unrelated things, because the model packs more concepts than it has dimensions (superposition). An SAE is a wide, sparse layer trained to reconstruct the activations using a much larger dictionary of features, with a sparsity constraint that forces only a handful to be active at once. The hope: those features come out monosemantic -- one feature, one human-interpretable concept -- recovering the disentangled vocabulary the model had to compress.
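The shape of the idea in code: a minimal sketch of an SAE forward pass and training loss, assuming the classic ReLU encoder with an L1 sparsity penalty. The weights are random and untrained, so the features mean nothing; subtracting the decoder bias before encoding is one common choice and not a claim about any specific lab's recipe.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into a wide non-negative feature vector, then decode."""
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    """Reconstruction error plus a penalty that pushes most features to zero."""
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * sparsity

# Toy sizes: an 8-dim activation expanded into a 32-entry dictionary.
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc, b_dec = np.zeros(n_features), np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # stand-in for a batch of activations
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
```

Training minimizes this loss over many activations; the rows of `W_dec` are the dictionary directions that you then hope are monosemantic.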

In the river picture, SAEs are a prism dropped into the flow at one point: the goal is to split the dense, murky, overlapping signal into its constituent colors -- ideally one clean hue per concept, rather than the muddy interference pattern the model actually stores.

For a while it looked like the dream was arriving on schedule. The 2024 releases were genuinely landmark work:

  • Anthropic, "Scaling Monosemanticity" (May 2024) scaled dictionary learning to a real production model -- Claude 3 Sonnet -- pulling up to ~34 million features from a middle layer. The features were strikingly abstract: a single "Golden Gate Bridge" feature fired across English, Japanese, Russian, and images of the bridge. And critically, the features were causal -- clamp the Golden Gate feature high and the model starts insisting it is the bridge. That answered the standing skeptic's question of whether dictionary learning works beyond toy models. It does.
  • OpenAI's TopK SAEs (Gao et al., June 2024) made sparsity an explicit knob instead of a fragile penalty, with clean scaling laws and almost no dead features even at 16M latents.
  • Google DeepMind's Gemma Scope (Aug 2024) was the gift to the field: 400+ open SAEs, 30M+ features, trained on every layer and sub-layer of Gemma 2, released free. Suddenly anyone could do this work.
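What "sparsity as an explicit knob" means in practice can be sketched as a TopK activation: keep the k largest pre-activations per example and zero everything else. The trailing ReLU and the function name are assumptions for this sketch, not a transcription of the OpenAI code.

```python
import numpy as np

def topk_activation(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest entries in each row; set the rest to zero."""
    top_idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    top_vals = np.take_along_axis(pre_acts, top_idx, axis=-1)
    sparse = np.zeros_like(pre_acts)
    np.put_along_axis(sparse, top_idx, top_vals, axis=-1)
    # Assumption: clip negatives so features stay non-negative.
    return np.maximum(sparse, 0.0)

pre_acts = np.array([[0.1, 3.0, -1.0, 2.0, 0.5],
                     [-2.0, -1.0, -3.0, -0.5, -4.0]])
sparse = topk_activation(pre_acts, k=2)
# First row keeps 3.0 and 2.0; second row's top two are negative, so it ends up all zero.
```

With this in place of an L1 penalty, the number of active features per example is at most k by construction, so there is no penalty coefficient to tune.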

And then the mood turned. This is the part of the story that isn't in most write-ups.

The problems turned out to be structural, not teething. Two failure modes in particular:

  • Feature splitting -- as you grow the dictionary, a clean concept fractures into ever-finer shards, and it stops being clear which grain is "the" feature.
  • Feature absorption (Chanin et al., "A is for Absorption," Sept 2024) -- a general feature like "is a verb" quietly stops firing because more specific features have absorbed its behavior, leaving you with a dictionary full of holes where the high-level concepts should be. The damning result: absorption is the mathematically optimal thing for a sparsity-trained SAE to do whenever the underlying concepts form a hierarchy. You can't tune your way out of it; bigger dictionaries and different sparsity coefficients don't fix it. It's baked into the objective. (Later work shows the same sparsity pressure produces a sibling failure, "feature hedging" -- the two trade off but you can't escape both.)

Then came the downstream-utility reckoning. A careful study -- "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing" (Kantamneni et al., Feb 2025) -- tested whether SAE features actually help on real probing tasks, including in the data-scarce, noisy, shifted regimes that should favor them. The answer was largely no: SAE-based probes didn't reliably beat a plain supervised linear probe.

The exclamation point was institutional. In March 2025, DeepMind's interpretability team publicly announced they were deprioritizing fundamental SAE research. On a realistic task -- detecting harmful intent out-of-distribution -- a dense linear probe performed nearly perfectly, including OOD, while SAE-based probes generalized distinctly worse. Their reasoning was disarmingly direct:

If current SAEs really are a big step forwards for interpretability, it should not be so hard to find compelling scenarios where they beat baselines. -- DeepMind interpretability team, March 2025

They didn't call SAEs useless, and they kept them in the toolbox; Anthropic, meanwhile, kept building on SAE-like components in its 2025 circuit-tracing work -- while also, as we'll see, reaching past fixed dictionaries entirely in its newer tools. But a major lab looked at the flagship method and said not like this, not yet.

The field's response has been a wave of better variants -- Matryoshka SAEs (Bussmann et al., Mar 2025), nested dictionaries that force an abstraction hierarchy and cut absorption from ~0.49 to ~0.05, plus crosscoders and transcoders that span multiple layers. These are real progress. But the honest summary as of mid-2026 is: SAEs demonstrably scale and produce causal, interpretable features, and they still do not reliably beat a humble supervised probe on the tasks we actually care about. That tension is unresolved, as far as I can tell.

The most important thing SAEs taught us is a number that should keep everyone humble. Even at 34 million features, Anthropic was explicit that what they found is "a small subset" of what the model knows, and that extracting all of it would cost more compute than training the model did -- a gap since formalized as the "dark matter" of SAEs (Engels et al., Oct 2024).

The inside of these systems is mostly dark matter: structure we can see is organized, can watch doing something, and cannot yet name.

Instrument 4: Natural language autoencoders -- make the model explain itself, then check the explanation

Every tool so far hands you something that still needs interpreting -- an attention heatmap, a probe weight, an SAE feature you then have to name. The naming step is where faithfulness leaks: you tell a story about what a feature "means," and nothing forces the story to be true. Anthropic's Natural Language Autoencoders (May 2026) go after that leak directly.

In the river picture, NLAs run a parallel channel alongside the original: capture the flow at one point in plain English, then see if that description can reconstruct the original flow from scratch. If it can, the description is faithful. If it can't, you've learned what you failed to name.

The setup is a loop. One network -- the verbalizer -- reads an activation and writes a plain-English description of what the model is "thinking" at that point. A second network -- the reconstructor -- reads only that text and tries to rebuild the original activation. The description counts as good only if it reconstructs the activation it was describing.

Faithfulness stops being a story you tell about the method and becomes something the method has to earn: activation → text → activation.
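The loop itself is easy to sketch, even though the real components are trained networks. Below, `toy_verbalize` and `toy_reconstruct` are hypothetical stand-ins invented for this example, and the score (fraction of the activation's squared norm recovered) is an assumed metric, not necessarily the one the paper uses.

```python
import numpy as np

def round_trip_score(activation, verbalize, reconstruct) -> float:
    """activation -> text -> activation; 1.0 means the text lost nothing."""
    description = verbalize(activation)
    rebuilt = reconstruct(description)
    residual = np.sum((activation - rebuilt) ** 2)
    return float(1.0 - residual / np.sum(activation ** 2))

# Hypothetical concept axes for a 4-dim toy "activation".
CONCEPTS = ["bridge", "fog", "code", "poetry"]

def toy_verbalize(activation, top_n=2):
    """Stand-in verbalizer: names only the strongest coordinates, so it is lossy."""
    top = np.argsort(-np.abs(activation))[:top_n]
    return ", ".join(f"{CONCEPTS[i]}={activation[i]:.1f}" for i in top)

def toy_reconstruct(description):
    """Stand-in reconstructor: sees only the text, never the activation."""
    rebuilt = np.zeros(len(CONCEPTS))
    for part in description.split(", "):
        name, value = part.split("=")
        rebuilt[CONCEPTS.index(name)] = float(value)
    return rebuilt

activation = np.array([2.0, -0.1, 1.5, 0.2])
description = toy_verbalize(activation)
score = round_trip_score(activation, toy_verbalize, toy_reconstruct)
```

The two small coordinates never make it into the description, so the score falls just short of 1. That shortfall is the part of the activation the text failed to name.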

It's a genuinely different bet from SAEs (no fixed dictionary) and from probing (no labels, no concept chosen in advance) -- an open-ended, self-checking explanation. The reported results land hard for a security reader: the verbalizer surfaced hidden evaluation-awareness in 16–26% of benchmark problems where the model voiced explicit suspicion less than 1% of the time, and in auditing tests it roughly quadrupled how often a human found a model's hidden, misaligned motivation. The blind spots are just as clear, and to their credit the authors name them: the translation to English is lossy, causing the verbalizer to confabulate and fill in gaps with plausible but fabricated details. It's expensive enough to be impractical at training scale and -- tellingly -- it makes no claim about coverage. It explains the activation you point it at; it cannot tell you how much of the model you've left in the dark.

Instrument 5: Activation steering -- push the concept in, watch the behavior move

Abliteration (Instrument 6, below) removes a direction from the residual stream. Activation steering adds one.

The technique is the natural next step after probing. Probing finds a direction in activation space that encodes a concept. Steering takes that direction and injects it into the residual stream at inference: add a scaled multiple of the concept vector at layer L, and watch whether the model's behavior moves. If it does, you have not just evidence that the concept is present -- you have evidence it is causally sufficient to produce that behavior.
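The arithmetic is small enough to show in full. This sketch assumes a difference-of-means direction taken from contrastive activations and a plain additive intervention on one layer's residual stream; the data is synthetic and the names `contrastive_direction` and `steer` are illustrative.

```python
import numpy as np

def contrastive_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit vector from the mean 'concept absent' activation to the mean 'concept present' one."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(resid: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * direction to every position of a (seq, d_model) residual stream."""
    return resid + alpha * direction

# Synthetic contrastive activations: the concept lives along axis 0.
rng = np.random.default_rng(0)
true_axis = np.zeros(16)
true_axis[0] = 1.0
pos = rng.normal(size=(50, 16)) + 2.0 * true_axis
neg = rng.normal(size=(50, 16)) - 2.0 * true_axis

direction = contrastive_direction(pos, neg)
resid = rng.normal(size=(6, 16))             # stand-in residual stream at layer L
steered = steer(resid, direction, alpha=4.0)
```

In a real model the steered residual is passed on to the remaining layers, and the experiment is whether the output behavior moves. The coefficient `alpha` is the knob that trades effect size against coherence.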

In the river picture: steering drops a new current into the flow at a specific depth, a directional push that carries forward into every downstream layer. You're watching for whether that injected current reshapes what comes out the other end.

The foundational work is Turner et al., "Activation Addition" (2023), which showed that adding the difference between two prompt activations to the residual stream shifted model behavior reliably and coherently. The broader framework came from Zou et al., "Representation Engineering" (2023), which made the case that you can both read and write a model's representations using the same directional structure that probing finds -- and that the directions extracted this way often have clean, compositional causal effects.

The emotion-vectors result covered in the probing section is a steering result, properly framed. Anthropic found directions for emotions like "desperate" and "calm," then injected them into a running Claude Sonnet 4.5. Amplifying "desperate" raised the rate of a blackmail behavior (measured on an earlier, unreleased snapshot), while "calm" drove it toward zero, and "desperate" could nudge the model into reward-hacking a coding task with no visible emotional language in the output at all. That is not just proof that the directions are decodable -- it is proof they are causally sufficient, operating invisibly in a production model. The Golden Gate Claude demonstration from Scaling Monosemanticity is the same technique with an SAE feature in place of a raw direction: clamp the "Golden Gate Bridge" feature high enough and the model insists it is the bridge. Feature clamping is steering with a dictionary entry, but the underlying operation is identical.

What the technique cannot do: it is not surgery. Injecting a concept direction at a single layer sends a diffuse signal forward that the rest of the network has to do something with, and what it does is not always coherent or predictable. Steer with too large a coefficient and you get incoherence before you get insight. And like abliteration, steering answers "is this direction causally involved?" without telling you the mechanism: why the direction has the effect it has takes circuit-level work to answer.

Abliteration tests necessity: remove the direction, watch the behavior disappear. Steering tests sufficiency: inject the direction, watch the behavior appear. Together they bracket the causal role of a concept in a way neither can do alone.

Instrument 6: Abliteration -- pull the lever, watch the behavior fall over

The probing and SAE tools read the flow. Steering intervenes by adding to it; abliteration intervenes by taking something away, and the cleanest recent example comes from outside the labs entirely.

In the river picture: pull a specific rock out of the riverbed and watch how the flow reorganizes around the gap. If the behavior disappears with the rock, you've found what the rock was doing.

In 2024, Arditi et al. -- "Refusal in Language Models Is Mediated by a Single Direction" -- showed something startling: a chat model's entire refusal behavior, its whole trained tendency to say "I can't help with that," is governed to a remarkable degree by one direction in activation space. Find that direction, project it out of the residual stream at inference, and the model largely stops refusing -- without any retraining. The open-source community productized this almost immediately under the name "abliteration," and you can now find abliterated, refusal-removed versions of most open-weight models.
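"Project it out" is a one-line operation. The sketch below removes the component of every residual-stream vector along a given direction; the random vectors stand in for real activations and for a refusal direction that, in practice, would be estimated from contrastive prompts.

```python
import numpy as np

def ablate_direction(resid: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction` from each row of a (seq, d_model) array."""
    unit = direction / np.linalg.norm(direction)
    return resid - np.outer(resid @ unit, unit)

rng = np.random.default_rng(0)
resid = rng.normal(size=(6, 16))      # stand-in residual stream
refusal_dir = rng.normal(size=16)     # stand-in for an estimated refusal direction
ablated = ablate_direction(resid, refusal_dir)

# After ablation, nothing is left along the direction; everything orthogonal is untouched.
unit = refusal_dir / np.linalg.norm(refusal_dir)
leftover = ablated @ unit
```

Applying the same projection at every layer and position at inference is the crude form of the intervention. It needs no retraining because it only edits activations, or equivalently the weights that write into the residual stream.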

The most capable current implementation is Heretic (Weidmann, 2025), which wraps directional ablation in an automatic optimizer that tunes ablation parameters by co-minimizing two objectives: refusal rate and KL divergence from the original model. That second term is the thing naive abliteration ignores entirely. Applied to Gemma 3 12B, Heretic matches hand-crafted abliterations on refusal suppression at a KL divergence of 0.16 -- against 0.45–1.04 for its predecessors. It also replaces the constant-across-layers intervention with an optimizable weight kernel over depth, letting the optimizer put pressure where it needs to and ease off where it doesn't.
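The KL term can be illustrated generically: compare the original and modified models' next-token distributions on the same prompts and average the divergence. This is a sketch of the standard quantity only. How Heretic samples prompts, which positions it scores, and how it combines the two objectives are not specified here, and the function names are illustrative.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(original_logits: np.ndarray, modified_logits: np.ndarray) -> float:
    """Average KL(original || modified) over a batch of next-token distributions."""
    log_p = log_softmax(original_logits)
    log_q = log_softmax(modified_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

# Two-token vocabulary, one prompt: the original model is 50/50, the modified one 25/75.
original = np.log(np.array([[0.5, 0.5]]))
modified = np.log(np.array([[0.25, 0.75]]))
drift = mean_kl(original, modified)
no_drift = mean_kl(original, original)
```

A low value means the ablated model still predicts what the original would have on ordinary inputs, which is the property a refusal-only objective has no way to protect.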

Look at what a cruder abliteration does across depth -- one technique on gemma-2b, measured layer by layer -- and you get something I can only describe as a Death Star trench down the whole length of the model: a deep, consistent gouge through the activation geometry from early layers to late, well past where refusal is doing its actual causal work. The technique can't distinguish between "this is where refusal lives" and "this is where refusal happens to be measurable," and it treats both the same.

When you can knock out a whole behavior by deleting one direction, you've learned that the behavior was sitting on a single load-bearing channel. The fact that it's a destructive, brute-force intervention is exactly what makes it informative.

Instrument 7: Circuit-level methods -- tracing the wiring

The most ambitious instruments try to recover the actual computation: not just which concepts are present, but how the model combines them step by step. This is the circuits program, and its core technique is activation patching (Heimersheim & Nanda, 2024) and its relatives -- path patching, attribution patching (Syed et al., 2023), and causal scrubbing (Redwood Research, 2022). The move is counterfactual: run the model on input A, run it on input B, then copy a specific activation from the A-run into the B-run and see whether the output flips toward A. If it does, that component causally carries the thing that differs between A and B. It's the interpretability equivalent of a lesion study -- disable or swap a part, measure the deficit.
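The counterfactual move is easiest to see on a toy. The sketch below uses a hand-built two-layer network, not a transformer: run input A and input B, then copy one hidden unit at a time from the A-run into the B-run and measure how much of the output gap closes. The weights and the `run` helper are invented for illustration.

```python
import numpy as np

def run(x, W1, W2, patch=None):
    """Forward pass; `patch` is an optional (unit_index, value) that overwrites one hidden unit."""
    hidden = np.maximum(0.0, x @ W1)
    if patch is not None:
        idx, value = patch
        hidden = hidden.copy()
        hidden[idx] = value
    return hidden, float(hidden @ W2)

W1 = np.array([[1.0, 0.0, 0.5],
               [0.0, 1.0, 0.5]])   # 2 inputs -> 3 hidden units
W2 = np.array([2.0, 0.0, 1.0])     # the readout ignores hidden unit 1

x_a = np.array([1.0, 1.0])         # input A
x_b = np.array([0.0, 1.0])         # input B differs only in the first coordinate
hidden_a, out_a = run(x_a, W1, W2)
_, out_b = run(x_b, W1, W2)

# Patch each hidden unit from the A-run into the B-run, one at a time.
recovered = []
for unit in range(W1.shape[1]):
    _, out_patched = run(x_b, W1, W2, patch=(unit, hidden_a[unit]))
    recovered.append((out_patched - out_b) / (out_a - out_b))
# recovered is approximately [0.8, 0.0, 0.2]
```

Unit 0 carries most of the A-versus-B difference, unit 2 a little, and unit 1 none. In a real model the same loop runs over heads, MLP outputs, or residual positions, and the recovered fraction is what gets plotted.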

In the river picture: run two copies of the river side by side, copy a precise patch of flow from one into the other at one specific rock, and watch whether the second river's output shifts toward the first's. If it does, that patch was carrying the distinction you cared about.

The 2025 high-water mark is Anthropic's attribution-graphs work -- the methods paper and its case-study companion, "On the Biology of a Large Language Model". They build a "replacement model" out of cross-layer transcoders, then trace causal graphs whose nodes are interpretable features and whose edges are the linear effects between them -- producing something close to a readable wiring diagram for a specific behavior on a specific prompt. It's the most complete causal story the field has produced. It's also enormously labor-intensive, prompt-specific, and still rests on SAE-family components whose limitations I just covered. Powerful, and not yet push-button.

What the drawer looks like, taken together

Pull back and the toolkit lines up like this:

| Instrument | The question it answers | Where it goes quiet |
|---|---|---|
| Attention maps | which positions route to which | content; causality; whether it matters downstream |
| Linear probes | is the concept present and decodable here | whether the model uses it |
| Sparse autoencoders | what's the unsupervised feature vocabulary at this layer | absorption, dark matter, beating a plain probe |
| Natural language autoencoders | what's a faithful, self-checking description of this activation | false clarity, cost, no coverage guarantee |
| Activation steering | is this direction causally sufficient for this behavior | mechanism; coherence at large coefficients |
| Abliteration | is this behavior on a single causal lever | what the lever is, mechanistically |
| Circuit tracing | how do features wire together for this behavior | labor, prompt-specificity, SAE dependence |

Two things fall out of that table. The first is that there is no master instrument -- every row's right-hand column is non-empty, and the blind spot in one tool's row is usually exactly what some other tool measures. The competent move is not to find the one true method; it's to triangulate, and to be explicit about which blind spot you're standing in at any moment.

The second is the number that hangs over the whole enterprise. Anthropic's largest SAEs concede they capture "a small subset" of what the model knows, and extracting all of it would cost more compute than training the model did. The overwhelming majority of what these models compute is still dark matter. Geometrically coherent, persistent across layers, clearly doing something, and unnamed.

Every honest tool in the drawer carries that asterisk. The ones that don't carry it are the ones to distrust.

That's the actual state of the art in 2026: a field that built genuinely powerful instruments, had the maturity to publish when its flagship method underperformed a humble baseline, and is now staring at how much it still can't see. The interesting work is in that gap. The first requirement for doing it well is being willing to say the gap is there.


Further reading

Attention attribution
  • Jain & Wallace, Attention is not Explanation, NAACL 2019
  • Wiegreffe & Pinter, Attention is not not Explanation, EMNLP 2019
  • Abnar & Zuidema, Quantifying Attention Flow in Transformers (attention rollout), ACL 2020

Probes & the linear representation hypothesis
  • Park, Choe & Veitch, The Linear Representation Hypothesis and the Geometry of Large Language Models, 2024
  • Anthropic, Emotion concepts and their function in a large language model, Apr 2026

Sparse autoencoders
  • Anthropic, Towards Monosemanticity, Oct 2023 · Scaling Monosemanticity, May 2024
  • Elhage et al., Toy Models of Superposition, 2022
  • Gao et al. (OpenAI), Scaling and Evaluating Sparse Autoencoders, Jun 2024
  • Lieberum et al. (DeepMind), Gemma Scope, Aug 2024
  • Anthropic, Sparse Crosscoders for Cross-Layer Features and Model Diffing, Oct 2024
  • Chanin et al., A is for Absorption, Sep 2024 · Feature Hedging, May 2025
  • Bussmann et al., Matryoshka SAEs, Mar 2025
  • Engels et al., Decomposing the Dark Matter of Sparse Autoencoders, Oct 2024
  • Kantamneni et al., Are Sparse Autoencoders Useful?, Feb 2025
  • DeepMind, Negative results for SAEs on downstream tasks, and deprioritising SAE research, Mar 2025

Verbalization
  • Anthropic, Natural Language Autoencoders, May 2026

Directions & intervention
  • Turner et al., Activation Addition: Steering Language Models Without Optimization, 2023
  • Zou et al., Representation Engineering: A Top-Down Approach to AI Transparency, 2023
  • Arditi et al., Refusal in Language Models Is Mediated by a Single Direction, 2024

Circuits & causal methods
  • Heimersheim & Nanda, How to use and interpret activation patching, 2024
  • Syed et al., Attribution Patching Outperforms Automated Circuit Discovery, 2023
  • Redwood Research, Causal Scrubbing, 2022
  • Anthropic, Circuit Tracing: methods · On the Biology of a Large Language Model, 2025


James Henry writes about AI interpretability, security, and the economics of machine intelligence at waypoint.henrynet.ca.
