Enter The Concept Allocation Zone (CAZ)

May 26, 2026 — James Henry


My last post argued that mechanistic interpretability is the missing layer in AI security. Activation monitoring isn't in any of the major security frameworks. The model knows what it's building, but we're not watching it.

If you want to watch, you need to know where to look.

Two papers I've been working on dropped on arXiv today. Both address the measurement problem: how concepts form inside a transformer, and how to reliably extract a signal from the right layer.


The Concept Allocation Zone (arXiv:2605.24856) maps where in a transformer concepts form.

The naive version of activation monitoring picks a layer--usually somewhere in the middle--and reads activations from there. The CAZ work shows that concepts don't appear at a single layer; they emerge gradually across a contiguous region of the residual stream. I call that region the Concept Allocation Zone.
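To make "a contiguous region" concrete, here's a minimal sketch in plain Python. It is not the code from the paper and it doesn't use rosetta_tools. It assumes you already have per-layer activations for examples with and without the concept, it scores each layer with a Fisher-style separation along the difference-of-means direction (my stand-in metric), and it treats any contiguous run of layers above a fraction of the peak as a zone. The names `separation_score` and `find_zones` and the 0.5 cutoff are hypothetical, picked for illustration.

```python
from statistics import mean, pstdev


def separation_score(pos, neg):
    """Fisher-style separation of two activation sets at one layer.

    pos / neg: lists of activation vectors (lists of floats) for examples
    with and without the concept. Assumed metric, not the paper's.
    """
    dim = len(pos[0])
    mu_pos = [mean(v[i] for v in pos) for i in range(dim)]
    mu_neg = [mean(v[i] for v in neg) for i in range(dim)]
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0:
        return 0.0
    unit = [d / norm for d in direction]
    # Project every example onto the difference-of-means direction.
    proj_pos = [sum(x * u for x, u in zip(v, unit)) for v in pos]
    proj_neg = [sum(x * u for x, u in zip(v, unit)) for v in neg]
    spread = (pstdev(proj_pos) + pstdev(proj_neg)) / 2
    return (mean(proj_pos) - mean(proj_neg)) / (spread + 1e-8)


def find_zones(curve, frac=0.5):
    """Return (first_layer, last_layer) for each contiguous run of layers
    whose separation is at least frac * peak. The cutoff is an assumption."""
    threshold = frac * max(curve)
    zones, start = [], None
    for layer, score in enumerate(curve):
        if score >= threshold and start is None:
            start = layer
        elif score < threshold and start is not None:
            zones.append((start, layer - 1))
            start = None
    if start is not None:
        zones.append((start, len(curve) - 1))
    return zones


# Toy separation curve over 10 layers with two bumps.
curve = [0.1, 0.2, 0.9, 1.0, 0.8, 0.2, 0.1, 0.6, 0.7, 0.1]
print(find_zones(curve))
```

A curve with more than one bump yields more than one zone, so a single "middle layer" pick can land between them. A fraction-of-peak cutoff is also crude: anything well below the peak is ignored by construction.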

Scored across 34 models and 8 architectural families, the separation curve is frequently multimodal. There are also "gentle CAZes"--subtle allocation regions that standard detection misses entirely, but that are causally active in 93–100% of ablation trials. They don't look like anything is happening. Something is happening.

If you're probing the wrong layer, you're not getting the signal. You're getting noise.


Geometric Evolution Maps (arXiv:2605.25848) solves the extraction problem.

Concept representations rotate as they move through transformer layers. Fixed-layer probing is unreliable because the representation at layer N points in a different direction than it does at layer N+5. GEM tracks that rotation, identifies a "handoff layer" where the representation stabilizes before it reaches the next attention block, and extracts probes there.
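Here's a rough sketch of what tracking rotation can look like, again in plain Python and again not the paper's implementation. It assumes you have one concept direction per layer (for example, a difference-of-means vector), measures rotation as the cosine similarity between consecutive layers, and calls the first layer that begins a stable run the handoff layer. The function names, the 0.95 cutoff, and the two-step window are all hypothetical choices of mine.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rotation_profile(directions):
    """Cosine between each layer's direction and the next layer's.

    Values near 1.0 mean the direction has stopped rotating.
    """
    return [cosine(directions[i], directions[i + 1])
            for i in range(len(directions) - 1)]


def handoff_layer(directions, min_cos=0.95, window=2):
    """First layer whose next `window` steps all stay above min_cos.

    Returns None if the direction never settles. This stability rule is an
    assumed stand-in for whatever criterion the paper actually uses.
    """
    profile = rotation_profile(directions)
    for layer in range(len(profile) - window + 1):
        if all(c >= min_cos for c in profile[layer:layer + window]):
            return layer
    return None


# Toy example: a 2-D direction that swings 80 degrees, then settles.
angles = [0, 40, 80, 85, 86, 87]
directions = [[math.cos(math.radians(a)), math.sin(math.radians(a))]
              for a in angles]
print(handoff_layer(directions))
```

In the toy example the direction is still swinging through the first two layers and settles from layer 2 onward, so that's where this rule would place the probe.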

Tested across 23 architectures (70M to 14B parameters) and 17 concept types: GEM-extracted probes matched or exceeded peak-layer probe performance in 68.5% of trials. MHA models showed stronger handoff effects (78.3%) than GQA models (47.1%).

A probe extracted at the handoff layer is more stable than one extracted at the traditional peak layer, because you're catching the representation after the model has committed to a direction--not mid-rotation.


Both papers use rosetta_tools v1.3.1. Both are listed at research.henrynet.ca: The Concept Allocation Zone and Geometric Evolution Maps.

Reliable, architecture-consistent probes--knowing where to look and how to extract a stable signal from there--are the prerequisite for the security layer I described last week. That's the next post.


James Henry is a Senior Security Consultant and independent AI/ML interpretability researcher. He writes about security, AI, the gaps between them, and what the AI is actually thinking at any given layer. He publishes at waypoint.henrynet.ca.
