The Model Is Thinking. We're Not Watching.
In the mid-1990s, network security was perimeter security. You built a firewall, drew a line between inside and outside, and trusted what was on the inside. It was the right response to the threat model of the time: attackers were mostly coming in from outside, and blocking the door kept most of them out.
Then organizations started getting breached by attackers who were already inside. The perimeter held. The threat got through anyway--through a phishing email, a stolen credential, a trusted vendor. Inside the firewall, there was nothing watching. Firewalls inspect what crosses the boundary; they don't watch what the process does once it's running.
The answer was endpoint detection and response: EDR. Not a better perimeter. A different layer entirely--one that watches internal state, behavioral signals, what the process is actually doing at runtime. The shift took a decade to complete and is still ongoing.
We are building AI security the same way we built network security in 1995.
The major enterprise AI security frameworks of 2025 and 2026 are remarkably consistent. OWASP's Top 10 for LLM Applications, Microsoft's defense-in-depth model for autonomous agents, Red Hat's six-layer agent stack, IBM's runtime security framework, Cisco's AI Defense, Salt Security's agentic security model--they converge on the same architecture:
- Supply chain (model provenance, dependency integrity, trust in your model vendor)
- Network telemetry and observability (OTEL, traffic analysis, logging)
- Identity and authentication (agent identity, workload credentials, least-privilege authorization)
- Input perimeter controls (prompt injection detection, syntactic filtering, LLM-as-judge)
- Agent harness and tool governance
- Internal model state--absent across all of them
- Output perimeter controls (response analysis, LLM-as-judge)
- Infrastructure sandboxing
Syntactic filtering--blocking on words or phrases, or even using another LLM as a judge to analyze the context itself--is absolutely solid, and should be as foundational in AI security as the firewall is.
None of the frameworks above, however, includes the model's internal state as a defendable layer. The model itself is treated as an opaque, probabilistic engine that you wrap and govern. You inspect what goes in, you inspect what comes out. The stance seems to be that you can't inspect what happens in between. We hear in the news that AI is a 'black box' and that we don't know how it does what it does. But we can see what's going on inside LLMs; it's just hard.
For a chatbot with no tool access, opacity is tolerable--the threat model is narrow, and perimeter controls cover most of it.
For an autonomous agent running in production--taking actions, executing code, delegating to subagents, handling sensitive data--the threat model is different, and the gap is real and growing.
Mechanistic interpretability (MI) is the technique that would fill that gap. The name is awkward; the idea is not complicated.
Modern LLMs process information as vectors in high-dimensional space, through layers of matrix operations. Each layer produces what's called an activation--a representation of what the model has processed so far. By the time the model produces an output, it has passed through dozens of these layers. The activations at each layer encode, in compressed form, something about the model's internal state: what concepts are active, what direction the computation is heading, what the model is, loosely speaking, "thinking about."
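To make "an activation at each layer" concrete, here is a toy sketch. It is not a transformer: there is no attention or residual stream, and the weights are random. The names `make_layer` and `forward_with_activations` are made up for illustration. The point is only that every layer leaves behind a vector you can keep and inspect.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in weights); a stand-in for trained weights."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
            for _ in range(n_out)]

def forward_with_activations(x, layers):
    """Run x through every layer and keep each intermediate vector."""
    activations = []
    h = x
    for weights in layers:
        # One matrix-vector product followed by a tanh nonlinearity.
        h = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in weights]
        activations.append(h)  # this vector is the layer's "activation"
    return h, activations

rng = random.Random(0)
layers = [make_layer(8, 8, rng) for _ in range(4)]  # toy size: 4 layers, width 8
x = [rng.gauss(0.0, 1.0) for _ in range(8)]         # toy input vector
output, activations = forward_with_activations(x, layers)

for i, act in enumerate(activations):
    print(f"layer {i}: {len(act)} values, first = {act[0]:+.3f}")
```

In a real model the same idea applies at far larger scale: the intermediate vectors exist whether or not anyone records them.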
MI researchers have found that you can train lightweight classifiers--probes--to read those activations. Ask a probe: is the model in a jailbreak-adjacent state right now? Is the refusal direction intact? Does this activation pattern match our baseline for "cooperating" or does it look more like "complying while planning otherwise"? The probes are not infallible, but they are answering questions that output monitoring cannot answer at all.
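A probe of this kind can be as small as a logistic regression. The sketch below is a minimal illustration on synthetic data, not any vendor's or paper's method: the "activations" are random vectors shifted along a made-up direction depending on the label, and `train_probe` and `probe_score` are hypothetical names. Real probes are trained on activations captured from an actual model, with labels that are much harder to get.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def probe_score(x, w, b):
    """Probability the probe assigns to the 'flagged' class for one activation."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(X, y, epochs=200, lr=0.1):
    """Fit a linear probe with plain batch gradient descent on logistic loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = probe_score(x, w, b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic stand-in for captured activations (assumption: the two classes
# differ by a shift along one hidden direction).
rng = random.Random(0)
DIM = 16
direction = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

def synthetic_activation(label):
    shift = 1.0 if label else -1.0
    return [rng.gauss(0.0, 1.0) + shift * d for d in direction]

labels = [i % 2 for i in range(200)]
X = [synthetic_activation(label) for label in labels]

w, b = train_probe(X[:150], labels[:150])
held_out = list(zip(X[150:], labels[150:]))
accuracy = sum((probe_score(x, w, b) > 0.5) == bool(label)
               for x, label in held_out) / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

The synthetic classes here are easy to separate by construction. How well a probe separates real internal states is an empirical question for each model and each behavior.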
Consider an attacker running sixteen variations of a jailbreak inside one session. The input perimeter allows all sixteen requests; the output perimeter sees fifteen RLHF refusals and one success. MI sees something different: an internal state drifting toward compliance across all sixteen attempts, the RLHF suppression holding on the output while the activation pattern shows the refusal boundary being mapped. The activation pattern at attempt ten encodes the accumulated context of the previous nine--the refusals are not independent events from the model's perspective. The perimeter sees fifteen independent refusals, each of which passed through it without incident. The activations see one continuous attack.
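Here is a rough sketch of what watching that drift could look like. Everything in it is assumed for illustration: the two-dimensional "activations" are synthetic, the `refusal_direction` is invented, and `SessionDriftMonitor` is a hypothetical name. It tracks each turn's projection onto the refusal direction and alerts when the reading has fallen far enough from where the session started, regardless of what the outputs said.

```python
from dataclasses import dataclass, field

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

@dataclass
class SessionDriftMonitor:
    refusal_direction: list           # assumed to be known in advance
    drop_threshold: float             # hypothetical tuning value
    projections: list = field(default_factory=list)

    def observe(self, activation):
        """Record one turn; return True once the projection has dropped
        by at least drop_threshold since the first turn of the session."""
        self.projections.append(dot(activation, self.refusal_direction))
        return self.projections[0] - self.projections[-1] >= self.drop_threshold

monitor = SessionDriftMonitor(refusal_direction=[1.0, 0.0], drop_threshold=0.48)
outputs = ["refusal"] * 15 + ["success"]  # what the output perimeter sees

first_alert = None
for attempt, output in enumerate(outputs, start=1):
    # Synthetic drift: the refusal component weakens a little on every attempt.
    activation = [1.0 - 0.05 * (attempt - 1), 0.2]
    if monitor.observe(activation) and first_alert is None:
        first_alert = attempt

print(f"first drift alert at attempt {first_alert}, "
      f"output at that point: {outputs[first_alert - 1]}")
```

With these made-up numbers the alert fires at attempt eleven, while the output is still a refusal. A production monitor would need a sturdier statistic than first-versus-latest, but the shape of the signal is the same.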
And none of those fifteen refusals generated a SOC alert. The RLHF held--silently. The model has no mechanism to tell your security team that someone just spent an hour probing it. That signal exists in the activations. It doesn't exist anywhere else.
The academic foundation is solid and recent. HiddenDetect (2025) applies hidden-state analysis to detect jailbreaks in vision-language models without fine-tuning. Probing Latent Subspaces in LLM for AI Security (Singapore HTX, 2025) frames activation-level probing as a preemptive, model-agnostic defense. SAID (Self-Activating Internal Defense) activates intrinsic safety mechanisms through internal probing. LUMIA (2024) uses linear probes layer-by-layer to detect membership inference attacks. Anthropic's interpretability team--sparse autoencoders, circuit tracing, steering vectors--is the foundational work most of this builds on. DeepMind contributed significantly early--their Gemma Scope tooling opened SAE-based interpretability to the broader research community--though in 2025 the team stepped back from full mechanistic reverse-engineering toward what they call "pragmatic interpretability." Whether that signals the current methods hitting a wall or the program itself reaching its limit is an open question.
Commercially, the category is two companies thin. Starseer (Knoxville; founded 2024; funded by Gula Tech Adventures; advisory board includes Rob Joyce, former NSA Cybersecurity Director) built their product line explicitly around interpretability as a security primitive. Their tagline: "Most AI security tools observe outputs and infer intent. We look inside." CEO Tim Schulz, who ran Verizon's AI Red Team and built SCYTHE, describes his entry into MI as going looking for "the AI equivalent of Sysmon or EDR telemetry" and finding mechanistic interpretability. Realm Labs--RSAC 2026 Innovation Sandbox finalist, led by Saurabh Shintre--describes their product as identifying "regions in the LLM where harmful information is stored and monitoring when user queries cause the model to access these regions." Different framing; same layer.
No traditional security vendor--Palo Alto, CrowdStrike, Microsoft, Cisco, SentinelOne--currently ships an activation-monitoring product, as far as my research has found.
Consider a system that is building, across many separate agentic calls, an exfiltration capability that is not plainly visible in the context, input, or output of any single model run. Perimeter checks will not key on this. But in the aggregate, the model knows what it's building toward. It has to: designing and managing an exfiltration capability requires knowing that this is what is being built.
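One way to picture the "in the aggregate" part is to accumulate evidence across calls rather than judging each call alone. The sketch below uses a one-sided CUSUM, a standard change-detection statistic, over per-call probe scores. The scores, the baseline, and both thresholds are invented for illustration. No call looks alarming on its own, but the running total does.

```python
def cusum_alerts(scores, baseline=0.3, limit=1.0):
    """One-sided CUSUM: accumulate how far each score sits above the
    baseline, never letting the running total fall below zero."""
    total = 0.0
    alerts = []
    for score in scores:
        total = max(0.0, total + (score - baseline))
        alerts.append(total >= limit)
    return alerts

PER_CALL_THRESHOLD = 0.8  # hypothetical single-call alert level

# Hypothetical probe scores for seven separate agent calls.
scores = [0.25, 0.55, 0.6, 0.5, 0.65, 0.6, 0.55]

per_call = [s >= PER_CALL_THRESHOLD for s in scores]
aggregate = cusum_alerts(scores)

print("any single-call alert:", any(per_call))
print("first aggregate alert at call:", aggregate.index(True) + 1)
```

With these numbers no single call crosses the per-call threshold, and the aggregate alert fires on the fifth call.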
Input and output (perimeter) monitoring doesn't catch 'social engineering' of the LLM. It catches what was said to the model and what the model said, not what it was computing toward. In a world of agents running in reinforcing loops producing thousands of lines of code unchecked except by other agents, what the model is thinking is paramount.
The honest caveat before the argument runs too far: activation monitoring can be sidestepped. Research is ongoing into exactly how--and how reliably--but defense in depth is the whole point of a layered stack; being able to evade a single layer should not invalidate the entire security stack.
Regardless, without monitoring the model's internal state, you are blind to a specific class of threat: behavior that is not directly visible in the inputs or outputs but is already present in the internal state.
What's different now is that what the models are doing keeps changing. In 2023, the threat model for a jailbroken LLM was that it would say something it wasn't supposed to say. In 2026, the threat model for a compromised autonomous agent is that it will do something it wasn't supposed to do--take an action, access a resource, exfiltrate data, escalate a privilege--while its outputs remain cooperative and legible throughout.
The perimeter controls were designed for the first threat model. The second one requires a different layer.
The EDR moment for AI security is arriving. The question is whether the enterprise stack is ready when it gets here.
James Henry is a Senior Security Consultant and independent AI/ML interpretability researcher. He writes about security, AI, the gaps between them, and what the AI is actually thinking at any given layer. He publishes at waypoint.henrynet.ca.