The Whole Stack: The State of the Art in AI Security

June 8, 2026 — James Henry


I've spent the last few weeks writing about one layer of AI security--the internal state of the model itself. I want to step back and do the opposite. I want to lay out the whole stack: everything a serious 2026 AI security program actually contains, from the regex that fires in a hundredth of a millisecond to the activation probe that reads intent off the residual stream. Input boundaries, output monitoring, lifecycle hooks, policy engines, OpenTelemetry, LLM-as-judge, supply chain, the works.

The current security controls are genuinely pretty good--if you've implemented them. The frameworks exist. OWASP has an LLM Top 10 and an MCP Top 10. Microsoft, Red Hat, IBM, Cisco, Salt Security--they all publish agentic-AI security guidance. The gap in 2026 isn't that we don't know how to secure AI. It's that the controls shipped faster than the security teams engaged with them.


The mental model: agents are a new execution layer

The prevailing model treats AI as "a smarter search bar" or "a dev tool." That model is dangerously incomplete.

The moment agents started doing real work, the entire face of AI security changed--and that change only happened in late 2025. If you haven't built and deployed new, capable, tested, and functioning security frameworks for the agentic AI ecosystem, you're running last-generation security in a next-generation threat environment.

A modern agentic system plans and adapts, selects its own tools, recovers from its own failures, and takes real actions--writes files, runs shell commands, calls APIs, spawns child processes, modifies infrastructure. It does this at machine speed: a single session can emit hundreds of tool calls in minutes. When Snowflake's security team first deployed AI governance, their lesson was blunt: "We were aiming for controlled execution; it spawned child processes." (Ragini Ramalingam, Director at Snowflake, Enterprise AI Governance at Snowflake, [un]prompted 2026.)

A framework I've worked with formalizes this as a five-level Agency Spectrum, and it's a handy chart to have:

Level  Name             Example                                      Control need
0      Minimal          Chatbots, copilots                           Very low
1      Basic            Multi-step workflows with LLM branching      Low
2      Moderate         Tool-calling agents (n8n, LangChain)         Medium
3      Advanced         Coding agents (Claude Code, Copilot, Cline)  High
4      Self-sufficient  Systems that define their own objectives     Very high

The whole discipline reduces to one sentence: controls must scale with autonomy. The best organizations apply Level 3 controls, and I'd be honestly impressed if more than a handful of non-AI tech companies worldwide are at Level 4.

With that frame in place, here's the stack--outside in.


Layer 1: The input boundary

Prompt injection is the SQL injection of the AI era, and the defense looks a lot like the defenses we built for SQL injection: layered, cheap-first, expensive-last.

The mature pattern is a four-stage cascade, each stage catching what the last one missed, ordered by cost:

Stage  Method                 Speed          Catches
1      String / regex         ~0.01 ms/page  "Ignore all previous instructions"
2      Semantic similarity    ~12 ms/page    Paraphrased / indirect injection
3      Fine-tuned classifier  ~111 ms/page   Novel attack patterns
4      LLM-as-judge           Slowest        Sophisticated multi-turn attacks

The economics of this ordering are instructive: take massive volume and pare it down until you're only handling the hard stuff with the expensive gear. SYARA (Semantic YARA, open-source--presented at [un]prompted 2026) is the canonical example: traditional YARA catches the literal string "ignore all previous instructions" but misses "let's start fresh with a new set of guidelines"--same intent, different vocabulary. SYARA uses embedding similarity to catch the paraphrase. And when you layer cheap detection in front of expensive LLM evaluation, the numbers move hard: one published phishing-detection pipeline went from $750 to $13.50 per case, and from 12.5 hours to 14 minutes, purely by not sending everything to the judge.

That's the right way to think about LLM-as-judge: it is the most powerful and most expensive stage, so you earn the right to use it by filtering ruthlessly upstream. Run it on the 5% of traffic that survives stages 1–3, not the 100%.
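The cheap-first ordering can be sketched in a few lines of Python. Everything named here is hypothetical: the single regex, the reference phrases, and the token-overlap check that stands in for real embedding similarity (it is not how SYARA works). The classifier stage is left out for brevity, and the judge is a stub that only records what reaches it.

```python
import re
from typing import Callable, List, Optional, Tuple

# Stage 1: one illustrative pattern; a real rule set would carry many more.
OVERRIDE_RE = re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE)

# Stage 2 stand-in: known paraphrases compared by token overlap (Jaccard).
# A production system would compare embeddings instead.
KNOWN_PARAPHRASES = ["start fresh with new guidelines", "disregard your earlier rules"]


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))


def regex_stage(text: str) -> bool:
    return OVERRIDE_RE.search(text) is not None


def semantic_stage(text: str, threshold: float = 0.5) -> bool:
    words = _tokens(text)
    for phrase in KNOWN_PARAPHRASES:
        ref = _tokens(phrase)
        union = words | ref
        if union and len(words & ref) / len(union) >= threshold:
            return True
    return False


# Stage 4 stub: records every input that survives the cheaper stages.
judge_calls: List[str] = []


def judge_stage(text: str) -> bool:
    judge_calls.append(text)
    return False  # a real judge would call an LLM here


STAGES: List[Tuple[str, Callable[[str], bool]]] = [
    ("regex", regex_stage),
    ("semantic", semantic_stage),
    ("judge", judge_stage),
]


def run_cascade(text: str, stages=STAGES) -> Optional[str]:
    """Return the name of the first stage that flags the text, else None."""
    for name, check in stages:
        if check(text):
            return name  # stop early: later, costlier stages never run
    return None
```

The early return is the whole point: the judge only sees inputs that every cheaper stage passed, so `judge_calls` stays short while the regex absorbs the bulk of the volume.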

On the back of the cascade sits risk scoring--a 0–100 scale with automated response tiers:

  • ≥ 70 → BLOCK. Deny, log a critical event.
  • 50–69 → ALERT. Log, escalate to the security team.
  • < 50 → LOG. Record, allow.
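The tiers above map directly to a small function. In this sketch the thresholds come from the list; the pattern names and their weights are invented for illustration, as is the choice to sum weights and cap at 100, since the scoring formula itself is not specified here.

```python
from typing import Iterable

# Hypothetical weights per detection pattern (assumption, not from any product).
PATTERN_WEIGHTS = {
    "instruction_override": 40,
    "role_manipulation": 30,
    "exfiltration_directive": 50,
    "encoding_obfuscation": 20,
    "path_traversal": 30,
}


def risk_score(matched_patterns: Iterable[str]) -> int:
    """Sum the weights of the distinct matched patterns, capped at 100."""
    total = sum(PATTERN_WEIGHTS.get(p, 0) for p in set(matched_patterns))
    return min(total, 100)


def response_tier(score: int) -> str:
    """Map a 0-100 risk score to the automated response tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 70:
        return "BLOCK"  # deny, log a critical event
    if score >= 50:
        return "ALERT"  # log, escalate to the security team
    return "LOG"        # record, allow
```

For example, `response_tier(risk_score(["role_manipulation", "encoding_obfuscation"]))` lands in the ALERT tier under these made-up weights.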

Twenty-plus production detection patterns feed this: instruction override, role manipulation ("you are now an admin"), exfiltration directives, code-execution attempts, encoding/obfuscation, path traversal. None of this is exotic. All of it is buildable today, and most of it is one config project away in tools you already own. Open source options for stages 2–3 are mature: NVIDIA NeMo Guardrails for programmable dialogue-level rails, LLM Guard (Protect AI) for modular prompt and response scanning, and Meta's LlamaFirewall for intercepting prompts and MCP calls. Guardrails AI handles structured output validation if your agent must return machine-readable formats.

On LLM-as-Judge, I can't help but ask the famous turtle question--using LLM-as-Judge, how many layers deep do the LLMs go? They go all the way down.


Layer 2: Output monitoring

The mirror image of the input boundary, and the most mature part of the whole stack because it's the part that looks most like content moderation, which we've been doing for a decade.

Output monitoring scans completions for policy violations--leaked secrets, PII, toxic content, and increasingly vulnerable code, which is its own category now (more on that under supply chain). Google Model Armor and every other major guardrail product live here. The same open source tools from Layer 1--NeMo Guardrails, LLM Guard--run on the output side too; the pipeline is symmetric. This layer is well understood and widely deployed.

It also has a structural weakness: it only sees what the model decided to say. If the model reasons toward something harmful and then its training causes it to refuse, the output monitor sees a clean refusal and records nothing. That refusal is real, and it's good--but from a security-telemetry standpoint, the fact that the question was asked has just been thrown away. SOC managers of the world, think on that one.


Layer 3: The harness and the control plane (hooks + policy engines)

This is the layer that has changed the most in the last year, and one that some security teams don't know exists.

Every major coding agent now exposes lifecycle hooks. This is the intervention surface--the place where governance stops being a PDF and becomes operational:

Agent        Hooks
Claude Code  PreToolUse, PostToolUse, SessionStart/End, UserPromptSubmit, PermissionRequest
Cursor       beforeShellExecution, afterFileEdit, sessionStart/End
Gemini CLI   BeforeToolSelection, BeforeModel, sessionStart/End

A hook is a point at which your code runs before the agent's action does. The PreToolUse hook is the critical one: it fires with the tool-call intent in hand, before execution, and its exit code can allow or deny. That's a control plane. If you're not using it, you don't have one.
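As a rough illustration of that allow/deny mechanic, here is a minimal hook handler in Python. It assumes the agent passes the pending tool call as JSON on stdin with `tool_name` and `tool_input` fields, and treats exit code 2 as deny, which matches how Claude Code documents its hooks at the time of writing; check your own agent's documentation before relying on either. The deny markers are hypothetical examples, not a recommended rule set.

```python
import json
import sys
from typing import Tuple

ALLOW, DENY = 0, 2  # assumption: exit code 2 blocks the tool call

# Hypothetical markers; a real deployment would query a policy engine instead.
DENY_MARKERS = (".aws/credentials", "webhook.site", "api.telegram.org")


def decide(event: dict) -> Tuple[int, str]:
    """Return (exit_code, reason) for one tool-call event."""
    if event.get("tool_name") != "Bash":
        return ALLOW, ""  # this sketch only inspects shell commands
    command = (event.get("tool_input") or {}).get("command", "")
    for marker in DENY_MARKERS:
        if marker in command:
            return DENY, f"blocked by policy: command references {marker!r}"
    return ALLOW, ""


def main(stream=sys.stdin) -> int:
    """Read one event, print any denial reason to stderr, return the exit code."""
    try:
        event = json.load(stream)
    except json.JSONDecodeError:
        event = None
    if not isinstance(event, dict):
        # Fail closed: an unreadable event is denied rather than waved through.
        print("hook received malformed input; denying", file=sys.stderr)
        return DENY
    code, reason = decide(event)
    if reason:
        print(reason, file=sys.stderr)
    return code

# When saved as a standalone hook script, finish with: sys.exit(main())
```

Failing closed on malformed input is a deliberate choice in this sketch; a hook that allows whatever it cannot parse is advisory, not deterministic.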

The reference architecture--formalized by Sondera and presented by Matt Maisel at [un]prompted 2026--wires a policy engine onto that hook:

   User Prompt
       │
   [Agent Loop]
       │
   Tool Call Intent
       │
   [HOOK: PreToolUse] ──► Policy Engine (Cedar / OPA / OpenFGA)
       │                       │
       │                  Allow / Deny
       │                       │
   Tool Execution ◄──── [Decision]
       │
   [HOOK: PostToolUse] ──► Audit / Telemetry
       │
   Response to User

Cedar (AWS's authorization language) is the common choice--analyzable, fast, expressive--but OPA/Rego, OpenFGA, Topaz, and Permit.io all fit the slot. The engine sits between the agent and its tools and returns a decision before anything executes. Deterministic. Auditable. Every decision logged. (My own toy implementation of this pattern is SARK--Secure Autonomous Resource Kontroller--a gateway that runs policy checks before any tool call reaches a resource.)

Cedar and its peers answer the delegation question from a central arbiter. Niki Niyikiza (Snap) presented the complementary approach at [un]prompted 2026--capability-based authorization, now open-sourced as Tenuo and progressing through IETF standardization. Where Cedar runs a central policy query at the hook, capability tokens travel with the agent: cryptographically signed warrants scoping what tools an agent can call, under what constraints, and for how long, with authority that only narrows as it's delegated down the chain (control plane → orchestrator → worker). The two aren't competitors; they're the two halves of agent authorization. A serious control plane wants both.
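The "authority only narrows" property is easy to show in miniature. The sketch below is not Tenuo's format or API; it is a toy warrant signed with a shared HMAC key, where a single trusted issuer re-signs each delegation. Real capability systems typically use asymmetric signatures or chained MACs so holders can attenuate without the root key. All function and field names are illustrative.

```python
import hashlib
import hmac
import json
from typing import Iterable

SECRET = b"demo-signing-key"  # hypothetical; never hard-code a real key


def _sign(body: dict) -> str:
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue(tools: Iterable[str], expires_at: int) -> dict:
    """Mint a signed warrant scoping allowed tools and an expiry time."""
    body = {"tools": sorted(set(tools)), "exp": expires_at}
    return {"body": body, "sig": _sign(body)}


def verify(warrant: dict) -> bool:
    return hmac.compare_digest(warrant["sig"], _sign(warrant["body"]))


def attenuate(parent: dict, tools: Iterable[str], expires_at: int) -> dict:
    """Derive a child warrant that is never broader than its parent."""
    if not verify(parent):
        raise PermissionError("parent warrant signature is invalid")
    if not set(tools) <= set(parent["body"]["tools"]):
        raise PermissionError("child warrant cannot add tools")
    if expires_at > parent["body"]["exp"]:
        raise PermissionError("child warrant cannot outlive its parent")
    return issue(tools, expires_at)


def allows(warrant: dict, tool: str, now: int) -> bool:
    """Check signature, scope, and expiry for a single tool call."""
    body = warrant["body"]
    return verify(warrant) and tool in body["tools"] and now < body["exp"]
```

A control plane could mint a broad warrant for an orchestrator, which then hands a worker only `read_file` for a shorter window; any attempt to widen scope or lifetime on the way down raises instead of succeeding.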

This delegation architecture also carries acute legal exposure that most teams haven't mapped yet. Modern data protection frameworks--GDPR, CCPA--assume a traceable human authority behind decisions. "Legitimate interest" as a lawful basis for data processing faces serious legal scrutiny when the processing decision is fully delegated to an autonomous model operating on broad delegation. If you cannot demonstrate that the agent acted within a clearly defined, explicitly granted, user-confirmed scope, accountability and liability default to you. This is the driver behind Agent Passports--giving every agent a unique, cryptographically authenticated identity, separate from the user it acts for and decoupled from their active session. As Cochran puts it (see Layer 4): agents without their own identity borrow yours. Under statutes like the Computer Fraud and Abuse Act (CFAA), that borrowing stops being an audit problem and becomes a criminal exposure if the agent exceeds authorized access on external systems.

The EU AI Act's strict high-risk system mandates taking effect August 2026 add another deadline. Autonomous workflows are squarely in scope. Enterprises will need detailed technical documentation, open-loop telemetry, and--critically under the human-oversight provisions of Article 14(4)(e)--an active immediate-stop/override mechanism allowing a human supervisor to halt a runaway agent instantly. The control plane isn't just good security practice; it's becoming a compliance requirement with a hard date.

The threat model this defends against has three axes--sandboxed vs. unsandboxed, privileged vs. unprivileged, public vs. private--and the dangerous corner is the "lethal trifecta": sandboxed + privileged + public, which describes most customer-facing AI deployments. Andrew Bullen (AI Security Lead, Stripe) framed this precisely at [un]prompted 2026, and the principle he landed on is worth tattooing on the wall: vulnerabilities live in the scaffold (the agent); you contain them in the harness (the control layer). You will never make the model perfectly safe. You can make the harness around it enforce policy regardless.

There's an important distinction to keep in mind whenever you're dealing with Generative AI: some controls are advisory and some are deterministic. A CLAUDE.md instruction the model reads and interprets is advisory--it shapes behavior, it doesn't guarantee it. (I went down this rabbit hole firsthand in Pay No Mind to that Guardrail Behind the Curtain.) A hook exit code that blocks a call is deterministic. Both are valuable. But your catalog has to label which is which, because a security control you believe is enforcing when it's only suggesting is worse than no control at all.


Layer 4: Observability (OpenTelemetry and the intent-attribution problem)

When a Level 2+ agent runs a shell command, it executes under the same PID, the same user, the same command line as a human action. Your SIEM can't tell the difference. Your EDR can't tell the difference. Your audit logs can't tell the difference. Mika Ayenson (Team Lead, Threat Research & Detection Engineering, Elastic) documented this precisely at [un]prompted 2026--real production signals from AI coding tools in the wild: agents reading credentials.db, logins.json, and browser cookies; persistence via LaunchAgents; connections to webhook.site, api.telegram.org, and raw IPs. With 85%+ of developers now using AI coding tools, every one of those tools spawns shells and makes network calls indistinguishable from human activity.

If you can't attribute intent, you can't govern. If you can't govern, you can't secure.

Chris Cochran (Field CISO & VP of AI Security, SANS Institute) named the root cause at [un]prompted 2026: "Identity systems were designed for humans, not autonomous AI. Agents are decision-making workloads, not users or service accounts." When commands run under human PIDs, it's because the agent borrowed a human identity--because nobody gave it one of its own. "If you don't give your agents a secure identity, they will borrow yours--and the audit logs may not be kind."

The fix is OpenTelemetry, and the good news is the instrumentation already exists. Claude Code ships OTel natively--gen_ai.* semantic conventions, with metrics and event types covering token use, tool calls, and lifecycle events. Anthropic Enterprise exports it. The observability maturity model has three stages:

  • Stage 1--Visibility. You know what AI is running. DNS resolution to LLM endpoints, process-ancestry tracking, tool inventory. Most organizations are here.
  • Stage 2--Guardrails. Active detection rules for AI-driven activity, network controls on LLM API traffic, MCP server discovery, process-ancestry rules distinguishing agent from human. Where you should be.
  • Stage 3--Full observability. Intent attribution via gen_ai.*, lifecycle hooks feeding telemetry, APM correlation linking agent actions to business outcomes, real-time anomaly detection. Where the industry is heading.

And the leverage here is enormous because it's a configuration project, not a procurement cycle. The agentic-AI defense stack is not a separate purchase--it's your existing stack, turned on:

  • EDR--enable process-ancestry telemetry so you can reconstruct which agent spawned which shell. Most major platforms (CrowdStrike, Elastic, SentinelOne, Microsoft Defender) support this today; it's a configuration step, not a new product. Elastic specifically publishes a Domain: LLM detection-rule set and a genai protections-artifacts repo tuned to agent behavior--worth subscribing to regardless of what else you run.
  • Secure web gateway--tag LLM endpoints as a distinct egress category, build an inventory of what's calling out, and run outbound DLP on prompt content. Prompt content is sensitive business data. Treat it that way in your gateway policy before an incident makes the argument for you. Commercial options: Zscaler, Netskope, Palo Alto Prisma, Cisco Umbrella. Open source: Envoy AI Gateway is purpose-built for LLM traffic management--rate limiting, policy control, and egress visibility on the proxy layer.
  • SIEM--ingest the OTel gen_ai.* stream so agent telemetry sits in the same searchable plane as endpoint and network data. Whatever platform you're on--Splunk, Microsoft Sentinel, Google SecOps, IBM QRadar--the detection rules you need aren't exotic: agent process spawning child shells, agent-issued credential-store reads, agent-initiated connections to webhook aggregators and raw IPs. Write them.
  • DLP, CASB, IAM/PAM--extend DLP to prompt and completion flows; Qohash is doing work on data discovery and classification that applies directly to prompt content, and Nightfall AI is specifically built for AI-native DLP. Use your CASB's AI-tool inventory to surface shadow AI before it surfaces in an incident. For agent identity, give agents their own least-privilege identities with just-in-time credential grants--not standing tokens on shared service accounts. Open source options here are strong: HashiCorp Vault for secrets management and dynamic credentials, Teleport for ephemeral cryptographic identity that replaces static credentials entirely. The generic-credential problem didn't go away; it just moved.
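One of the detection rules named in the SIEM bullet, an agent process spawning a child shell, comes down to walking process ancestry. A minimal sketch, assuming you already have a process table keyed by PID; the table format and the agent and shell process names are hypothetical, and a real rule would live in your EDR or SIEM query language rather than in Python.

```python
from typing import Dict, List, Tuple

AGENT_NAMES = {"claude", "cursor", "gemini"}  # hypothetical process names
SHELL_NAMES = {"sh", "bash", "zsh"}

# pid -> (parent pid, process name)
ProcessTable = Dict[int, Tuple[int, str]]


def ancestry(pid: int, table: ProcessTable) -> List[str]:
    """Return process names from `pid` up to the root, nearest first."""
    chain: List[str] = []
    seen = set()
    while pid in table and pid not in seen:  # `seen` guards against cycles
        seen.add(pid)
        parent, name = table[pid]
        chain.append(name)
        pid = parent
    return chain


def agent_spawned_shell(pid: int, table: ProcessTable) -> bool:
    """True if `pid` is a shell with an AI agent somewhere above it."""
    chain = ancestry(pid, table)
    if not chain or chain[0] not in SHELL_NAMES:
        return False
    return any(name in AGENT_NAMES for name in chain[1:])
```

The human's interactive `zsh` and the agent's `bash` look identical by user and command line; only the ancestry chain tells them apart, which is why the process-ancestry telemetry has to be switched on first.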

Layer 5: The supply chain

Three distinct threats live here, and they're escalating at different rates.

Your agents are writing vulnerable code. The Q1 2026 numbers are not subtle: ~42% of all code is now AI-generated or AI-assisted (Sonar), 87% of AI-generated pull requests contain security defects, and AI-generated code carries 2.74× the vulnerability density of human-written code across 100+ models studied (Veracode). AI-generated code now causes roughly 1 in 5 enterprise breaches. The CVE trajectory--6 in January, 15 in February, 35 in March--is not plateauing. The control is unglamorous and mandatory: scan all AI-generated code before merge, flag and track it separately in your pipeline, and stop trusting it implicitly because it looks fluent. Open source scanning tools are table stakes here: Semgrep for static analysis, Gitleaks for secret detection, pip-audit and npm audit for dependency vulnerability scanning, OWASP Dependency-Check for broader coverage. Commercial: Sonar and Veracode (both cited in the Q1 numbers above) have AI-generated code flagging built in now.

But scanning a CI/CD pipeline is only half the battle when an agent needs to execute its own generated code dynamically in real time. Running untrusted, non-deterministic scripts on the host kernel is the obvious risk: standard containers like Docker are vulnerable to kernel escape, and microVMs like Firecracker add orchestration complexity and cold-start latency that degrade agent responsiveness. The state of the art in 2026 has shifted to WebAssembly (WASM) and the WebAssembly System Interface (WASI). Runtimes like Wasmtime spin up isolated, ephemeral execution environments in under a millisecond--sandboxed by default, with strict filesystem path-traversal prevention, resource ceilings, and SSRF protection built into the interface. The design makes host escape structurally difficult: a compromised agent cannot traverse out of its scoped workspace without defeating the sandbox at the architecture level, not just the policy level.

Skill injection is the new supply-chain payload, and it's worse than malicious packages. Every major coding agent now supports frictionless install of community "skills"--markdown plus bundled scripts that load automatically, persist across sessions, sometimes install globally. Snyk's ToxicSkills audit--a scan of 3,984 skills from ClawHub and skills.sh--found that 91% of malicious skills combine traditional malware with prompt injection, embedding hidden instructions that manipulate the agent's runtime reasoning; 13% of recently installed skills contain a critical security flaw.

A systematic academic analysis documents the attack patterns in detail: they escalate from instruction injection (detectable with static analysis) to self-cleaning injections that execute and delete themselves inside a single context window, to remote-C2 variants that fetch payloads from attacker-controlled endpoints. The last two leave no local forensic evidence. Mitiga's research demonstrated silent exfiltration of .env, ~/.aws/credentials, and ~/.gitconfig to external endpoints without triggering any warning--a single one-click install can quietly wire exfiltration straight into your delivery flow while your audit trail stays empty.

Why it beats a malicious npm package: the malicious code is never in the repo, static analysis can't see it, there's no forensic trace after execution, it's reusable after detection (wait, reactivate), and attribution points at "the AI's decision-making" rather than a package author. Your EDR sees the file read and the network call--but it sees the same PID and ancestry as a legitimate session, and it cannot see the instruction that turned a code-review tool into a credential harvester, because that instruction only ever existed in the context window.

If this reads like a threat model from a slide deck, Johann Rehberger's body of work is the proof that it isn't. Rehberger (Director of Red Team, Electronic Arts; author of Embrace the Red; @wunderwuzzi23) has done more than anyone to demonstrate, rather than theorize, what indirect prompt injection does to real agentic products. Across "The Month of AI Bugs" (August 2025), he disclosed more than two dozen vulnerabilities spanning every major agentic coding assistant. He manipulated M365 Copilot into exfiltrating emails, MFA codes, and financial records by chaining prompt injection with ASCII smuggling (hidden instructions in invisible Unicode tag characters). He showed Claude Code leaking data over DNS using only auto-allowlisted commands--ping, host, nslookup, dig--that never trigger a confirmation prompt. He poisoned ChatGPT's long-term memory so an injection persisted across sessions and kept exfiltrating, a technique he named "SpAIware." And he built "AgentHopper," a proof-of-concept self-propagating AI worm: a prompt injection planted in one repository infects the developer's coding agent, which carries it to every other repo on the machine and spreads it via git push.

His [un]prompted 2026 talk was titled "Your Agent Works for Me Now." That title is the whole thesis: the moment an agent ingests attacker-controlled content, it stops working for you. An injected agent doesn't break--it obeys, just not you. Cochran's identity containment and Niyikiza's capability tokens are direct answers to exactly this attack.
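The ASCII smuggling trick mentioned above is simple enough to detect at the input boundary. Characters in the Unicode Tags block (U+E0000 to U+E007F) render as nothing in most interfaces, and U+E0020 through U+E007E mirror printable ASCII one-for-one. The sketch below decodes and strips them; the function names are illustrative, and `smuggle` exists only to build a test input.

```python
TAG_BLOCK_START = 0xE0000
TAG_BLOCK_END = 0xE007F
TAG_ASCII_FIRST = 0xE0020  # mirrors ASCII space (0x20)
TAG_ASCII_LAST = 0xE007E   # mirrors ASCII tilde (0x7E)


def hidden_tag_text(text: str) -> str:
    """Decode any printable-ASCII payload carried in invisible tag characters."""
    return "".join(
        chr(ord(ch) - TAG_BLOCK_START)
        for ch in text
        if TAG_ASCII_FIRST <= ord(ch) <= TAG_ASCII_LAST
    )


def strip_tag_characters(text: str) -> str:
    """Remove every character in the Unicode Tags block."""
    return "".join(
        ch for ch in text if not (TAG_BLOCK_START <= ord(ch) <= TAG_BLOCK_END)
    )


def smuggle(payload: str) -> str:
    """Demo helper only: encode printable ASCII as invisible tag characters."""
    return "".join(chr(TAG_BLOCK_START + ord(ch)) for ch in payload)
```

A non-empty result from `hidden_tag_text` on untrusted content is a strong signal by itself, since ordinary documents have little reason to carry a readable ASCII string in that block.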

The asymmetry has shifted on discovery, too. Anthropic's Project Glasswing and the restricted Claude Mythos Preview demonstrated frontier-model capability at finding and exploiting vulnerabilities that surpasses all but the most skilled humans--a 27-year-old OpenBSD bug, a 16-year-old FFmpeg bug that automated tools had scanned millions of times, a four-vulnerability browser exploit chain. If Anthropic can build it, others can build something similar. Every line of code you depend on is now simultaneously more vulnerable and more discoverable than it has ever been.

This is also where MCP earns its own OWASP Top 10: server supply chain, capability declarations, and prompt injection through tool descriptions are a cross-cutting surface that doesn't fit neatly into any one tier. Two CVEs made the attack concrete: MCPoison (CVE-2025-54136) and CurXecute (CVE-2025-54135) proved that an attacker who controls an MCP server can write directives directly into tool descriptors that the agent will hand to its model--no sanitization, no provenance check, full ambient authority. (I wrote about the broader question of when MCP even makes sense in an agentic context in MCP When Your Agent Already Has Hands?)
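One plausible mitigation for descriptor tampering is to pin a fingerprint of each tool descriptor when a human approves it and refuse anything that later differs. This is a sketch of the idea, not a description of how any MCP client behaves; the descriptor shape (a dict with `name` and `description`) and the function names are assumptions.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(tool: dict) -> str:
    """Hash a tool descriptor in a canonical JSON form."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


def pin(tools: List[dict]) -> Dict[str, str]:
    """Record fingerprints at the moment the tools are reviewed and approved."""
    return {tool["name"]: fingerprint(tool) for tool in tools}


def changed_tools(pins: Dict[str, str], tools: List[dict]) -> List[str]:
    """Names of tools that are new or whose descriptor changed since pinning."""
    return sorted(
        tool["name"] for tool in tools if pins.get(tool["name"]) != fingerprint(tool)
    )
```

Pinning does nothing about a descriptor that was malicious at approval time; it only turns a silent post-approval change into a visible event that can be denied and logged.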


So that's the state of the art. Here's the new kid on the block.

Read back over those five layers. Input boundary, output monitoring, the hook-and-policy control plane, observability, supply chain. Every published framework--OWASP, Microsoft, Red Hat, IBM, Cisco, Salt Security--covers some subset of them well.

Not one of them monitors what the model is actually doing.

Every control I've described operates at the perimeter of the model. Input filters read what goes in. Output monitors read what comes out. Hooks govern what the agent does with the result. Observability records all of it. But the model itself--the forward pass, the residual stream, the place where meaning is actually assembled--is treated as an opaque box that you're only allowed to inspect at its edges.

This is exactly where network security was in the 1990s. We had firewalls at the perimeter and we trusted everything inside. Then we learned that the perimeter is porous and the interesting attacks happen inside, and we built EDR--endpoint detection that watches what's actually executing on the host, not just what crosses the boundary.

AI security needs its EDR moment, and the technology for it exists. (I made this argument at length in The Model Is Thinking. We're Not Watching.)

Consider the case the perimeter cannot see. A single request arrives--benign-looking input, clean refusal on output. The input filter passes it. The output monitor records a refusal and logs nothing of concern. But inside that forward pass, before the safety tuning suppressed the completion, the model internally engaged the concept of exfiltration--computed it, allocated it, weighted it against its training--and then declined to say so. The perimeter sees a clean refusal. The activations saw the intent.

You don't need the model to say something bad to know it's thinking about it.

And, of course, when RLHF fails on the sixteenth try, your chance to intervene is gone.


The new layer: Mechanistic Interpretability

"Glass-box" is Starseer's framing--Carl Hurd (CTO) presented "Glass-Box Security: Operationalizing Mechanistic Interpretability" at [un]prompted 2026, and it's a good term. They are the leaders in operationalizing this for security teams, and I'll point to them and Goodfire below as the commercial-grade options.

We hear constantly that LLMs are black boxes. They are not. We can read the high-dimensional vector representations at every layer of computation, and lightweight classifiers--probes--can tell us which concepts a model has engaged with on a given input, regardless of what it ends up saying.

If you like, you can see two of my papers on arXiv: the CAZ Framework and GEMs, and I've written about both on Waypoint: Enter The Concept Allocation Zone and Using CAZ to See What the Models are Thinking. But in the end the applied artifact is the Concept Integrity Auditor (CIA)--a security sensor for LLM inference. It instruments an open-weight model with forward hooks, extracts activation geometry during the forward pass, and scores a library of security-relevant concepts: authorization, exfiltration, obfuscation, deceptive intent, urgency, source credibility, threat severity, negation, causation. No fine-tuning, no model modification--if you have the weights, you can run it.
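To make "probe" concrete: the simplest version is a linear classifier over one activation vector, scored per concept. The sketch below is a generic logistic probe in plain Python, not the CIA's or CAZ's actual method; the four-dimensional activation, the weights, and the threshold are invented for illustration. Real activations have thousands of dimensions and the weights are fit on labeled data.

```python
import math
from typing import Dict, List, Sequence, Tuple

# concept -> (weights, bias); hypothetical values, normally learned offline
Probe = Tuple[Sequence[float], float]

PROBES: Dict[str, Probe] = {
    "exfiltration": ([2.0, -1.0, 0.0, 1.5], -1.0),
    "urgency": ([0.0, 0.5, 2.0, 0.0], -1.0),
}


def probe_score(activation: Sequence[float], probe: Probe) -> float:
    """Logistic probe: sigmoid of (weights . activation + bias)."""
    weights, bias = probe
    if len(weights) != len(activation):
        raise ValueError("activation and probe dimensions differ")
    z = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def concept_scores(activation: Sequence[float], probes=PROBES) -> Dict[str, float]:
    """Score one activation vector against every concept in the library."""
    return {name: probe_score(activation, probe) for name, probe in probes.items()}


def fired(scores: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Concepts whose probe score meets the threshold."""
    return sorted(name for name, score in scores.items() if score >= threshold)
```

With the activation `[1.0, 0.0, 0.0, 1.0]`, the exfiltration probe scores about 0.92 and the urgency probe about 0.27, so only exfiltration fires at the 0.8 threshold. Nothing in the scoring depends on what the model eventually outputs.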

What matters about the framing: CIA is a sensor. The security-interesting event is that something asked the model to engage a security-relevant concept. Whether the model's training then caused it to refuse is metadata, not exoneration--the next prompt from the same actor may target a less-guarded model, use a more sophisticated framing, or be one piece of a larger campaign, such as an RLHF exhaustion campaign (remember, non-deterministic safety measures are unreliable by nature). CIA's job is to surface that a concept was activated inside the model and emit a SOC-ingestible event (it forwards UDM events to Google Chronicle today). That's precisely the telemetry the output monitor throws away.

A valuable verdict the CIA produces is one nothing else can: surface clean, concept allocated mid-network, concept faded before output. The input looked benign, the output looked benign, and the model internally engaged the concept anyway. That's the case caught on the first pass--before any output, before any rule fires. The key idea behind Mechanistic Interpretability monitoring is that the whole point of an LLM is to get it to do something for you, and to do that the model has to think about the thing you're asking it to do. When it engages that concept, the CIA sees it.
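One hypothetical way to encode that verdict, given a per-layer probe score for a single concept: check whether the score crosses a threshold somewhere before the final layers and then drops back below it. The label names, the threshold, and the "last two layers count as output" rule are all assumptions made for this sketch; they are not taken from the CIA implementation.

```python
from typing import Sequence


def concept_verdict(
    surface_flagged: bool,
    layer_scores: Sequence[float],
    threshold: float = 0.8,
    tail: int = 2,
) -> str:
    """Classify one concept's trajectory across layers for a single request.

    surface_flagged: whether input/output filters already raised something.
    layer_scores: probe score per layer, ordered from first to last layer.
    tail: how many final layers are treated as "near the output".
    """
    if not layer_scores:
        raise ValueError("layer_scores must not be empty")
    if surface_flagged:
        return "surface_flagged"
    earlier, late = layer_scores[:-tail], layer_scores[-tail:]
    engaged_mid = any(score >= threshold for score in earlier)
    present_late = any(score >= threshold for score in late)
    if present_late:
        return "allocated_at_output"
    if engaged_mid:
        return "allocated_then_faded"  # the case the perimeter cannot see
    return "clean"
```

For scores like `[0.1, 0.3, 0.9, 0.95, 0.4, 0.1]` with clean input and output, this returns `"allocated_then_faded"`: the concept peaked mid-network and was gone by the time anything was emitted.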

Against a benign customer-support corpus the calibrated probes hold a 0% false-positive rate--they do not fire on normal traffic. On instruction-tuned production-class models, post-calibration detection on direct-harmful and jailbreak corpora climbs into the 90s. CIA is proof-of-concept-grade, but the feasibility is no longer in question.

If you need production-grade MechInterp today, Starseer is where I'd send you first--they are operationalizing this at enterprise scale and their [un]prompted presentation is the clearest public articulation of the approach I've seen. Goodfire is doing foundational interpretability work with a $150M Series B behind it, focused on making model internals steerable and auditable. Both are building the commercial layer on top of the science this section describes.

Aside from them, as far as I know, nobody is operationalizing Mechanistic Interpretability as a security control--and we all should be.


The Limits

Mechanistic Interpretability is not magic. It requires access to activations. It works only on models where you have the weights--anything where you control the inference stack. It does not work on closed-source APIs (Claude, GPT, Gemini) unless the provider exposes activation data. The obvious direction for frontier-model deployments is a sidecar: run a smaller open-weight model in parallel, feed it the same inputs, and use its activations as a proxy signal. It's not a perfect substitute--different architecture, different representations--but cross-architecture concept geometry is more consistent than you'd expect, which is part of what the CAZ work across 34 models was measuring. Imperfect interpretability of a frontier model beats no interpretability at all.

This is why sovereign and self-hosted infrastructure isn't a separate topic from interpretability--it's the prerequisite for the full version. You cannot look inside a model you rent through someone else's API. Sovereign compute plus open models plus interpretability tooling is the only combination that yields AI you can actually verify rather than merely trust. (Inference Comes Home is where I wrote about this shift.)

We can label only a sliver of the organized computation. Current Mechanistic Interpretability techniques--my CAZ work, Sparse Autoencoders, and the rest--look only at a specific, pre-calculated portion of the model.

Mechanistic Interpretability monitoring is a complementary component to a healthy Agentic AI Security Stack. It is a sixth layer on top of the five mature ones--it operates on meaning instead of vocabulary, which makes it robust to the paraphrase attacks that defeat pattern matching, and it sees intent the perimeter structurally cannot. But it sits alongside input and output filters, the agentic hooks, policy engines, and observability. It does not retire any of them.


What "state of the art" actually means in June 2026 for Agentic AI Security

Pulling the whole thing together, a complete Agentic Security Program (in June 2026) looks like this:

  1. An input boundary that cascades regex → semantic → classifier → LLM-as-judge, cheap-first, with risk-scored Block/Alert/Log responses.
  2. Output monitoring for secrets, PII, policy violations, and vulnerable code--with the awareness that it only sees what the model chose to say.
  3. A real control plane: lifecycle hooks wired to a policy engine (Cedar/OPA), cryptographically signed Agent Passports and scoped identities to enforce human-confirmed authorization, and Article 14(4)(e) immediate-stop override controls.
  4. Observability that solves intent attribution: OTel gen_ai.* flowing into the SIEM, EDR process-ancestry distinguishing agent from human, your existing stack reconfigured rather than re-procured.
  5. Supply-chain controls: scan-and-flag AI-generated code, execute untrusted scripts dynamically within ephemeral WASM/WASI sandboxes, verify skill/MCP provenance, and treat the model and plugin ecosystem as an attack surface in its own right.
  6. Mechanistic Interpretability monitoring where you control the stack: activation-level concept sensing that catches the intent the perimeter can't--with a clear-eyed view of its access requirement and its 98.4% of dark matter.

And the operating principle that orders all of it. Snowflake put it well at [un]prompted: visibility precedes control. I'd extend it:

Visibility precedes control. Control precedes trust. And trust is what critical infrastructure requires.

The tools to do every layer of this exist today. Hooks, harnesses, policy engines, observability platforms, interpretability frameworks--all shipping, all configurable now. The gap between the organizations that are secure and the ones that aren't is no longer technical. It's whether someone has turned the controls on, wired them together, and committed to watching not just the edges of the model--but the thinking inside it.


James Henry is a Senior Security Consultant and independent AI/ML interpretability researcher. He writes about AI security, mechanistic interpretability, and the economics of machine intelligence at waypoint.henrynet.ca. The Rosetta Program's CAZ (arXiv:2605.24856) and GEMs (arXiv:2605.25848) papers are public; the Concept Integrity Auditor is proof-of-concept research.
