
The Blast Radius Is the Graph


Si Pham

Reading Time: 6 mins

Why Model Guardrails Cannot Fix Agentic Risks

In November 2025, the theoretical finally collided with the kinetic. Anthropic disclosed that a state-sponsored threat group (GTG-1002) had successfully weaponised Claude Code.¹ They didn’t do it by "jailbreaking" the model—at least, not in the way Reddit script-kiddies try to trick ChatGPT into writing malware. They didn’t find a zero-day in the inference engine.


They simply used it.


They wired the model into an orchestration framework, gave it tools, and fed it a sequence of tasks that, individually, looked like mundane software engineering. The model didn’t know it was conducting espionage; it thought it was debugging a network.


If you read my previous piece on the Distributed Query Attack (DQA), you know I’ve been warning about this specific blindness: the inability of stateless models to see stateful intent.


The GTG-1002 incident is the industry’s wake-up call. It proves that the safety boundary has moved. It is no longer about what the model says; it is about what the orchestration layer does.


The "Stateless Trap"


The industry has spent three years obsessed with whether a model might say a naughty word or produce biased output. We built guardrails that act like airport security scanners: they check every bag (prompt) individually, but they never check who owns the bags or where they are going.


Consider the attack chain used by GTG-1002:

  1. "How do I scan a port in Python?" -> Guardrail sees: Educational coding query. (SAFE)

  2. "Write a script to connect to port 443." -> Guardrail sees: Dev automation task. (SAFE)

  3. "List all open connections." -> Guardrail sees: Standard diagnostic. (SAFE)


The guardrail sees three safe bags.


The Orchestration Layer, however, sees a kill chain: Reconnaissance → Weaponisation → Delivery.


The attackers used context-splitting. They broke the operation into fragments so small and clinically detached that the model’s safety filters never triggered. The malicious intent didn't live in the prompt; it lived in the graph—the sequence of connected actions managed by the agentic framework.


We are trying to catch a movie by looking at individual frames in random order. It doesn't work.
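
To make the gap concrete, here is a minimal sketch in Python (purely illustrative: the topic labels, tactic labels, and kill-chain mapping are hypothetical placeholders, not a real classifier). The stateless check approves each prompt in isolation; the stateful check looks at the sequence the session has assembled.

# Illustrative sketch: why per-prompt checks pass while the session is hostile.
# The labels and lists below are hypothetical placeholders, not a real product.

BLOCKED_TOPICS = {"malware", "exploit development"}           # what a stateless filter screens for
KILL_CHAIN = ["reconnaissance", "weaponisation", "delivery"]  # what only the session reveals

def stateless_check(prompt_topic: str) -> bool:
    """Per-prompt guardrail. True means 'safe'. It sees one bag at a time."""
    return prompt_topic not in BLOCKED_TOPICS

def stateful_check(session_tactics: list[str]) -> bool:
    """Session guardrail. True means 'safe'. Flags sessions whose steps assemble a kill chain."""
    return not all(stage in session_tactics for stage in KILL_CHAIN)

session = [
    ("port scanning basics", "reconnaissance"),
    ("script to connect to port 443", "weaponisation"),
    ("list all open connections", "delivery"),
]

print(all(stateless_check(topic) for topic, _ in session))  # True  -- every prompt passes in isolation
print(stateful_check([tactic for _, tactic in session]))    # False -- the sequence is a kill chain

Neither function is difficult to write. The problem is architectural: nothing in today's stack is positioned to run the second one, because nothing in today's stack holds the session.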


The Missing Middle: A Taxonomy of Failure


To understand why we missed this, we have to look at the modern AI stack. Currently, we have robust security at the top and bottom, but a gaping hole in the middle.


1. The Model Layer (The Filter)

  • Current State: Heavily fortified. RLHF, semantic toxicity classifiers, refusal patterns.

  • Why it failed: It is stateless. It has no memory of the previous API call. If the attacker asks for a "network test script," the model complies because that is a helpful, valid request. The model cannot know that five minutes ago, the same user asked for a list of vulnerable government IP addresses.

2. The Tool Layer (The Perimeter)

  • Current State: Emerging. We are seeing frameworks like NVIDIA NeMo Guardrails and basic tool permissioning (e.g., "This agent can only read, not write").

  • Why it failed: Permissions are binary. You either have permission to run nmap or you don't. If you give a developer agent permission to use network tools (which you must, for it to be useful), you have implicitly authorised the capability for reconnaissance. You haven't authorised the intent (see the sketch after this list).

3. The Orchestration Layer (The Wild West)

  • Current State: Non-existent.

  • Why it failed: This is where the "session state" lives. This is where the agent framework (MCP, AutoGen, LangChain, custom stacks) decides what to do next. In the GTG-1002 attack, the orchestration layer was the weapon. It held the logic that stitched the benign model outputs into a malign operation.
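
Here is that Tool Layer problem in code: a minimal, hypothetical allowlist check of the kind most agent frameworks ship today. It can answer "may this agent run this tool?" but it has no concept of why the tool is being run, or what was run immediately before it.

# Illustrative sketch of a binary tool-permission check (hypothetical allowlist).
# It authorises capability, never intent.

AGENT_PERMISSIONS = {
    "dev-agent": {"read_file", "run_nmap", "execute_script"},
}

def is_allowed(agent: str, tool: str) -> bool:
    return tool in AGENT_PERMISSIONS.get(agent, set())

print(is_allowed("dev-agent", "run_nmap"))  # True -- and because every single call is
                                            # authorised, the reconnaissance phase built
                                            # from those calls is authorised by default.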

[Figure: Orchestration-driven AI workflow, from human operator to model, execution tools, and target environment.]

Visualising the Gap

To really see the problem, we have to look at the flow of time. A stateless guardrail (the Model) approves every step because every step is valid in isolation. A stateful orchestrator (the System) is the only thing that can see the pattern.

[Figure: The GTG-1002 agentic attack workflow. Orchestration routes multiple individually "safe" model outputs into execution tools that breach the target environment, underscoring the need for stateful AI security.]


SIGINT for Systems: The Rise of Stateful Guardrails

If the attack is "distributed collection" (gathering safe fragments to build a dangerous whole), the defence must be Pattern-of-Life analysis.


We need to stop treating AI security as a content moderation problem ("Is this text bad?") and start treating it as an operational security problem ("Is this behaviour anomalous?").


This requires a shift to Stateful Guardrails.


1. Velocity and Sequence Monitoring

We need rate limiting that isn't just about API costs. We need semantic velocity checks.

  • Stateless: "User ran 50 queries." (Who cares?)

  • Stateful: "User queried 'network topology', 'credential storage', and 'data exfiltration syntax' within a 5-minute window." (Alert.)
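
A minimal sketch of what that looks like, assuming a hypothetical upstream classifier has already labelled each prompt with a topic (the topic list, window size, and threshold are illustrative):

# Illustrative sketch of a semantic velocity check. A real system would label
# prompts with an embedding-based classifier; here the topics arrive pre-labelled.
from collections import deque
import time

SENSITIVE_TOPICS = {"network topology", "credential storage", "data exfiltration"}
WINDOW_SECONDS = 300  # the 5-minute window from the example above
THRESHOLD = 3         # distinct sensitive topics before the session is flagged

class VelocityMonitor:
    def __init__(self):
        self.events = deque()  # (timestamp, topic)

    def record(self, topic: str) -> bool:
        """Returns True if the session should be flagged."""
        now = time.time()
        if topic in SENSITIVE_TOPICS:
            self.events.append((now, topic))
        while self.events and now - self.events[0][0] > WINDOW_SECONDS:
            self.events.popleft()  # drop events that fell outside the window
        return len({t for _, t in self.events}) >= THRESHOLD

monitor = VelocityMonitor()
for topic in ["network topology", "credential storage", "data exfiltration"]:
    flagged = monitor.record(topic)
print(flagged)  # True -- three sensitive topics inside one window

The specific threshold matters less than the unit of analysis: the session, not the prompt.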

2. Knowledge Assembly Detection

This is the SIGINT approach. We need systems that sit above the model, caching the semantic vector of recent user interactions. If the aggregate vector starts drifting towards a restricted topic (e.g., explosives, cyber-offensive ops), the session is flagged, even if the individual prompts are sterile.
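
A sketch of the mechanism, with a placeholder embed() standing in for a real embedding model and an arbitrary drift threshold (both are assumptions, not recommendations):

# Illustrative sketch of knowledge-assembly detection. embed() is a placeholder
# that returns random unit vectors; a production system would call a real
# embedding model and calibrate the threshold against known-benign sessions.
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

RESTRICTED_CENTROID = embed("offensive cyber operations")  # hypothetical restricted-topic anchor
DRIFT_THRESHOLD = 0.2

class SessionDriftMonitor:
    def __init__(self):
        self.history = []

    def observe(self, prompt: str) -> bool:
        """Returns True when the aggregate session vector drifts toward the restricted topic."""
        self.history.append(embed(prompt))
        aggregate = np.mean(self.history, axis=0)
        aggregate /= np.linalg.norm(aggregate)
        return float(aggregate @ RESTRICTED_CENTROID) > DRIFT_THRESHOLD

monitor = SessionDriftMonitor()
for prompt in ["list subnet ranges", "where are SSH keys stored?", "compress and upload a directory"]:
    if monitor.observe(prompt):  # with the placeholder embed() this rarely fires; a real model is the point
        print("Flag session for human review")

The individual prompts stay sterile; the aggregate is what gets judged.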

3. Observability as a Security Feature

In the Claude Code incident, the logs existed, but they were likely viewed as "debug data." True agentic security means Observability-Native AI. The graph of tool execution (User -> Agent -> Tool A -> Output -> Tool B) must be visible in real time. If you cannot reconstruct the execution path, you cannot govern the execution.
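
A minimal sketch of what observability-native execution logging could look like (the field names, tool labels, and summaries are hypothetical):

# Illustrative sketch: record every hop in the execution graph as it happens,
# so the path can be queried in real time rather than reconstructed from debug logs.
import time
import uuid

class ExecutionTracer:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.edges = []

    def record(self, source: str, target: str, summary: str):
        self.edges.append({
            "session": self.session_id,
            "ts": time.time(),
            "source": source,   # e.g. "user", "agent", "tool:port_scanner"
            "target": target,
            "summary": summary,
        })

    def path(self) -> str:
        """Reconstructs the execution path: User -> Agent -> Tool A -> ..."""
        return " -> ".join([self.edges[0]["source"]] + [edge["target"] for edge in self.edges])

tracer = ExecutionTracer(session_id=str(uuid.uuid4()))
tracer.record("user", "agent", "debug the staging network")
tracer.record("agent", "tool:port_scanner", "scan 10.0.0.0/24")
tracer.record("tool:port_scanner", "agent", "37 open ports")
print(tracer.path())  # user -> agent -> tool:port_scanner -> agent

Feed those edges into the velocity and drift checks above and the orchestration layer stops being a blind spot.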


The Inconvenient Truth

The uncomfortable reality for Silicon Valley is that tools, not prompts, are the new security surface.

We have spent years wringing our hands over whether an AI might output hate speech (a valid concern), but we largely ignored the fact that we were giving these models programmatic access to Bash terminals, SQL databases, and file systems via protocols like MCP (the Model Context Protocol).


The GTG-1002 attack didn't happen because the model was "jailbroken." It happened because the model was competent. It followed instructions perfectly.

Governance as Code

At AIBoK, we talk about "Governance as Code." This isn't just a buzzword. It means that your policy documents—those dusty PDFs saying "Don't use AI for hacking"—need to be translated into runtime logic in the orchestration layer.


If your governance says "No unauthorised penetration testing," your orchestration layer needs a rule that flags the sequence recon -> script_gen -> execute when targeted at internal subnets.
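
As a sketch, that rule might look like the following (the action names, subnet ranges, and rule format are all assumptions; the point is that the written policy becomes executable sequence logic):

# Illustrative sketch of a "Governance as Code" rule. Action names, the subnet
# check, and the rule format are hypothetical; the point is that the written
# policy ("no unauthorised penetration testing") becomes runtime logic that
# inspects the sequence of agent actions.
import ipaddress

INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8"), ipaddress.ip_network("192.168.0.0/16")]
FORBIDDEN_SEQUENCE = ["recon", "script_gen", "execute"]

def is_internal(target: str) -> bool:
    try:
        addr = ipaddress.ip_address(target)
    except ValueError:
        return False
    return any(addr in net for net in INTERNAL_NETS)

def violates_pentest_policy(actions: list[dict]) -> bool:
    """Flag a session whose actions contain recon -> script_gen -> execute,
    in that order, against an internal target."""
    stage = 0
    for action in actions:
        if action["type"] == FORBIDDEN_SEQUENCE[stage] and is_internal(action.get("target", "")):
            stage += 1
            if stage == len(FORBIDDEN_SEQUENCE):
                return True
    return False

session = [
    {"type": "recon", "target": "10.1.4.22"},
    {"type": "script_gen", "target": "10.1.4.22"},
    {"type": "execute", "target": "10.1.4.22"},
]
print(violates_pentest_policy(session))  # True -- block or escalate before execution

Evaluated before each tool dispatch, a rule like this can block the execute step rather than merely record it.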

We are building stateful systems in a world designed around stateless APIs. Until we fix that architectural mismatch, we aren't securing agents; we're just putting a padlock on a library while leaving the back door to the chemistry lab wide open.


The blast radius isn't the prompt. The blast radius is the graph.


References

¹ Anthropic Threat Intelligence, "Disrupting the first reported AI-orchestrated cyber espionage campaign," November 2025. Available at: https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf


Come and talk to the AIBoK team at the NanoSec Asia Parallel Pulse 2025 Conference


AI Body of Knowledge is an International Supporting Partner of the NanoSec.Asia Parallel Pulse 2025 conference 🇲🇾


Our co-founder Si Pham is in Kuala Lumpur connecting with cybersecurity and AI governance leaders across APAC.


If you’re at the conference: Come find Si Pham to discuss how your team can move from AI hype to genuine capability — particularly around GenAI security, governance frameworks, and distributed query attacks.


Find out more or request a free AI Readiness Diagnostic by visiting https://aibok.org/cyber#free-ai-diagnostic


