NIST has published a mathematical argument for a security lesson many AI teams have already been learning the hard way: fixed guardrails are not enough to protect deployed AI systems from adaptive prompt attacks.
In a June 9 notice, the National Institute of Standards and Technology described a peer-reviewed proof by senior scientist Apostol Vassilev that applies the logic of Kurt Godel’s incompleteness theorems to AI security. The practical claim is not that AI defenses are pointless. It is that no finite set of model rules, refusal policies, filters, or surrounding controls can be treated as permanent protection against every adversarial prompt.

What NIST is actually warning about
The proof focuses on adversarial prompts: inputs crafted to make an AI system ignore, reinterpret, or work around the instructions and controls meant to constrain it. In consumer chatbots, that may look like a jailbreak. In business systems, the risk is broader because models are increasingly connected to files, email, calendars, code repositories, internal databases, browser sessions, and other tools.
NIST’s framing matters because it pushes prompt injection out of the category of one-time model-hardening work. A vendor can add refusal behavior. A platform team can put filters around inputs and outputs. A security team can block known jailbreak strings. But attackers can keep changing the language, context, encoding, and chain of instructions they use. Human language gives them an enormous search space.
That is especially relevant for AI agents. In a separate request for information earlier this year, NIST’s Center for AI Standards and Innovation called out risks that arise when model outputs are combined with software actions. The examples included indirect prompt injection, data poisoning, insecure models, and agent behavior that harms security even without an obviously malicious input.
Why the old patching mental model breaks down
Traditional software security often starts with a defect: a memory corruption bug, an exposed endpoint, a bad permission check, or a vulnerable dependency. Teams can reproduce the bug, patch it, ship a fix, and track whether affected systems have been updated. AI security still has software bugs, but prompt attacks add a less deterministic layer.
A guardrail can fail because of the way a request is phrased, because malicious instructions are hidden in retrieved content, because an agent is asked to summarize data that contains commands, or because one tool’s output becomes another tool’s trusted input. The same model behavior that makes a system flexible and useful can also make it hard to prove that a control will hold under every future interaction.
That does not mean organizations should give up on guardrails. It means they should treat them like living controls. NIST’s notice points to three operational responses: continual red-teaming to find new adversarial prompts, continuous updates to harden deployed defenses, and resilience planning for cases where an exploit succeeds.
What this changes for AI teams
For companies deploying AI systems, the immediate takeaway is that “we tested the model” is too weak as a long-term security claim. The useful question is whether the organization can keep testing the full system as its prompts, tools, models, permissions, retrieval sources, plugins, and business workflows change.
That shifts attention toward practical controls around the model. Teams should know which AI systems are deployed, what data they can reach, which actions they can trigger, which external sources they ingest, and which logs can explain a bad decision after the fact. Agent access should be scoped like any other privileged software identity, with separate permissions for reading, writing, sending, deleting, purchasing, changing code, or touching production systems.
Prompt-injection testing also needs to include indirect paths. A model connected to email should be tested against malicious email content. A coding assistant should be tested against hostile repository text, issue comments, package metadata, and documentation. A research or browser agent should be tested against web pages that try to override user instructions, exfiltrate context, or push the model into unsafe tool use.
The same logic applies to vendors. Enterprise buyers should ask how often a provider updates jailbreak defenses, whether it red-teams tool-using agents rather than only base chat behavior, how it handles reported prompt-injection paths, and whether customers can monitor or restrict high-impact actions. Claims about alignment or safety are less useful than evidence of runtime controls, incident response, and measurable reduction in exploitability.
The agent security race is already underway
NIST has been moving toward this agent-specific view for months. Its AI Agent Standards Initiative, announced in February, is aimed at standards, protocols, security, identity, and interoperability for agents that can take autonomous action. The guardrail proof gives that agenda a sharper technical edge: agent safety cannot be reduced to a static policy file or a model card.
For security teams, the better benchmark is operational. Can they discover AI agents on the network? Can they see what tools those agents can use? Can they revoke permissions quickly? Can they detect strange chains of actions? Can they reconstruct what external content influenced a model’s output? Can they recover when an agent sends the wrong email, exposes data, commits bad code, or changes a configuration?
NIST’s proof does not hand attackers a new technique. Its value is more uncomfortable: it removes the illusion that any fixed defensive layer can settle the problem. AI systems can still be made harder to attack, but the work looks less like installing a permanent shield and more like running a security program that keeps learning as fast as the attackers do.