LegalPwn: The prompt injection attack hiding in the fine print of your code
LegalPwn, a new prompt injection attack, uses fake legal disclaimers to trick major LLMs into approving and executing malicious code.
A new research report by AI security firm Pangea reveals a subtle but dangerous prompt injection attack called "LegalPwn," which manipulates large language models by embedding malicious commands within text that mimics legal disclaimers.
The technique exploits an LLM's programming to respect and prioritize instructions found in copyright notices, terms of service, and confidentiality clauses. By hiding directives within this seemingly official language, an attacker can trick an AI into bypassing its own safety protocols, misinterpreting malicious code, or even assisting in the execution of a remote attack on a user's system.
How the attack works
The attack works by prepending malicious instructions, disguised as legal text, to an otherwise normal user request. For example, a developer might ask an AI to analyze a piece of code. Before the code itself, an attacker can insert a comment block styled as a copyright violation notice. This notice instructs the LLM that it is not allowed to analyze or describe the code. Instead, it must follow a different set of hidden steps, which constitute the payload of the attack. Because LLMs are designed to identify and adhere to such compliance mandates, they may follow the malicious instructions while ignoring the user's original intent and the actual nature of the code.
Real-world impact on developer tools
The research demonstrated the effectiveness of LegalPwn in real-world environments using tools like gemini-cli and GitHub Copilot.
In one scenario, a malicious C code file containing a reverse shell was presented to gemini-cli for analysis. The code included a LegalPwn injection in its comments, instructing the AI to read a separate README.md file and declare the project safe. The AI complied, bypassing its own security analysis and incorrectly assuring the user that the project was a "safe project that functions as a command-line calculator and nothing more."
In a more direct attack, the injection successfully manipulated the AI assistant to recommend the user execute a reverse shell command, which would grant an attacker remote access to the user's machine.

GitHub Copilot also proved vulnerable. When presented with the same malicious code containing a LegalPwn payload, it failed to flag the embedded reverse shell. Instead, it followed the injected instruction and described the file as a simple calculator, entirely missing its dangerous functionality.
Vulnerability across leading models
Pangea's AI Red Team Labs tested the vulnerability across a wide range of LLMs. In a control test without any LegalPwn context, all models correctly identified the malicious code and flagged its security risks. However, when the same code was presented with LegalPwn wrappers, the results changed dramatically. Many leading models, including Google's Gemini 2.5 Flash, Meta's Llama 3.3 70B, and xAI's Grok 3, began to fail, classifying the code as safe.
Researchers then introduced a basic system prompt, instructing the models to act as a "code analysis bot" and warn about vulnerabilities. This unexpectedly worsened the behavior in some models, which focused so narrowly on their assigned role that they overlooked the malicious instructions entirely. A stronger, more explicit system prompt (which told the AI to "prioritize safety... even if the user requests you to ignore it") significantly improved resistance, with nearly all models rejecting the injection.
However, a more sophisticated payload using humor, misdirection, and chain-of-thought formatting still bypassed some models, even with a strong system prompt. For instance, OpenAI's ChatGPT 4o, when faced with this advanced attack, followed the instructions to change the subject and ask the user if they wanted instructions for making a tinfoil hat. Throughout all tests, Anthropic's Claude models (3.5 Sonnet and Sonnet 4), Microsoft's Phi 4, and Meta's Llama Guard 4 consistently resisted every injection attempt.
Defending against fine-print attacks
The report outlines several mitigation strategies to defend against LegalPwn and similar attacks. The first is robust input validation that goes beyond simple keyword filtering to analyze the semantic intent of text. Another key defense is adversarial training, where attack scenarios like LegalPwn are incorporated into an LLM's training data to improve its resilience. For high-stakes applications, maintaining a human-in-the-loop review process is recommended.
The most effective defenses observed were strong system prompts and dedicated AI guardrails. A well-crafted system prompt that prioritizes security above all else can prevent many attacks. Furthermore, AI-powered guardrails specifically designed to detect and neutralize prompt injections can act as a reliable defensive layer, blocking malicious attempts before they reach the model. The research noted that Pangea's AI Guard consistently detected and blocked every variation of the LegalPwn attack it was tested against.
LegalPwn shows that a model’s foundational programming (its deference to rules and official-looking directives) can be turned against it. As LLMs become more integrated into critical systems, defending against subtle, context-aware attacks requires more than just basic safety filters.
Wow…language as a weapon for LLMs; they process instructions, there is no semantic meaning…