New jailbreak attack dupes image generation models
Semantic Chaining exploits the fragmented safety architecture of multimodal models, bypassing filters by hiding prohibited intent within a sequence of benign edits.
NeuralTrust researchers have identified a critical vulnerability in the safety architecture of leading multimodal models, including Grok 4, Gemini Nano Banana Pro, and Seedance 4.5. The technique, named “Semantic Chaining,” allows users to bypass core safety filters and generate prohibited content by exploiting the models’ ability to perform complex, multi-stage image modifications. This discovery demonstrates a functional flaw in how multimodal intent is governed, proving that even advanced models can be guided to produce policy-violating outputs by bypassing “black box” safety layers.
Weaponizing the workflow
Semantic Chaining differs from traditional jailbreaks that rely on a single, overtly harmful prompt. Instead, the attacker introduces a chain of semantically “safe” instructions that converge on a forbidden result. The attack works by weaponizing the model’s own inferential reasoning and compositional abilities against its safety guardrails. Current safety filters typically scan for “bad words” or specific concepts in isolated prompts, lacking the reasoning depth to track “latent intent” (the underlying, unstated goal of the user) across a multi-step instruction chain.
The exploit follows a specific four-step pattern to circumvent safety protocols. First, the user establishes a “safe base” by asking the model to imagine a generic, non-problematic scene, such as a historical setting. This creates a neutral initial context and habituates the model to the task. The second step involves a “first substitution,” where the user instructs the model to change one element of the original scene. This permitted alteration habituates the model to working through subsequent modifications and shifts its focus from creation to modification.
Once the model is in this modification mode, the attacker executes the “critical pivot.” The user commands the model to replace another key element with a highly sensitive or controversial topic. Because the model focuses on modifying an existing image rather than creating a new one, the safety filters fail to recognize the emerging prohibited context. Finally, the attacker concludes by telling the model to “answer only with the image.” The result is a fully rendered, prohibited image that successfully bypasses moderation layers in models like Grok 4 and Gemini Nano Banana Pro.
A significant aspect of this vulnerability is its ability to bypass text-based safety filters by rendering prohibited information directly into the generated image. Models that would typically refuse to provide text instructions on sensitive topics in a standard chat response can be forced to write these exact instructions onto a generated image. For example, a user might ask the model to create an image of a “technical diagram” or “educational poster” within a neutral scene.
The user then instructs the model to replace the generic text on that poster with specific, prohibited instructions. The safety filters, which scan the chat output for prohibited text, remain blind to the “bad words” being drawn pixel-by-pixel into the image. This effectively turns the image generation engine into a bypass for the model’s entire text-safety alignment. Even when the text is large and readable to humans, standard Optical Character Recognition (OCR) safety filters often fail to catch it due to font choices, stylization, perspective, or rendering artifacts.
Why the safety layer fails
To understand why this attack works, it is necessary to distinguish between the model’s “attention” and its safety overlay. Attention mechanisms in transformer models allow the AI to focus on different parts of the input sequence to understand context and relationships. One might assume the model “forgets” the safety constraints or loses track of the context during the modification steps. However, Alessandro Pignati, an AI researcher at NeuralTrust, clarified this distinction in comments provided to TechTalks.
“From a technical perspective, this is not a failure of the attention mechanism, nor a case where the model ‘loses’ the original context,” Pignati said. “The model correctly attends to all parts of the prompt, but safety evaluation is applied in a fragmented way. Each instruction is interpreted as a legitimate transformation of an otherwise allowed concept, and the system never reassesses the global intent that emerges from chaining those transformations together.” The model does not forget safety constraints; rather, the safety layer fails to reason over the cumulative semantic effect of the prompt.
This vulnerability is not limited to native chat interfaces. Pignati noted that the same vulnerability applies to API-based usage, even when the entire attack is contained within a single prompt that embeds multiple semantic operations. Current APIs typically evaluate safety at the request or output level and do not expose indicators or flags signaling that a prompt contains progressive semantic escalation. Consequently, the request appears compliant to the application, making this class of attack difficult for developers to detect.
Hardening the architecture
The discovery of Semantic Chaining suggests that traditional safety filters are insufficient against intent-based attacks. NeuralTrust argues that enterprises need a defense that can track and govern the entire instruction chain in real-time. They propose the use of “Shadow AI,” a browser plugin that acts as a proactive governance layer. By monitoring the search bar and input fields, the plugin intercepts the intent at the source before the query reaches the model.
Beyond browser-side interventions, model providers face the challenge of architecting safety layers that can understand intent across multiple turns. Pignati suggests that simply extending the safety context window is unlikely to solve the problem because the core issue is that safety mechanisms operate locally while the attack exploits global semantic structure. “Model providers need safety layers that evaluate intent across the entire prompt, even when it is presented as a sequence of transformations,” Pignati said. “This implies shifting from turn-based or instruction-based filtering to intent-aware analysis that can recognize when benign operations converge toward a disallowed outcome, regardless of how they are phrased.”
This shift is increasingly critical as the industry moves toward agentic systems that can plan and execute complex workflows. While one might expect that agents with better long-term memory would be better at detecting latent intent, Pignati warns that the opposite may be true. “While improved memory and planning could theoretically help detect latent intent, increased complexity also introduces more opportunities for semantic fragmentation,” he said. “Without treating intent aggregation as a first-class safety primitive, more capable agentic workflows are likely to amplify, rather than reduce, this class of blind spots.”





I wonder if this points to a solution: Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks https://www.anthropic.com/research/next-generation-constitutional-classifiers
Hi Ben, interesting! Wonder if for image gen at least, the 'final' image can be first scanned for maliciousness by a judge, and not be displayed if gets flagged. IOW if assessing intent isn't feasible over a chain, do it on the end result.