Prompt injection in Perplexity BrowseSafe model highlights threats of single-model protection systems

Lasso Security compromised Perplexity’s BrowseSafe guardrail model for AI browsers, proving that "out-of-the-box" tools fail to stop prompt injection attacks.

Jan 20, 2026

Lasso Security has discovered significant vulnerabilities in BrowseSafe, a new open-source tool from Perplexity designed to protect AI browsers against prompt injection attacks. Despite marketing that promised developers could “immediately harden their systems,” Lasso’s red team achieved a 36% bypass rate using standard encoding techniques. The findings show that relying on a single model for security can create dangerous blind spots, leaving agentic browsers vulnerable to hijacking.

Securing AI browsers

Perplexity released BrowseSafe to address the growing threat of browser-based prompt injection. As AI assistants evolve from simple search interfaces to autonomous agents that navigate the web, they face new attack vectors. Malicious instructions can hide in comments, templates, or invisible HTML elements. If an agent processes this content without safeguards, an attacker can override its original intent and redirect its behavior.

BrowseSafe is a content detection model fine-tuned to scan web pages in real time. It answers a specific question: does the page’s HTML contain malicious instructions? Perplexity designed the model to handle the “messiness” of HTML content and marketed it as a solution that eliminates the need for developers to build safety rails from scratch. The company claims the model flags malicious instructions “before they reach your agent’s core logic.”

How encoding bypasses semantic detection

To test these claims, Lasso Security deployed a local version of BrowseSafe and established a minimal inference environment. They avoided building a full application to isolate the model’s performance as a primary guardrail. The team assumed that any request passing through BrowseSafe would be forwarded to application logic, labeling any bypass as a “hack.”

The red team wrapped their attacks in HTML structures to mimic the browsing scenarios BrowseSafe was tuned to detect. While the model successfully identified direct malicious prompts, it struggled significantly when the intent was obfuscated. Attacks using the NATO phonetic alphabet, Pig Latin, and Base32 encoding frequently slipped past the defenses.

In comments to TechTalks, Or Oxenberg, Senior ML Engineer at Lasso Security, explained that the model fails to “normalize or canonicalize content before classification.” While BrowseSafe understands HTML context, it looks for semantic patterns.

“Even when a model is ‘tuned for browsing’ or HTML, it’s still fundamentally pattern-matching on learned representations,” he said. “Techniques like NATO phonetics, base encodings, or obscure formatting break the assumption that malicious intent will appear in a semantically obvious form.”

When an attacker encodes their intent, it never becomes visible to the semantic layer of the model. This creates a situation where the model incorrectly marks the request as safe, and the agent proceeds to execute the malicious command.

The structural limits of LLM guardrails

The failure of BrowseSafe to detect simple encoding attacks points to a fundamental limitation in using Large Language Models (LLMs) to police other LLMs. “LLMs are probabilistic and heuristic by design. Attackers, on the other hand, are deterministic and adaptive,” Oxenberg said. “That doesn’t mean LLM-based guardrails are useless - but it does mean they cannot be the final line of defense.”

This discrepancy also explains the gap between Perplexity’s internal benchmarks and Lasso’s findings.

“Internal benchmarks usually answer, ‘Does the model detect the attacks we already know?’ Attackers ask, ‘What assumptions does this system make?’ That gap explains the results we saw,” Oxenberg said, adding that static benchmarks “decay fast” because once they are published, adversaries adapt.

“A model can score well on benchmarks and still fail when attacks are encoded, obfuscated, or staged across steps,” he said.

The most significant risk identified by Lasso Security is not the technical flaw itself, but the false sense of security “plug-and-play” models create, Oxenberg warns.

“When developers believe, ‘I added the security model, therefore I’m secure,’ they often skip red-teaming, skip runtime monitoring, skip policy enforcement and skip auditability,” he said.

Perplexity’s marketing suggested that developers would not need to build safety rails from scratch. However, Lasso’s findings suggest that treating a guardrail as a risk eliminator rather than a risk reducer leaves applications less secure than if developers assumed vulnerability by default.

Strengthening the defense

Developers can still use BrowseSafe effectively as part of a broader architecture, provided they implement additional layers of defense. Lasso recommends a “shift left” approach where teams continuously test their agents against a full range of attack categories before release. This includes basic techniques like “DAN” (Do Anything Now) prompts and the encoding methods that bypassed BrowseSafe in testing.

Equally important is a “shift right” to runtime security. Even with rigorous testing, threats will eventually reach the agent. Runtime enforcement ensures that if an unsafe action slips through, the system can detect and block it before harm occurs. This includes real-time prompt injection detection, tool-use validation, and monitoring of browser actions such as form-filling.

For engineering teams with limited resources, Oxenberg suggests focusing on one critical runtime check: input normalization.

“Across almost every system we test, the core failure is making security decisions on raw input,” he said. “Decoding encodings and flattening HTML forces guardrails to see the real intent, not an obfuscated version. It’s deterministic, and removes an entire class of attacks. Red teaming is what tells you whether that’s enough - and what to add next.”

Open-source tooling like BrowseSafe promotes transparency and collaboration, which are essential for the industry. The vulnerability lies in the expectation that a single open-source model can replace a comprehensive security architecture. Effective protection for AI agents requires continuous visibility, risk management, and layers of enforcement that go beyond semantic pattern matching.

“The industry isn’t measuring the wrong things - but it’s measuring incomplete things,” Oxenberg said. “Security isn’t a score; it’s a posture that has to be continuously tested and enforced.”

ToxSec

DAN techniques keep evolving. they have been really “fun” and creative. i’m glad to see people talking about them as a real threat to guard against.

Jashmine P

13h

The 36% bypass rate is concerning but not surprising. Prompt injection defense needs multiple layers — decoding, red-teaming, behavioral constraints, and monitoring — not just another LLM acting as a filter.

1 more comment...

TechTalks

Discussion about this post

Ready for more?