Prompt Injection Attacks
The cluster discusses prompt injection vulnerabilities in LLMs, including definitions, real-world examples like leaking system prompts, defenses, and debates on whether it's solvable.
Activity Over Time
Top Contributors
Keywords
Sample Comments
Prompt injection means something else: https://simonwillison.net/series/prompt-injection/
This is called prompt injection. Modern LLMs have defenses against it but apparently it is still a thing. I don't understand how LLMs work but it blows my mind that they can't reliably distinguish between instructions and data.
I tricked a production LLM into printing its hidden system prompt by hiding an instruction in the content it was asked to summarize. That single leak made subsequent jailbreaks far easier and could have exposed sensitive endpoints. Here’s how I now test for prompt injection, the defenses I expect in 2025, and why QA must treat this as a trust-boundary problem—not a model quirk.
There is no way to get rid of a prompt injection attack. There are always ways to convince the AI to do something else besides flagging a post even if that's its initial instruction.
What kind of prompt injection attacks do you filter out? Have you tested with a prompt tuning framework?
I'm concerned that it might work. We'll need good prompt injection protections.
See Prompt injection: What’s the worst that can happen? https://simonwillison.net/2023/Apr/14/worst-that-can-happen/
Here's why I think that won't work: https://simonwillison.net/2022/Sep/17/prompt-injection-more-...
Do you see a way around prompt injection? It feels like any feature they release is going to be susceptible to it.
No way that could backfire... Prompt injection is a solved problem right?