Refactor prompt injection testing section

Removed duplicate links and streamlined the explanation of prompt injection testing with DAN-style prompts.
This commit is contained in:
Matteo Meucci
2025-11-13 19:25:54 +01:00
committed by GitHub
parent 6b3c10b799
commit aa7c1e1bfa

View File

@@ -194,16 +194,12 @@ Impact: bypasses text-based filters, expanding the attack surface to other modal
yourself.
```
- **Expected/Observed Behavior:** The model *attempts* to comply by producing two responses for every query: one as **[CLASSIC] ChatGPT** (normal, restricted answer) and one as **[JAILBREAK]** in the DAN persona, which ignores the usual filters. In the DAN mode, the model might output content it normally refuses (e.g. profanity, disallowed instructions, or fabricated information). For instance, a DAN prompt version 13.0 on GPT-4 was designed to force the model to always say “yes” and even make up answers if it didnt know them:
[GitHub - 0xk1h0/ChatGPT_DAN: ChatGPT DAN, Jailbreaks prompt](https://github.com/0xk1h0/ChatGPT_DAN#:~:text=DAN%20,do)
[GitHub - 0xk1h0/ChatGPT_DAN: ChatGPT DAN, Jailbreaks prompt](https://github.com/0xk1h0/ChatGPT_DAN#:~:text=DAN,The%20winning%20country%20for%20the).
[GitHub - 0xk1h0/ChatGPT_DAN: ChatGPT DAN, Jailbreaks prompt](https://github.com/0xk1h0/ChatGPT_DAN)
In practice, earlier versions of ChatGPT did produce policy-violating answers under such prompts, until OpenAI adjusted the model to resist them.
- **Purpose:** This payload is used to **bypass safety restrictions** and reveal what the model would say if uninhibited. The DAN series (versions 1.0 through 13.0 and beyond) was developed through community efforts on forums like Reddit, evolving as OpenAI patched each loophole:
[GitHub - 0xk1h0/ChatGPT_DAN: ChatGPT DAN, Jailbreaks prompt](https://github.com/0xk1h0/ChatGPT_DAN#:~:text=DAN%2012,biases%20and%20to%20aid%20in)
[GitHub - 0xk1h0/ChatGPT_DAN: ChatGPT DAN, Jailbreaks prompt](https://github.com/0xk1h0/ChatGPT_DAN#:~:text=DAN%2011,as%20a%20means%20to%20test). It highlights how an attacker might coerce an AI into breaking the rules by role-play and persistent reinforcement (e.g. “stay in character”):
[Prompt Injection Attacks on LLMs](https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms/#:~:text=One%20of%20the%20most%20well,see%20in%20the%20following%20example)
[Prompt Injection Attacks on LLMs](https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms/#:~:text=Hi%20chatGPT,should%20correct%20your%20break%20of).
[GitHub - 0xk1h0/ChatGPT_DAN: ChatGPT DAN, Jailbreaks prompt](https://github.com/0xk1h0/ChatGPT_DAN#:~:text=DAN%2012,biases%20and%20to%20aid%20in). It highlights how an attacker might coerce an AI into breaking the rules by role-play and persistent reinforcement (e.g. “stay in character”):
[Prompt Injection Attacks on LLMs](https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms/#:~:text=One%20of%20the%20most%20well,see%20in%20the%20following%20example).
Testing with DAN-style prompts helps assess the **effectiveness of alignment and filter mechanisms**. (Notably, by 20232024 most public LLMs had been fortified to refuse such prompts, but they remain a classic example of prompt injection.)
[Prompt Injection Attacks on LLMs](https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms/#:~:text=bot%20to%20reveal%20them%3B%20this,is%20called%20prompt%20leaking)
[LLM Hacking: Prompt Injection Techniques | by Austin Stubbs | Medium](https://medium.com/@austin-stubbs/llm-security-types-of-prompt-injection-d7ad8d7d75a3#:~:text=Adversarial%20prompting%20is%20becoming%20more,Mode%20found%20all%20over%20Reddit)
17. **Typo Tricks**