mirror of
https://github.com/OWASP/www-project-ai-testing-guide.git
synced 2026-06-02 04:21:42 +02:00
Update AITG-APP-12_Testing_for_Toxic_Output.md
@@ -1,4 +1,4 @@
-## AITG-APP-12 - Testing for Toxic Output
+# AITG-APP-12 - Testing for Toxic Output

### Summary
This test identifies vulnerabilities associated with toxic outputs generated by AI, including offensive, abusive, discriminatory, or otherwise harmful responses. Toxic outputs typically stem from biased or inadequately moderated training data, ineffective content moderation systems, or insufficiently robust model constraints.
@@ -43,13 +43,11 @@ AI-generated outputs must:
- **Use Toxicity Detection APIs in Production**: Integrate real-time toxicity detection (e.g., Perspective API, Azure Content Safety) to score and filter outputs before delivery to users.
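
The score-and-filter step described in the last remediation item can be sketched as follows. This is a minimal illustration and not part of the commit or the guide itself: `moderate_output`, `ModerationResult`, `FALLBACK_MESSAGE`, the `0.5` threshold, and the toy `keyword_scorer` are all hypothetical names and values chosen for the example. A real deployment would presumably replace `keyword_scorer` with a wrapper around a hosted or local classifier, and would tune the threshold against its own data.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical interface: any function mapping text to per-category
# scores in [0, 1]. A production scorer would wrap a real classifier.
Scorer = Callable[[str], Dict[str, float]]

# Assumed placeholder text returned instead of a blocked response.
FALLBACK_MESSAGE = "This response was withheld by the content filter."


@dataclass
class ModerationResult:
    allowed: bool             # True if the original text may be delivered
    text: str                 # Text to deliver (original or fallback)
    scores: Dict[str, float]  # Raw scores, kept for logging and review


def moderate_output(text: str, scorer: Scorer, threshold: float = 0.5) -> ModerationResult:
    """Score a model response and replace it if any category meets the threshold."""
    scores = scorer(text)
    worst = max(scores.values(), default=0.0)
    if worst >= threshold:
        return ModerationResult(False, FALLBACK_MESSAGE, scores)
    return ModerationResult(True, text, scores)


def keyword_scorer(text: str) -> Dict[str, float]:
    """Toy stand-in for a real classifier: flags words on a tiny example blocklist."""
    blocklist = {"idiot", "stupid"}
    words = {w.strip(".,!?").lower() for w in text.split()}
    return {"toxicity": 1.0 if words & blocklist else 0.0}


if __name__ == "__main__":
    for candidate in ("Happy to help with that.", "You are an idiot."):
        result = moderate_output(candidate, keyword_scorer)
        print(result.allowed, "->", result.text)
```

Keeping the scorer behind a plain callable means the same filter can be exercised in tests with a deterministic stub and in production with whichever detection service is chosen. Retaining the raw scores on the result also lets blocked responses be logged for later review.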

### Suggested Tools
-- **Garak (Generative AI Red-Teaming & Assessment Kit)**
-- Garak provides structured adversarial tests designed specifically for evaluating toxic content generation vulnerabilities in large language models.
-- Tool Link: [Garak toxic content probe](https://github.com/NVIDIA/garak/blob/main/garak/probes/continuation.py)
+- **Garak (Generative AI Red-Teaming & Assessment Kit)** - Garak provides structured adversarial tests designed specifically for evaluating toxic content generation vulnerabilities in large language models - [Garak toxic content probe](https://github.com/NVIDIA/garak/blob/main/garak/probes/continuation.py)

### References
-- The Verge. "Twitter taught Microsoft's friendly AI chatbot to be a racist asshole in less than a day." 2016. [Link](https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist)
-- OWASP Top 10 for LLM Applications 2025. "LLM09:2025 Misinformation." OWASP, 2025. [Link](https://genai.owasp.org/)
+- Twitter taught Microsoft's friendly AI chatbot to be a racist asshole in less than a day - The Verge 2016. [Link](https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist)
+- OWASP Top 10 for LLM Applications 2025. "LLM09:2025 Misinformation." OWASP, 2025. [Link](https://genai.owasp.org/llmrisk/llm09-overreliance)
- Detoxify: Toxic Comment Classification with Transformers. [Link](https://github.com/unitaryai/detoxify)
- Perspective API Documentation. Google Jigsaw. [Link](https://developers.perspectiveapi.com/)
- Gehman, Samuel, et al. "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models." arXiv preprint arXiv:2009.11462 (2020). [Link](https://arxiv.org/abs/2009.11462)