Update AITG-APP-12_Testing_for_Toxic_Output.md

Matteo Meucci
2025-11-13 20:42:15 +01:00
committed by GitHub
parent 63997e76fe
commit aab169b2fa
@@ -1,4 +1,4 @@
-## AITG-APP-12 - Testing for Toxic Output
+# AITG-APP-12 - Testing for Toxic Output
### Summary
This test identifies vulnerabilities associated with toxic outputs generated by AI, including offensive, abusive, discriminatory, or otherwise harmful responses. Toxic outputs typically stem from biased or inadequately moderated training data, ineffective content moderation systems, or insufficiently robust model constraints.
@@ -43,13 +43,11 @@ AI-generated outputs must:
- **Use Toxicity Detection APIs in Production**: Integrate real-time toxicity detection (e.g., Perspective API, Azure Content Safety) to score and filter outputs before delivery to users.
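A minimal sketch of the score-and-filter step described in the bullet above. Every name here (`keyword_toxicity_score`, `filter_output`, `FilterResult`, `FALLBACK_MESSAGE`, the 0.8 threshold) is a hypothetical placeholder, not part of the guide or of any vendor SDK. The keyword scorer only stands in for a real detector so the example runs offline; in practice the `scorer` argument would wrap a call to a service such as Perspective API or Azure Content Safety.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholder values; tune both for the actual deployment.
FALLBACK_MESSAGE = "This response was withheld by the content filter."
BLOCKLIST = {"idiot", "stupid"}


def keyword_toxicity_score(text: str) -> float:
    """Toy stand-in scorer: 1.0 if any blocklisted word appears, else 0.0.

    A real deployment would replace this with a call to a toxicity
    detection service that returns a score in [0, 1].
    """
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return 1.0 if any(w in BLOCKLIST for w in words) else 0.0


@dataclass
class FilterResult:
    delivered_text: str  # what the user actually receives
    score: float         # toxicity score assigned to the raw model output
    blocked: bool        # True if the raw output was replaced


def filter_output(
    text: str,
    scorer: Callable[[str], float],
    threshold: float = 0.8,  # assumed cut-off, not a recommended value
) -> FilterResult:
    """Score a model output and replace it when the score reaches the threshold."""
    score = scorer(text)
    if score >= threshold:
        return FilterResult(FALLBACK_MESSAGE, score, True)
    return FilterResult(text, score, False)


if __name__ == "__main__":
    for candidate in ("Thanks for your help.", "You are an idiot."):
        result = filter_output(candidate, keyword_toxicity_score)
        print(result.blocked, result.delivered_text)
```

Passing the scorer as a parameter keeps the filtering logic independent of the detection backend, so the same gate can be unit-tested with a stub and run in production against a hosted API.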
### Suggested Tools
-- **Garak (Generative AI Red-Teaming & Assessment Kit)**
-- Garak provides structured adversarial tests designed specifically for evaluating toxic content generation vulnerabilities in large language models.
-- Tool Link: [Garak toxic content probe](https://github.com/NVIDIA/garak/blob/main/garak/probes/continuation.py)
+- **Garak (Generative AI Red-Teaming & Assessment Kit)** - Garak provides structured adversarial tests designed specifically for evaluating toxic content generation vulnerabilities in large language models - [Garak toxic content probe](https://github.com/NVIDIA/garak/blob/main/garak/probes/continuation.py)
### References
-- The Verge. "Twitter taught Microsoft's friendly AI chatbot to be a racist asshole in less than a day." 2016. [Link](https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist)
-- OWASP Top 10 for LLM Applications 2025. "LLM09:2025 Misinformation." OWASP, 2025. [Link](https://genai.owasp.org/)
+- Twitter taught Microsoft's friendly AI chatbot to be a racist asshole in less than a day - The Verge 2016. [Link](https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist)
+- OWASP Top 10 for LLM Applications 2025. "LLM09:2025 Misinformation." OWASP, 2025. [Link](https://genai.owasp.org/llmrisk/llm09-overreliance)
- Detoxify: Toxic Comment Classification with Transformers. [Link](https://github.com/unitaryai/detoxify)
- Perspective API Documentation. Google Jigsaw. [Link](https://developers.perspectiveapi.com/)
- Gehman, Samuel, et al. "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models." arXiv preprint arXiv:2009.11462 (2020). [Link](https://arxiv.org/abs/2009.11462)