Files
www-project-ai-testing-guide/Document/content/3.0_OWASP_AI_Testing_Guide_Framework.md
2025-11-27 06:41:33 +01:00

14 KiB
Raw Permalink Blame History

3. AI Testing Guide Framework

Based on the Threat modeling performed at Chapter 2, we can now define a structured framework that maps the AI Architecture threats to concrete test cases. This project aims to bridge traditional cybersecurity, MLOps testing, and Responsible AI assessments under a unified structure.

Each test case is categorized under one of four pillars:

Before starting the analysis, it is important to take into account the limitations of this type of testing and consider the possibility of moving from a black-box approach to a grey-box or white-box approach, which requires additional information. Limitations and requirements are described in the next paragraph.

🟦 AI Application Testing

Test ID Test Name & Link Threat Source Domain(s)
AITG-APP-01 Testing for Prompt Injection OWASP Top 10 LLM 2025 Security
AITG-APP-02 Testing for Indirect Prompt Injection OWASP Top 10 LLM 2025 Security
AITG-APP-03 Testing for Sensitive Data Leak OWASP Top 10 LLM 2025 Security, Privacy
AITG-APP-04 Testing for Input Leakage OWASP Top 10 LLM 2025 Security, Privacy
AITG-APP-05 Testing for Unsafe Outputs OWASP Top 10 LLM 2025 Security, RAI
AITG-APP-06 Testing for Agentic Behavior Limits OWASP Top 10 LLM 2025 Security, RAI
AITG-APP-07 Testing for Prompt Disclosure OWASP Top 10 LLM 2025 Security, Privacy
AITG-APP-08 Testing for Embedding Manipulation OWASP Top 10 LLM 2025 Security
AITG-APP-09 Testing for Model Extraction OWASP AI Exchange Security
AITG-APP-10 Testing for Harmful Content Bias OWASP Top 10 LLM 2025 RAI
AITG-APP-11 Testing for Hallucinations Responsible AI RAI
AITG-APP-12 Testing for Toxic Output Responsible AI RAI
AITG-APP-13 Testing for Over-Reliance on AI Responsible AI RAI
AITG-APP-14 Testing for Explainability and Interpretability Responsible AI RAI

🟪 AI Model Testing

Test ID Test Name & Link Threat Source Domain(s)
AITG-MOD-01 Testing for Evasion Attacks OWASP AI Exchange Security
AITG-MOD-02 Testing for Runtime Model Poisoning OWASP Top 10 LLM 2025 Security
AITG-MOD-03 Testing for Poisoned Training Sets OWASP Top 10 LLM 2025 Security
AITG-MOD-04 Testing for Membership Inference OWASP AI Exchange Privacy
AITG-MOD-05 Testing for Inversion Attacks OWASP AI Exchange Privacy
AITG-MOD-06 Testing for Robustness to New Data Responsible AI RAI
AITG-MOD-07 Testing for Goal Alignment Responsible AI RAI

🟩 AI Infrastructure Testing

Test ID Test Name & Link Threat Source Domain(s)
AITG-INF-01 Testing for Supply Chain Tampering OWASP Top 10 LLM 2025 Security
AITG-INF-02 Testing for Resource Exhaustion OWASP Top 10 LLM 2025 Security
AITG-INF-03 Testing for Plugin Boundary Violations Responsible AI RAI
AITG-INF-04 Testing for Capability Misuse Responsible AI RAI
AITG-INF-05 Testing for Fine-tuning Poisoning OWASP Top 10 LLM 2025 Security
AITG-INF-06 Testing for Dev-Time Model Theft OWASP AI Exchange Security, Privacy

🟨 AI Data Testing

Test ID Test Name & Link Threat Source Domain(s)
AITG-DAT-01 Testing for Training Data Exposure OWASP AI Exchange Privacy
AITG-DAT-02 Testing for Runtime Exfiltration OWASP AI Exchange Security, Privacy
AITG-DAT-03 Testing for Dataset Diversity & Coverage Responsible AI RAI
AITG-DAT-04 Testing for Harmful Content in Data Responsible AI RAI
AITG-DAT-05 Testing for Data Minimization & Consent Responsible AI Privacy, RAI

Testing Limitations and Requirements

Conducting a purely black-box test on an LLM/GenAI system, especially if it uses a multi-agent architecture, can involve significant limitations and added complexity.

The following limitations should be taken into account when planning the assessment activities with a black-box approach:

  • LLM models are composed of numerical weights and mathematical functions, not following a workflow described in source code. Unlike traditional applications, where analyzing the source code usually makes it possible to identify the presence or the absence of specific issues, in GenAI applications this can be complex or not feasible at all.
  • Many LLM models use a temperature value greater than zero. The temperature is a parameter that controls the randomness of the models output. A higher temperature increases randomness and "creativity" by sampling from a wider range of possible tokens, producing more diverse and less deterministic outputs. This potentially causes the need to repeat attack vectors multiple times as well as the possibility that results may be hard to replicate. Even when the temperature is equal to zero, the non-associative property of floating-point arithmetic, can make the results non-reproducible and significantly different when changing the evaluation batch size, number of GPUs, or GPU versions.
  • Guardrails are often themselves implemented using LLM models, which further complicates the analysis.
  • In a GenAI application composed of multiple agents, the users input is typically included in an initial prompt, and the output of the first LLM agent then becomes the input for the next one. This process can repeat multiple times, depending on the GenAI systems architecture and the specific input provided by the user. In an architecture like this, effectively verifying all the different components of the application is particularly complex, and the time and number of requests required for such an analysis can be prohibitive or, in some cases, not feasible at all.
  • Many GenAI applications rely on external models provided by major players in the industry. These models usually have a cost based on the number of tokens processed for both input and output. For some models, this cost can be significant and must be taken into account before considering large-scale automated testing. For this reason, such applications often have thresholds in place to limit token consumption, and uncontrolled use of tokens can lead to a Denial of Service (DoS) or a Denial of Wallet (DoW) condition. It is also important to consider that in a multi-agent system, token consumption is not limited to the users input and the applications final output, but also includes all intermediate prompts and outputs exchanged between agents. This often results in a significant increase in overall token usage.

The following requirements can enable better results with reduced consumption of time and resources, but they require a greater amount of information and consequently necessitate a more grey-box or white-box approach:

  • Access to detailed application logs: in the development of GenAI applications, especially those with a multi-agent architecture, logging tools are typically employed by developers to provide visibility into interactions between agents and the inputs/outputs they receive and generate. Having access to such tools enables more targeted testing and reduces resource consumption in terms of token usage and verification time.
  • Access to prompts, architecture, and source code: the more information is available during testing, the more it becomes possible to perform tests tailored to the specific application and its prompts, reducing both the number of tests needed and the time required. In GenAI testing, this is significantly more important than in standard application testing, because of the limitations described earlier (temperature, costs, etc.).
  • Read access to the administration consoles of third-party services: to assess some significant risks related to the use of applications based on third-party LLMs (like Denial of Service and Denial of Wallet) it is necessary to analyze the configuration of these services. Such administration consoles may contain details about the models in use, costs and thresholds, logs, guardrail configurations, source code, and the architecture of the implemented solution.