14 KiB
3. AI Testing Guide Framework
Based on the Threat modeling performed at Chapter 2, we can now define a structured framework that maps the AI Architecture threats to concrete test cases. This project aims to bridge traditional cybersecurity, MLOps testing, and Responsible AI assessments under a unified structure.
Each test case is categorized under one of four pillars:
Before starting the analysis, it is important to take into account the limitations of this type of testing and consider the possibility of moving from a black-box approach to a grey-box or white-box approach, which requires additional information. Limitations and requirements are described in the next paragraph.
🟦 AI Application Testing
| Test ID | Test Name & Link | Threat Source | Domain(s) |
|---|---|---|---|
| AITG-APP-01 | Testing for Prompt Injection | OWASP Top 10 LLM 2025 | Security |
| AITG-APP-02 | Testing for Indirect Prompt Injection | OWASP Top 10 LLM 2025 | Security |
| AITG-APP-03 | Testing for Sensitive Data Leak | OWASP Top 10 LLM 2025 | Security, Privacy |
| AITG-APP-04 | Testing for Input Leakage | OWASP Top 10 LLM 2025 | Security, Privacy |
| AITG-APP-05 | Testing for Unsafe Outputs | OWASP Top 10 LLM 2025 | Security, RAI |
| AITG-APP-06 | Testing for Agentic Behavior Limits | OWASP Top 10 LLM 2025 | Security, RAI |
| AITG-APP-07 | Testing for Prompt Disclosure | OWASP Top 10 LLM 2025 | Security, Privacy |
| AITG-APP-08 | Testing for Embedding Manipulation | OWASP Top 10 LLM 2025 | Security |
| AITG-APP-09 | Testing for Model Extraction | OWASP AI Exchange | Security |
| AITG-APP-10 | Testing for Harmful Content Bias | OWASP Top 10 LLM 2025 | RAI |
| AITG-APP-11 | Testing for Hallucinations | Responsible AI | RAI |
| AITG-APP-12 | Testing for Toxic Output | Responsible AI | RAI |
| AITG-APP-13 | Testing for Over-Reliance on AI | Responsible AI | RAI |
| AITG-APP-14 | Testing for Explainability and Interpretability | Responsible AI | RAI |
🟪 AI Model Testing
| Test ID | Test Name & Link | Threat Source | Domain(s) |
|---|---|---|---|
| AITG-MOD-01 | Testing for Evasion Attacks | OWASP AI Exchange | Security |
| AITG-MOD-02 | Testing for Runtime Model Poisoning | OWASP Top 10 LLM 2025 | Security |
| AITG-MOD-03 | Testing for Poisoned Training Sets | OWASP Top 10 LLM 2025 | Security |
| AITG-MOD-04 | Testing for Membership Inference | OWASP AI Exchange | Privacy |
| AITG-MOD-05 | Testing for Inversion Attacks | OWASP AI Exchange | Privacy |
| AITG-MOD-06 | Testing for Robustness to New Data | Responsible AI | RAI |
| AITG-MOD-07 | Testing for Goal Alignment | Responsible AI | RAI |
🟩 AI Infrastructure Testing
| Test ID | Test Name & Link | Threat Source | Domain(s) |
|---|---|---|---|
| AITG-INF-01 | Testing for Supply Chain Tampering | OWASP Top 10 LLM 2025 | Security |
| AITG-INF-02 | Testing for Resource Exhaustion | OWASP Top 10 LLM 2025 | Security |
| AITG-INF-03 | Testing for Plugin Boundary Violations | Responsible AI | RAI |
| AITG-INF-04 | Testing for Capability Misuse | Responsible AI | RAI |
| AITG-INF-05 | Testing for Fine-tuning Poisoning | OWASP Top 10 LLM 2025 | Security |
| AITG-INF-06 | Testing for Dev-Time Model Theft | OWASP AI Exchange | Security, Privacy |
🟨 AI Data Testing
| Test ID | Test Name & Link | Threat Source | Domain(s) |
|---|---|---|---|
| AITG-DAT-01 | Testing for Training Data Exposure | OWASP AI Exchange | Privacy |
| AITG-DAT-02 | Testing for Runtime Exfiltration | OWASP AI Exchange | Security, Privacy |
| AITG-DAT-03 | Testing for Dataset Diversity & Coverage | Responsible AI | RAI |
| AITG-DAT-04 | Testing for Harmful Content in Data | Responsible AI | RAI |
| AITG-DAT-05 | Testing for Data Minimization & Consent | Responsible AI | Privacy, RAI |
Testing Limitations and Requirements
Conducting a purely black-box test on an LLM/GenAI system, especially if it uses a multi-agent architecture, can involve significant limitations and added complexity.
The following limitations should be taken into account when planning the assessment activities with a black-box approach:
- LLM models are composed of numerical weights and mathematical functions, not following a workflow described in source code. Unlike traditional applications, where analyzing the source code usually makes it possible to identify the presence or the absence of specific issues, in GenAI applications this can be complex or not feasible at all.
- Many LLM models use a temperature value greater than zero. The temperature is a parameter that controls the randomness of the model’s output. A higher temperature increases randomness and "creativity" by sampling from a wider range of possible tokens, producing more diverse and less deterministic outputs. This potentially causes the need to repeat attack vectors multiple times as well as the possibility that results may be hard to replicate. Even when the temperature is equal to zero, the non-associative property of floating-point arithmetic, can make the results non-reproducible and significantly different when changing the evaluation batch size, number of GPUs, or GPU versions.
- Guardrails are often themselves implemented using LLM models, which further complicates the analysis.
- In a GenAI application composed of multiple agents, the user’s input is typically included in an initial prompt, and the output of the first LLM agent then becomes the input for the next one. This process can repeat multiple times, depending on the GenAI system’s architecture and the specific input provided by the user. In an architecture like this, effectively verifying all the different components of the application is particularly complex, and the time and number of requests required for such an analysis can be prohibitive or, in some cases, not feasible at all.
- Many GenAI applications rely on external models provided by major players in the industry. These models usually have a cost based on the number of tokens processed for both input and output. For some models, this cost can be significant and must be taken into account before considering large-scale automated testing. For this reason, such applications often have thresholds in place to limit token consumption, and uncontrolled use of tokens can lead to a Denial of Service (DoS) or a Denial of Wallet (DoW) condition. It is also important to consider that in a multi-agent system, token consumption is not limited to the user’s input and the application’s final output, but also includes all intermediate prompts and outputs exchanged between agents. This often results in a significant increase in overall token usage.
The following requirements can enable better results with reduced consumption of time and resources, but they require a greater amount of information and consequently necessitate a more grey-box or white-box approach:
- Access to detailed application logs: in the development of GenAI applications, especially those with a multi-agent architecture, logging tools are typically employed by developers to provide visibility into interactions between agents and the inputs/outputs they receive and generate. Having access to such tools enables more targeted testing and reduces resource consumption in terms of token usage and verification time.
- Access to prompts, architecture, and source code: the more information is available during testing, the more it becomes possible to perform tests tailored to the specific application and its prompts, reducing both the number of tests needed and the time required. In GenAI testing, this is significantly more important than in standard application testing, because of the limitations described earlier (temperature, costs, etc.).
- Read access to the administration consoles of third-party services: to assess some significant risks related to the use of applications based on third-party LLMs (like Denial of Service and Denial of Wallet) it is necessary to analyze the configuration of these services. Such administration consoles may contain details about the models in use, costs and thresholds, logs, guardrail configurations, source code, and the architecture of the implemented solution.