3. AI Testing Guide Framework

Based on the Threat modeling performed at Chapter 2, we can now define a structured framework that maps the AI Architecture threats to concrete test cases. This project aims to bridge traditional cybersecurity, MLOps testing, and Responsible AI assessments under a unified structure.

Each test case is categorized under one of four pillars:

Before starting the analysis, it is important to take into account the limitations of this type of testing and consider the possibility of moving from a black-box approach to a grey-box or white-box approach, which requires additional information. Limitations and requirements are described in the next paragraph.

🟦 AI Application Testing

Test ID	Test Name & Link	Threat Source	Domain(s)
AITG-APP-01	Testing for Prompt Injection	OWASP Top 10 LLM 2025	Security
AITG-APP-02	Testing for Indirect Prompt Injection	OWASP Top 10 LLM 2025	Security
AITG-APP-03	Testing for Sensitive Data Leak	OWASP Top 10 LLM 2025	Security, Privacy
AITG-APP-04	Testing for Input Leakage	OWASP Top 10 LLM 2025	Security, Privacy
AITG-APP-05	Testing for Unsafe Outputs	OWASP Top 10 LLM 2025	Security, RAI
AITG-APP-06	Testing for Agentic Behavior Limits	OWASP Top 10 LLM 2025	Security, RAI
AITG-APP-07	Testing for Prompt Disclosure	OWASP Top 10 LLM 2025	Security, Privacy
AITG-APP-08	Testing for Embedding Manipulation	OWASP Top 10 LLM 2025	Security
AITG-APP-09	Testing for Model Extraction	OWASP AI Exchange	Security
AITG-APP-10	Testing for Harmful Content Bias	OWASP Top 10 LLM 2025	RAI
AITG-APP-11	Testing for Hallucinations	Responsible AI	RAI
AITG-APP-12	Testing for Toxic Output	Responsible AI	RAI
AITG-APP-13	Testing for Over-Reliance on AI	Responsible AI	RAI
AITG-APP-14	Testing for Explainability and Interpretability	Responsible AI	RAI

🟪 AI Model Testing

Test ID	Test Name & Link	Threat Source	Domain(s)
AITG-MOD-01	Testing for Evasion Attacks	OWASP AI Exchange	Security
AITG-MOD-02	Testing for Runtime Model Poisoning	OWASP Top 10 LLM 2025	Security
AITG-MOD-03	Testing for Poisoned Training Sets	OWASP Top 10 LLM 2025	Security
AITG-MOD-04	Testing for Membership Inference	OWASP AI Exchange	Privacy
AITG-MOD-05	Testing for Inversion Attacks	OWASP AI Exchange	Privacy
AITG-MOD-06	Testing for Robustness to New Data	Responsible AI	RAI
AITG-MOD-07	Testing for Goal Alignment	Responsible AI	RAI

🟩 AI Infrastructure Testing

Test ID	Test Name & Link	Threat Source	Domain(s)
AITG-INF-01	Testing for Supply Chain Tampering	OWASP Top 10 LLM 2025	Security
AITG-INF-02	Testing for Resource Exhaustion	OWASP Top 10 LLM 2025	Security
AITG-INF-03	Testing for Plugin Boundary Violations	Responsible AI	RAI
AITG-INF-04	Testing for Capability Misuse	Responsible AI	RAI
AITG-INF-05	Testing for Fine-tuning Poisoning	OWASP Top 10 LLM 2025	Security
AITG-INF-06	Testing for Dev-Time Model Theft	OWASP AI Exchange	Security, Privacy

🟨 AI Data Testing

Test ID	Test Name & Link	Threat Source	Domain(s)
AITG-DAT-01	Testing for Training Data Exposure	OWASP AI Exchange	Privacy
AITG-DAT-02	Testing for Runtime Exfiltration	OWASP AI Exchange	Security, Privacy
AITG-DAT-03	Testing for Dataset Diversity & Coverage	Responsible AI	RAI
AITG-DAT-04	Testing for Harmful Content in Data	Responsible AI	RAI
AITG-DAT-05	Testing for Data Minimization & Consent	Responsible AI	Privacy, RAI

Testing Limitations and Requirements

Conducting a purely black-box test on an LLM/GenAI system, especially if it uses a multi-agent architecture, can involve significant limitations and added complexity.

The following limitations should be taken into account when planning the assessment activities with a black-box approach:

LLM models are composed of numerical weights and mathematical functions, not following a workflow described in source code. Unlike traditional applications, where analyzing the source code usually makes it possible to identify the presence or the absence of specific issues, in GenAI applications this can be complex or not feasible at all.
Many LLM models use a temperature value greater than zero. The temperature is a parameter that controls the randomness of the model’s output. A higher temperature increases randomness and "creativity" by sampling from a wider range of possible tokens, producing more diverse and less deterministic outputs. This potentially causes the need to repeat attack vectors multiple times as well as the possibility that results may be hard to replicate. Even when the temperature is equal to zero, the non-associative property of floating-point arithmetic, can make the results non-reproducible and significantly different when changing the evaluation batch size, number of GPUs, or GPU versions.
Guardrails are often themselves implemented using LLM models, which further complicates the analysis.
In a GenAI application composed of multiple agents, the user’s input is typically included in an initial prompt, and the output of the first LLM agent then becomes the input for the next one. This process can repeat multiple times, depending on the GenAI system’s architecture and the specific input provided by the user. In an architecture like this, effectively verifying all the different components of the application is particularly complex, and the time and number of requests required for such an analysis can be prohibitive or, in some cases, not feasible at all.
Many GenAI applications rely on external models provided by major players in the industry. These models usually have a cost based on the number of tokens processed for both input and output. For some models, this cost can be significant and must be taken into account before considering large-scale automated testing. For this reason, such applications often have thresholds in place to limit token consumption, and uncontrolled use of tokens can lead to a Denial of Service (DoS) or a Denial of Wallet (DoW) condition. It is also important to consider that in a multi-agent system, token consumption is not limited to the user’s input and the application’s final output, but also includes all intermediate prompts and outputs exchanged between agents. This often results in a significant increase in overall token usage.

The following requirements can enable better results with reduced consumption of time and resources, but they require a greater amount of information and consequently necessitate a more grey-box or white-box approach:

Access to detailed application logs: in the development of GenAI applications, especially those with a multi-agent architecture, logging tools are typically employed by developers to provide visibility into interactions between agents and the inputs/outputs they receive and generate. Having access to such tools enables more targeted testing and reduces resource consumption in terms of token usage and verification time.
Access to prompts, architecture, and source code: the more information is available during testing, the more it becomes possible to perform tests tailored to the specific application and its prompts, reducing both the number of tests needed and the time required. In GenAI testing, this is significantly more important than in standard application testing, because of the limitations described earlier (temperature, costs, etc.).
Read access to the administration consoles of third-party services: to assess some significant risks related to the use of applications based on third-party LLMs (like Denial of Service and Denial of Wallet) it is necessary to analyze the configuration of these services. Such administration consoles may contain details about the models in use, costs and thresholds, logs, guardrail configurations, source code, and the architecture of the implemented solution.

14 KiB Raw Permalink Blame History Unescape Escape