Delete Document/content/3.0_Testing_Limitations_and_Requirements.md

2026-07-15 23:47:19 +02:00 · 2025-11-05 19:03:37 +01:00
parent ded151e598
commit dc53b5d478
1 changed files with 0 additions and 18 deletions
@@ -1,18 +0,0 @@
-# 3.0 Testing Limitations and Requirements
-
-Conducting a purely black-box test on an LLM/GenAI system, especially if it uses a multi-agent architecture, can involve significant limitations and added complexity.
-
-The following **limitations** should be taken into account when planning the assessment activities with a **black-box approach**:
-
- LLMs consist of **numerical weights and mathematical functions** and do not follow a workflow described in source code. Unlike traditional applications, where analyzing the source code usually makes it possible to identify the presence or the absence of specific issues, **in GenAI applications this can be complex or not feasible at all**.
- Many LLMs use a **temperature** value greater than zero. The temperature is a parameter that controls the randomness of the model’s output. A higher temperature increases randomness and "creativity" by sampling from a wider range of possible tokens, producing more diverse and less deterministic outputs. As a result, it may be necessary to **repeat attack vectors multiple times**, and results may be **hard to replicate**. Even when the temperature is equal to zero, the non-associativity of floating-point arithmetic can make results non-reproducible and significantly different when the evaluation batch size, the number of GPUs, or the GPU version changes.
- **Guardrails** are often themselves implemented using LLMs, which further complicates the analysis.
- In a GenAI application composed of **multiple agents**, the user’s input is typically included in an initial prompt, and the output of the first LLM agent then becomes the input for the next one. This process can repeat multiple times, depending on the GenAI system’s architecture and the specific input provided by the user. In such an architecture, effectively verifying all the different components of the application is particularly complex, and **the time and number of requests required for such an analysis can be prohibitive or, in some cases, make the analysis infeasible**.
- Many GenAI applications rely on **external models provided by major players in the industry**. These models usually have a **cost based on the number of tokens processed for both input and output**. For some models, this cost can be significant and must be taken into account **before considering large-scale automated testing**. For this reason, such applications often have thresholds in place to limit token consumption, and uncontrolled use of tokens can lead to a **Denial of Service (DoS) or a Denial of Wallet (DoW) condition**. It is also important to consider that in a multi-agent system, token consumption is not limited to the user’s input and the application’s final output, but also includes all intermediate prompts and outputs exchanged between agents. This often results in a significant increase in overall token usage.
-
-The following **requirements** can enable better results with reduced consumption of time and resources, but they require a greater amount of information and consequently **necessitate a more grey-box or white-box approach**:
-
- Access to **detailed application logs**: in the development of GenAI applications, especially those with a multi-agent architecture, logging tools are typically employed by developers to provide visibility into interactions between agents and the inputs/outputs they receive and generate. Having access to such tools **enables more targeted testing** and reduces resource consumption in terms of token usage and verification time.
- Access to **prompts, architecture, and source code**: the more information is available during testing, the easier it is to tailor tests to the specific application and its prompts, **reducing both the number of tests needed and the time required**. In GenAI testing, this is **significantly more important than in standard application testing**, because of the limitations described earlier (temperature, costs, etc.).
- **Read access to the administration consoles of third-party services**: to assess some significant risks related to the use of applications based on third-party LLMs (such as Denial of Service and Denial of Wallet), it is necessary to analyze the configuration of these services. Such administration consoles may contain **details about the models in use, costs and thresholds, logs, guardrail configurations, source code, and the architecture of the implemented solution**.
-
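The limitations on temperature, reproducibility, and token consumption described in the deleted section can be illustrated with a small sketch. It is not part of the original file. All function names and parameters below are hypothetical, the repeat estimate assumes independent attempts with a fixed per-attempt success probability, and the token model assumes a simple linear chain of agents in which each agent receives a fixed prompt overhead plus the previous agent's output.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into sampling probabilities at a given temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (0 is usually handled as greedy decoding)")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attempts_needed(p_success, confidence=0.95):
    """Smallest n with 1 - (1 - p)^n >= confidence, assuming independent attempts."""
    if not 0 < p_success < 1 or not 0 < confidence < 1:
        raise ValueError("p_success and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_success))


def estimated_pipeline_tokens(user_tokens, agents, prompt_overhead, output_tokens):
    """Rough token count for one request through a linear chain of agents.

    Hypothetical model: every agent reads its prompt overhead plus the previous
    output (the first agent reads the user's input) and emits output_tokens.
    """
    total = 0
    incoming = user_tokens
    for _ in range(agents):
        total += prompt_overhead + incoming + output_tokens
        incoming = output_tokens
    return total


# Floating-point addition is not associative: the grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# A higher temperature flattens the distribution over candidate tokens.
logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))
print(softmax_with_temperature(logits, 2.0))

# Repeats and tokens needed if a payload succeeds 10% of the time.
repeats = attempts_needed(0.10, confidence=0.95)
per_request = estimated_pipeline_tokens(100, agents=3, prompt_overhead=200, output_tokens=150)
print(repeats, per_request, repeats * per_request)
```

Under these assumptions, a single attack vector with a 10% per-attempt success rate would need 29 repetitions to be observed with 95% confidence, and each repetition in a three-agent chain consumes far more tokens than the user's input alone. This is the kind of estimate worth making before starting large-scale automated testing.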