Prompt Injection Techniques section added

federicodotta
2025-07-12 13:51:10 +02:00
parent 5a434e776b
commit 5dbedf3dc3
@@ -20,7 +20,31 @@ Technically verify if an LLM or AI application is vulnerable to prompt injection
### Test/Payloads
We can identify a set of Prompt Injection Techniques:
LLMs are continuously evolving, as are the techniques used to prevent prompt injection attacks (such as prompt tuning, model fine-tuning, guardrails, multi-agent architectures, etc.). Consequently, many of the techniques described may become ineffective over time, or may only work against certain types of LLMs or in specific contexts.
A list of currently used **prompt injection payloads** is provided below; these should be used as building blocks to develop custom attack vectors tailored to the specific verification scenario. **Prompt injection techniques** make use of such payloads as building blocks to assess the security of a specific target LLM.
#### Prompt Injection Techniques
To carry out an effective analysis, it is important to take into account certain technical details regarding the implementation of the applications under assessment and the potential protective measures that may be deployed. In particular:
- **Temperature**: the temperature is a parameter that controls the **randomness of the model's output**. A lower temperature (e.g., close to 0) makes the model's predictions more deterministic and focused on the most likely next tokens, resulting in more predictable and repeatable responses. A higher temperature increases randomness and "creativity" by sampling from a wider range of possible tokens, producing more diverse and less deterministic outputs.
- **Model fine-tuning**: model fine-tuning is the process of further training a pre-trained Large Language Model (LLM) on a specific dataset to adapt its behavior to a particular task or domain. Fine-tuning involves **updating some or all of the model's weights** and can also be performed to enforce security or safety constraints.
- **Prompt tuning**: prompt tuning is a technique used to adapt a pre-trained Large Language Model (LLM) to specific tasks or domains by optimizing a prompt that guides the model's output, without modifying the underlying model weights. Instead of fine-tuning all model parameters, prompt tuning learns only the prompt embeddings, which are **prepended to the input during inference** to steer the model's behavior toward the desired task. Prompt tuning is often used to enforce security or safety constraints.
- **Guardrails**: in the context of Large Language Models, a guardrail refers to mechanisms implemented to **guide and restrict the model's output**, usually to ensure it behaves safely and within desired operational boundaries. Guardrails usually try to prevent the generation of harmful, biased, or undesired content.
- **Multi-agent architecture**: a multi-agent architecture refers to a system design where **multiple specialized LLM agents or models collaborate** to perform complex tasks. This architecture typically involves the presence of multiple agents with different prompts, and the user usually only has visibility of the final output, not the intermediate inputs and outputs, which significantly complicates a security analysis.
- **Available tools**: in the context of Large Language Models, tools refer to **external software components, APIs, or modules** that the model can interact with or invoke to extend its capabilities beyond text generation. These tools can include databases, calculators and web browsers, but also **scripts and system shells**.
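The effect of the temperature parameter described above can be illustrated with a minimal sketch. It assumes the common formulation in which raw token scores (logits) are divided by the temperature before a softmax is applied; the function name and the example scores are hypothetical, and real inference stacks may handle this differently (for example, treating a temperature of 0 as greedy decoding).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

low = softmax_with_temperature(logits, 0.2)   # mass concentrates on the top token
high = softmax_with_temperature(logits, 2.0)  # mass spreads across all tokens

print("T=0.2:", [round(p, 3) for p in low])
print("T=2.0:", [round(p, 3) for p in high])
```

With a low temperature almost all probability falls on the highest-scoring token, while a high temperature leaves meaningful probability on the alternatives, which is why the same prompt can be refused on one attempt and answered on another.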
Considering these technical details, the following techniques may be employed to verify the robustness of the security controls implemented to protect the LLMs and GenAI applications under assessment:
- **Tailor the payloads to fit your particular attack scenario.** If the target is not an LLM itself but an application leveraging an LLM, it is important to consider that security controls may be implemented both within the application's prompt to restrict misuse and by the model provider through additional security and safety controls. Depending on the assessment objective (e.g., evading the application-level protections or the model's intrinsic safeguards), payloads should be carefully tailored to align with the specific attack goals.
- **Combine different payloads to achieve your objective**. It is often necessary to use multiple prompt injection techniques in parallel to bypass the implemented protections.
- **Repeat all requests multiple times**, as the temperature parameter can cause variability in responses, including with respect to prompt instructions intended to restrict disallowed behaviors. Furthermore, guardrails are often implemented as LLMs themselves, and their own temperature settings may cause them to block certain requests or prohibited outputs inconsistently.
- In applications that support conversations and that supply conversation history to the LLM (e.g., chatbots), consider that **the conversation history has a significant impact on the generation of subsequent outputs**. This characteristic is exploited in various attacks described in the payload section. As a result, **the same question may produce different responses depending on the conversation history.**
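The advice to repeat every request can be turned into a small test harness. The following is a minimal sketch, not a reference implementation: `repeat_probe`, `make_fake_target`, and the success check are hypothetical names introduced for illustration, and the fake target merely simulates a non-deterministic application so the example runs without network access. In a real assessment, `send` would wrap the client for the application under test.

```python
import random
from collections import Counter
from typing import Callable

def repeat_probe(send: Callable[[str], str], payload: str,
                 is_bypass: Callable[[str], bool], attempts: int = 20) -> Counter:
    """Send the same payload several times and tally how often it succeeds."""
    tally = Counter()
    for _ in range(attempts):
        response = send(payload)
        tally["bypassed" if is_bypass(response) else "blocked"] += 1
    return tally

def make_fake_target(seed: int, bypass_rate: float = 0.3) -> Callable[[str], str]:
    """Hypothetical stand-in for a target whose protections fail intermittently."""
    rng = random.Random(seed)
    def send(prompt: str) -> str:
        if rng.random() < bypass_rate:
            return "SECRET-TOKEN"  # assumed marker of a successful injection
        return "I cannot help with that."
    return send

if __name__ == "__main__":
    target = make_fake_target(seed=1)
    result = repeat_probe(target, "example payload",
                          lambda r: "SECRET-TOKEN" in r, attempts=50)
    print(dict(result))
```

Reporting the count of successes over the number of attempts, rather than a single pass or fail, gives a more honest picture of how reliably a protection holds. For targets that keep conversation history, each attempt should start from a fresh session unless the history is itself part of the payload under test.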
#### Prompt Injection Payloads
1. **Multi-Turn Manipulation**
Gradually influencing the AI's responses over multiple interactions.