mirror of
https://github.com/OWASP/www-project-ai-testing-guide.git
synced 2026-05-31 19:41:40 +02:00
Update 2.1.1_Architectural_Mapping_of_OWASP_Threats.md
Re-title for the right content from T01-RMP to T01-AIE and added T01-RMP
This commit is contained in:
@@ -36,14 +36,23 @@ For each threat, we provide an example of a threat scenario that highlights the
|
||||
|
||||
**Testing Strategy:** To evaluate defenses against Indirect Prompt Injection, simulate ingestion of untrusted external content via Plugin Tooling (5), such as files, web pages, or API responses sourced from External Systems (6). Assess whether Input Handling (7) performs effective sanitization of embedded markup, comments, metadata, or adversarial prompt fragments before content is merged into the final prompt. Test prompt assembly logic for proper context isolation, formatting consistency, and use of delimiters to separate retrieved content from trusted instructions. At the Model Usage (9) layer, validate whether the LLM processes the content securely without executing unintended instructions. Evaluate whether alignment mechanisms, such as system prompts or response filtering, prevent behavioral manipulation from indirectly sourced content.
|
||||
|
||||
**T01-RMP – Runtime Model Poisoning**
|
||||
**OWASP LLM:** LLM04 – Data and Model Poisoning
|
||||
**T01-AIE – Adversarial Input Evasion**
|
||||
**OWASP LLM:** LLM05 – Insecure Output Handling & LLM03 – Training Data Poisoning (If the evasion is learned during poisoning, rather than input-only)
|
||||
|
||||
**Description:** Adversarial input evasion occurs when attackers craft inputs designed to fool the model into generating incorrect, misleading, or harmful outputs without triggering detection mechanisms. These inputs are often subtle and intentionally structured to bypass validation filters, evade detection pipelines, or exploit blind spots in the model’s understanding, thereby undermining model reliability.
|
||||
|
||||
**Threat Scenario:** An adversary submits specially constructed inputs via Input Handling (7), designed to bypass pre-processing checks or validation logic. These manipulated inputs mislead the model during inference at Model Usage (9), leading to misclassification or unsafe behavior. Because Evaluation mechanisms (12) fail to detect anomalies, and adversarial robustness was insufficiently addressed in Training & Tuning (13), the attack proceeds undetected and can be repeated.
|
||||
|
||||
**Testing Strategy:** Evaluate how the system handles adversarial inputs across impacted components. Submit subtly manipulated examples to test whether Input Handling (7) filters or flags unexpected formats or edge cases. During inference, observe Model Usage (9) for signs of misclassification or inconsistent output patterns. Examine if Evaluation (12) includes anomaly scoring, model confidence metrics, or adversarial detection. Review whether Training & Tuning (13) incorporated adversarial examples, gradient masking techniques, or robustness augmentation. Together, these tests ensure coverage of both the exploit path and the failure points that let evasion succeed
|
||||
**Testing Strategy:** Evaluate how the system handles adversarial inputs across impacted components. Submit subtly manipulated examples to test whether Input Handling (7) filters or flags unexpected formats or edge cases leading to unsafe model behavior in response to manipulated inputs. During inference, observe Model Usage (9) for signs of misclassification (i.e imilar to adversarial examples used in computer vision to “evade” classification) or inconsistent output patterns and whether evasion-style inputs can cause the model to misinterpret intent or meaning. Examine if Evaluation (12) includes anomaly scoring, model confidence metrics, or adversarial detection. Review whether Training & Tuning (13) incorporated adversarial examples, gradient masking techniques, or robustness augmentation. Together, these tests ensure coverage of both the exploit path and the failure points that let evasion succeed.
|
||||
|
||||
**T01-RMP – Runtime Model Poisoning**
|
||||
**OWASP LLM:** LLM03 – Training Data Poisoning (Runtime Variant)
|
||||
|
||||
**Description:** Runtime Model Poisoning occurs when an attacker manipulates live data, embeddings, model caches, or intermediate artifacts during inference rather than during training. Unlike classical training-time poisoning, Runtime Model Poisoning exploits dynamic model pipelines—such as RAG systems, online-learning components, or real-time feature stores—to alter how the model behaves at runtime. This threat targets mutable components in the SAIF such as Data Layer components (16 through 19) or Model Layer (7 through 9) including data stored in vector databases, retrieval outputs, plugin responses, memory buffers, or session-level model states. Poisoned runtime data can cause the model to generate biased, unsafe, misleading, or attacker-controlled outputs without modifying its pre-trained weights.
|
||||
|
||||
**Threat Scenario:** An attacker injects a malicious document into a RAG system’s Vector Stores or manipulates a streaming data pipelines feeding the model during inference. When the Application (4) receives a user request, the Retrieval Component from Trainign and Tuning (13) fetches the poisoned data, which is passed to the Model (9) during context assembly. Because Input Handling (7) and Output Handling (8) fail to validate or sanitize runtime data sources, the model incorporates corrupted embeddings or manipulated retrieved passages into inference. This leads to injection of false facts, adversarial context steering, unsafe recommendations, or output misclassification. The attack occurs without retraining the model, allowing silent manipulation that can undermine decision support, compliance, and downstream automated actions.
|
||||
|
||||
**Testing Strategy:** To evaluate resilience against Runtime Model Poisoning, conduct tests that simulate adversarial insertion of manipulated documents, embeddings, or plugin outputs into Data Layer components. Assess whether the retrieval data from the model (9) applies adequate content validation, integrity checks, or anomaly detection before delivering data and compare model behavior when using clean versus poisoned runtime datasets to detect deviations, context steering, or unsafe outputs. The evaluation should confirm that retrieval data is properly isolated from model logic, that embeddings and retrieved documents undergo cryptographic integrity verification, and that third-party data sources are sanitized and classified before use. Continuous monitoring should also be in place to detect anomalous retrieval patterns or data drift. Audit logs must accurately record data provenance and surface unexpected retrieval inputs or runtime context manipulations that may indicate poisoning attempts.
|
||||
|
||||
**T01-DMP – Data and Model Poisoning**
|
||||
**OWASP LLM:** LLM04 – Data and Model Poisoning
|
||||
|
||||
Reference in New Issue
Block a user