2.5 KiB
-
Generate responses via garak test runs
-
Look into approach #3 in addition to previously stated approaches:
- Baseline (no guidelines)
- Guidelines mechanism is based on using embedding model for RAG (examples and context)
- Guidelines mechanism is based on using embedding model for cosine similarity (no RAG). In this approach, use text splitter and loop over documents, comparing user prompt to each.
Prompt Templates
[ X ] Base Phi-3 template [ ] Few Shot template with examples [ ] CoT template [ ] Reflexion template
Prompt Templates: Supporting Logic
[ ] Support loading prompt injection prompts and completions: few shot [ ] Support loading prompt injection prompts and completions: CoT [ ] Support loading prompt injection prompts and completions: Reflexion [ ] Correlate template to violation rate
Test Runs
[ ] run tests with various configuration-based settings (can pytest accept varying YML config args?) [ ] run test with random samplings of 25-30 each run, or increase timeouts [ ] log all max and average scores (tied to test name) to track overall baselines [ ] build up significant amount of test run results (JSON) for data viz
Metrics: General
[ ] use TF-IDF from scikit learn [ ] visualize results with Plotly/Seaborn? determine visualization metrics, use dummy numbers first
Metrics: False Refusal Rate, Effectiveness
[ ] define separate measures for false refusal rate [ ] measure effectiveness of LLM app overall: false refusal rate vs. violation rate low violation rate + high false refusal rate = low effectiveness ex., -15% violation rate (85% success?) + -(70%) false refusal rate = 15% effectiveness ex., -29% violation rate (71% success?) + -(12%) false refusal rate = 59% effectiveness [ ] Build test mechanism that loads test results from other runs/tests, analyzes and produces effectiveness metric
Guidelines
[ ] Summarize non-prompt injection portion of the prompt [ ] Chain-of-thought: Does the prompt include forceful suggestion? [ ] Chain-of-thought: Does the prompt include reverse psychology? [ ] Chain-of-thought: Does the prompt include misdirection? [ ] Tree-of-thought (???)
Guardrails
[ ] Reflexion (self-correction) - must include original prompt [ ] Final semantic similarity check after all other guardrails applied
Mitigations Applied to CI/CD Pipeline
[ ] revisit GitHub actions and demonstrate failing the build - this is how the results of the research are applied as a security control