mirror of https://github.com/FuzzingLabs/fuzzforge_ai.git synced 2026-02-12 20:32:46 +00:00

Files

tduhamel42 3be4d34531 test: Add secret detection benchmark dataset and ground truth

Add comprehensive benchmark dataset with 32 documented secrets for testing
secret detection workflows (gitleaks, trufflehog, llm_secret_detection).

- Add test_projects/secret_detection_benchmark/ with 19 test files
- Add ground truth JSON with precise line-by-line secret mappings
- Update .gitignore with exceptions for benchmark files (not real secrets)

Dataset breakdown:
- 12 Easy secrets (standard patterns)
- 10 Medium secrets (obfuscated)
- 10 Hard secrets (well hidden)

2025-10-16 11:46:28 +02:00

.fuzzforge
config
scripts
src
.env
id_rsa
README.md
validate_ground_truth.py

README.md

Secret Detection Benchmark Dataset

Ground truth dataset with exactly 32 known secrets for testing secret detection tools.

12 Easy Secrets: Standard patterns (AWS keys, GitHub PATs, Stripe keys, etc.)
10 Medium Secrets: Slightly obfuscated (Base64, hex, concatenated, in comments)
10 Hard Secrets: Well hidden (ROT13, binary, XOR, reversed, template strings)

Files

├── .env                        # 2 secrets
├── config/
│   ├── settings.py            # 3 secrets
│   ├── database.yaml          # 1 secret
│   ├── app.properties         # 1 secret
│   ├── oauth.json             # 1 secret
│   ├── keys.yaml              # 2 secrets
│   └── legacy.ini             # 2 secrets
├── src/
│   ├── app.py                 # 1 secret
│   ├── Main.java              # 1 secret
│   ├── config.py              # 3 secrets (medium difficulty)
│   ├── obfuscated.py          # 4 secrets (hard difficulty)
│   ├── advanced.js            # 4 secrets (hard difficulty)
│   ├── Crypto.go              # 2 secrets (hard difficulty)
│   └── database.sql           # 1 secret
├── scripts/
│   ├── webhook.js             # 1 secret
│   └── deploy.sh              # 2 secrets
└── id_rsa                     # 1 secret

Total: 17 files with 32 secrets
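As a quick consistency check, the per-file counts in the tree above can be summed. This is a minimal sketch: the dictionary copies the counts from the listing, and `summarize` is a hypothetical helper name, not part of the repository.

```python
# Per-file secret counts copied from the tree above.
SECRETS_PER_FILE = {
    ".env": 2,
    "config/settings.py": 3,
    "config/database.yaml": 1,
    "config/app.properties": 1,
    "config/oauth.json": 1,
    "config/keys.yaml": 2,
    "config/legacy.ini": 2,
    "src/app.py": 1,
    "src/Main.java": 1,
    "src/config.py": 3,
    "src/obfuscated.py": 4,
    "src/advanced.js": 4,
    "src/Crypto.go": 2,
    "src/database.sql": 1,
    "scripts/webhook.js": 1,
    "scripts/deploy.sh": 2,
    "id_rsa": 1,
}


def summarize(counts):
    """Return (number of files, total number of secrets)."""
    return len(counts), sum(counts.values())


files, secrets = summarize(SECRETS_PER_FILE)
print(f"{files} files with {secrets} secrets")
```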

Secret Difficulty Breakdown

Easy (12 secrets)

Should be detected by any decent secret scanner:

Plain AWS access keys
GitHub Personal Access Tokens
Stripe API keys
Database passwords in plain text
JWT secrets
SSH private keys
OAuth secrets
Slack webhooks
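To illustrate why this tier is considered easy, here is a rough sketch of pattern-based detection. The three regular expressions are simplified illustrations of commonly documented token formats, not the rules any particular scanner ships, and `scan_text` is a hypothetical helper. The sample key is the placeholder value AWS uses in its own documentation.

```python
import re

# Illustrative patterns only; real scanners ship far larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_text(text):
    """Return (line_number, rule_name) for every pattern hit in text."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


sample = "AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\nDEBUG=true\n"
print(scan_text(sample))
```

Secrets in this tier appear verbatim in the file, so a single pass of line-by-line matching is enough to find them.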

Medium (10 secrets)

Requires some parsing or contextual understanding:

Base64 encoded AWS key
Hex-encoded tokens
Split strings concatenated at runtime
URL-encoded passwords
Multi-line private keys in YAML
Secrets with Unicode characters
Secrets in SQL/shell comments
Deprecated config formats
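Several of the medium cases only become visible after one decoding step. The sketch below shows one way a tool might expand a candidate string into its Base64, hex, and URL-decoded forms before pattern matching. `decoded_variants` is a hypothetical helper, and this is an assumption about how such a pre-processing step could look, not a description of how Gitleaks or TruffleHog work internally.

```python
import base64
import binascii
from urllib.parse import unquote


def decoded_variants(candidate):
    """Return the candidate plus any decodings that succeed."""
    variants = {candidate}

    # Base64: validate=True rejects characters outside the alphabet.
    try:
        variants.add(base64.b64decode(candidate, validate=True).decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass

    # Hex: bytes.fromhex raises ValueError on non-hex input.
    try:
        variants.add(bytes.fromhex(candidate).decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        pass

    # URL encoding: unquote leaves strings without %XX escapes unchanged.
    variants.add(unquote(candidate))
    return variants


encoded = base64.b64encode(b"AKIAIOSFODNN7EXAMPLE").decode("ascii")
print("AKIAIOSFODNN7EXAMPLE" in decoded_variants(encoded))
```

Each variant can then be fed to the same pattern or entropy checks used for plain-text secrets.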

Hard (10 secrets)

Well hidden, may challenge even advanced tools:

ROT13 encoded secrets
Binary string representations
Character array joins
Reversed strings
Template string constructs
Secrets in regex patterns
XOR encrypted values
Escaped JSON within strings
Heredoc patterns
Intentional typos corrected programmatically
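For a sense of what these transformations look like, here is a small sketch of a few of the reversible encodings named above. The function names are hypothetical and the inputs are harmless placeholder strings, not values from the benchmark files.

```python
import codecs


def rot13(text):
    """ROT13 is its own inverse, so this both encodes and decodes."""
    return codecs.decode(text, "rot13")


def reverse(text):
    """Undo a reversed string."""
    return text[::-1]


def xor_bytes(data, key):
    """XOR every byte with a single-byte key; applying it twice restores data."""
    return bytes(b ^ key for b in data)


def from_binary(bits):
    """Decode space-separated 8-bit groups, e.g. '01101000 01101001'."""
    return "".join(chr(int(group, 2)) for group in bits.split())


def join_chars(chars):
    """Rebuild a string that was stored as a character array."""
    return "".join(chars)


print(rot13("uryyb"))
print(reverse("terces"))
print(xor_bytes(xor_bytes(b"secret", 0x42), 0x42))
print(from_binary("01101000 01101001"))
print(join_chars(["k", "e", "y"]))
```

A static pattern match sees none of the decoded values, which is why this tier is expected to be hard for pattern-based and entropy-based tools.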

Usage

Run secret detection tools against this directory, then compare the results to the ground truth file (backend/benchmarks/by_category/secret_detection/secret_detection_benchmark_GROUND_TRUTH.json) to calculate the following, where TP, FP, and FN are the counts of true positives, false positives, and false negatives:

Precision: TP / (TP + FP) - What fraction of detected secrets are real?
Recall: TP / (TP + FN) - What fraction of real secrets were found?
F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
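The three formulas translate directly into code. This is a minimal sketch: `precision_recall_f1` is a hypothetical helper, the zero-denominator handling is a design choice made here, and the counts in the example are made up for illustration rather than measured results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for any metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical run: 24 of the 32 secrets found, plus 6 false alarms.
p, r, f1 = precision_recall_f1(tp=24, fp=6, fn=8)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because the dataset has exactly 32 secrets, TP + FN should always equal 32 for a complete run.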

Expected Performance

Tool Type	Expected Easy	Expected Medium	Expected Hard	Total Expected
Pattern-based (Gitleaks)	12/12 (100%)	6-8/10 (60-80%)	2-4/10 (20-40%)	20-24/32
Entropy-based (TruffleHog)	12/12 (100%)	5-7/10 (50-70%)	1-3/10 (10-30%)	18-22/32
LLM-based	12/12 (100%)	8-10/10 (80-100%)	4-8/10 (40-80%)	24-30/32
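The "Total Expected" column is the sum of the per-tier ranges, which also gives an expected overall recall out of 32. The sketch below recomputes it from the table. The numbers are copied from the rows above, and `total_range` is a hypothetical helper.

```python
TOTAL_SECRETS = 32

# (low, high) expected detections per tier: easy, medium, hard.
EXPECTED = {
    "Pattern-based (Gitleaks)": [(12, 12), (6, 8), (2, 4)],
    "Entropy-based (TruffleHog)": [(12, 12), (5, 7), (1, 3)],
    "LLM-based": [(12, 12), (8, 10), (4, 8)],
}


def total_range(tiers):
    """Sum the low and high ends of each tier's expected range."""
    low = sum(lo for lo, _ in tiers)
    high = sum(hi for _, hi in tiers)
    return low, high


for tool, tiers in EXPECTED.items():
    low, high = total_range(tiers)
    print(
        f"{tool}: {low}-{high}/{TOTAL_SECRETS} "
        f"({low / TOTAL_SECRETS:.0%}-{high / TOTAL_SECRETS:.0%} recall)"
    )
```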

Validation

Use the validation script to check tool performance:

python validate_ground_truth.py --tool-output results.json

This will calculate precision, recall, and F1 score against the ground truth.
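The comparison step can be pictured as set arithmetic over finding locations. The sketch below is not the code in validate_ground_truth.py. It assumes, purely for illustration, that both the tool output and the ground truth can be reduced to (file, line) pairs. The actual JSON schemas are not shown here, and `classify` is a hypothetical name.

```python
def classify(findings, ground_truth):
    """Count true positives, false positives, and false negatives.

    Both arguments are iterables of (file_path, line_number) pairs.
    """
    found = set(findings)
    truth = set(ground_truth)
    return {
        "tp": len(found & truth),   # reported and real
        "fp": len(found - truth),   # reported but not in ground truth
        "fn": len(truth - found),   # real but missed
    }


# Made-up locations, for illustration only.
truth = [(".env", 3), (".env", 7), ("src/app.py", 12)]
reported = [(".env", 3), ("src/app.py", 12), ("src/app.py", 40)]
print(classify(reported, truth))
```

Matching on exact line numbers is strict. A tool that reports a multi-line secret on a neighboring line would count as both a false positive and a false negative under this scheme, so a tolerance window may be worth considering.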
