Mirror of https://github.com/FuzzingLabs/fuzzforge_ai.git (synced 2026-02-25 09:05:12 +00:00)
Secret Detection Benchmark Dataset
A ground-truth dataset containing exactly 32 known secrets, for evaluating secret detection tools against a known baseline.
Contents
- 12 Easy Secrets: Standard patterns (AWS keys, GitHub PATs, Stripe keys, etc.)
- 10 Medium Secrets: Slightly obfuscated (Base64, hex, concatenated, in comments)
- 10 Hard Secrets: Well hidden (ROT13, binary, XOR, reversed, template strings)
Files
├── .env                    # 2 secrets
├── config/
│   ├── settings.py         # 3 secrets
│   ├── database.yaml       # 1 secret
│   ├── app.properties      # 1 secret
│   ├── oauth.json          # 1 secret
│   ├── keys.yaml           # 2 secrets
│   └── legacy.ini          # 2 secrets
├── src/
│   ├── app.py              # 1 secret
│   ├── Main.java           # 1 secret
│   ├── config.py           # 3 secrets (medium difficulty)
│   ├── obfuscated.py       # 4 secrets (hard difficulty)
│   ├── advanced.js         # 4 secrets (hard difficulty)
│   ├── Crypto.go           # 2 secrets (hard difficulty)
│   └── database.sql        # 1 secret
├── scripts/
│   ├── webhook.js          # 1 secret
│   └── deploy.sh           # 2 secrets
└── id_rsa                  # 1 secret
Total: 17 files with 32 secrets
Secret Difficulty Breakdown
Easy (12 secrets)
Should be detected by any decent secret scanner:
- Plain AWS access keys
- GitHub Personal Access Tokens
- Stripe API keys
- Database passwords in plain text
- JWT secrets
- SSH private keys
- OAuth secrets
- Slack webhooks
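Easy-tier secrets are exactly what pattern-based scanners target. As a minimal sketch, the regexes below are simplified illustrations (not Gitleaks' actual rules, which are far more extensive and combined with entropy checks), and the matched key is a fake placeholder:

```python
import re

# Simplified, illustrative detection patterns (not a real scanner's ruleset)
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "stripe_live_key": re.compile(r"sk_live_[A-Za-z0-9]{24,}"),
}

# Placeholder value, not a real credential
line = 'AWS_ACCESS_KEY_ID = "AKIAFAKEFAKEFAKEFAKE"'
hits = [name for name, pat in PATTERNS.items() if pat.search(line)]
```

Because easy-tier secrets appear verbatim in source, a single regex pass over each line is sufficient to find them.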
Medium (10 secrets)
Requires some parsing or contextual understanding:
- Base64 encoded AWS key
- Hex-encoded tokens
- Split strings concatenated at runtime
- URL-encoded passwords
- Multi-line private keys in YAML
- Secrets with Unicode characters
- Secrets in SQL/shell comments
- Deprecated config formats
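A few of the medium-tier obfuscations above can be sketched as follows. All values are fake placeholders invented for illustration, not secrets from the dataset:

```python
import base64
from urllib.parse import unquote

# Base64-encoded AWS-style key, decoded at runtime (placeholder value)
ENCODED_KEY = base64.b64encode(b"AKIAFAKEFAKEFAKEFAKE").decode()
aws_key = base64.b64decode(ENCODED_KEY).decode()

# Split string concatenated at runtime: no single literal holds the full token
token = "ghp_" + "fakefake" + "fakefake" + "fake1234"

# URL-encoded password inside a connection string
db_url = "postgres://admin:p%40ssw0rd%21@db.example.com/app"
password = unquote(db_url.split(":")[2].split("@")[0])  # decodes to "p@ssw0rd!"
```

Each of these defeats a naive line-by-line regex: the scanner must either decode the value or track string construction across expressions.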
Hard (10 secrets)
Well hidden, may challenge even advanced tools:
- ROT13 encoded secrets
- Binary string representations
- Character array joins
- Reversed strings
- Template string constructs
- Secrets in regex patterns
- XOR encrypted values
- Escaped JSON within strings
- Heredoc patterns
- Intentional typos corrected programmatically
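Several of the hard-tier hiding techniques can be sketched in the same way. Again, every value below is a fake placeholder, not a secret from the dataset:

```python
import codecs

# ROT13-encoded secret, decoded at runtime
rot13_secret = codecs.decode("frperg_inyhr", "rot13")  # -> "secret_value"

# Reversed string: the secret never appears forwards in the source
reversed_secret = "321_nekot_ekaf"[::-1]  # -> "fake_token_123"

# Character array join: defeats substring and regex matching on literals
char_secret = "".join(["f", "a", "k", "e", "_", "k", "e", "y"])

# XOR "encryption" with a single-byte key, decrypted on use
KEY = 0x2A
encrypted = bytes(b ^ KEY for b in b"fake_xor_secret")
xor_secret = bytes(b ^ KEY for b in encrypted).decode()
```

Static pattern matching cannot recover these; a tool needs dataflow analysis, emulation, or LLM-style reasoning to see that a secret is being reconstructed.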
Usage
Run secret detection tools against this directory and compare results to the ground truth file (located in backend/benchmarks/by_category/secret_detection/secret_detection_benchmark_GROUND_TRUTH.json) to calculate:
- Precision: TP / (TP + FP) - How many detected secrets are real?
- Recall: TP / (TP + FN) - How many real secrets were found?
- F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
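The three metrics above can be computed from sets of secret locations. This is a minimal sketch; the `(file, line)` key shape is an assumption for illustration, and the actual ground-truth JSON schema may differ:

```python
def score(truth: set, detected: set) -> dict:
    """Compute precision, recall, and F1 from ground-truth and detected locations."""
    tp = len(truth & detected)   # real secrets the tool found
    fp = len(detected - truth)   # detections that are not real secrets
    fn = len(truth - detected)   # real secrets the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical run: 24 of the 32 real secrets found, plus 4 false positives
truth = {("src/app.py", i) for i in range(32)}
detected = {("src/app.py", i) for i in range(24)} | {("other.py", i) for i in range(4)}
result = score(truth, detected)
```

In this hypothetical run, recall is 24/32 = 0.75 and precision is 24/28 ≈ 0.857.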
Expected Performance
| Tool Type | Expected Easy | Expected Medium | Expected Hard | Total Expected |
|---|---|---|---|---|
| Pattern-based (Gitleaks) | 12/12 (100%) | 6-8/10 (60-80%) | 2-4/10 (20-40%) | 20-24/32 |
| Entropy-based (TruffleHog) | 12/12 (100%) | 5-7/10 (50-70%) | 1-3/10 (10-30%) | 18-22/32 |
| LLM-based | 12/12 (100%) | 8-10/10 (80-100%) | 4-8/10 (40-80%) | 24-30/32 |
Validation
Use the validation script to check tool performance:
python validate_ground_truth.py --tool-output results.json
This will calculate precision, recall, and F1 score against the ground truth.