Secret Detection Benchmark Dataset

A ground-truth dataset containing exactly 32 known secrets, used to test secret detection tools.

Contents

  • 12 Easy Secrets: Standard patterns (AWS keys, GitHub PATs, Stripe keys, etc.)
  • 10 Medium Secrets: Slightly obfuscated (Base64, hex, concatenated, in comments)
  • 10 Hard Secrets: Well hidden (ROT13, binary, XOR, reversed, template strings)

Files

├── .env                        # 2 secrets
├── config/
│   ├── settings.py            # 3 secrets
│   ├── database.yaml          # 1 secret
│   ├── app.properties         # 1 secret
│   ├── oauth.json             # 1 secret
│   ├── keys.yaml              # 2 secrets
│   └── legacy.ini             # 2 secrets
├── src/
│   ├── app.py                 # 1 secret
│   ├── Main.java              # 1 secret
│   ├── config.py              # 3 secrets (medium difficulty)
│   ├── obfuscated.py          # 4 secrets (hard difficulty)
│   ├── advanced.js            # 4 secrets (hard difficulty)
│   ├── Crypto.go              # 2 secrets (hard difficulty)
│   └── database.sql           # 1 secret
├── scripts/
│   ├── webhook.js             # 1 secret
│   └── deploy.sh              # 2 secrets
└── id_rsa                     # 1 secret

Total: 17 files with 32 secrets
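As a quick sanity check, the per-file counts in the tree above can be summed to confirm the stated totals. The sketch below copies the counts from the tree; the dictionary and the `summarize` helper are illustrative names, not part of the benchmark.

```python
# Per-file secret counts, copied from the file tree above.
SECRETS_PER_FILE = {
    ".env": 2,
    "config/settings.py": 3,
    "config/database.yaml": 1,
    "config/app.properties": 1,
    "config/oauth.json": 1,
    "config/keys.yaml": 2,
    "config/legacy.ini": 2,
    "src/app.py": 1,
    "src/Main.java": 1,
    "src/config.py": 3,
    "src/obfuscated.py": 4,
    "src/advanced.js": 4,
    "src/Crypto.go": 2,
    "src/database.sql": 1,
    "scripts/webhook.js": 1,
    "scripts/deploy.sh": 2,
    "id_rsa": 1,
}


def summarize(counts):
    """Return (number_of_files, total_secrets) for a path -> count mapping."""
    return len(counts), sum(counts.values())


if __name__ == "__main__":
    files, secrets = summarize(SECRETS_PER_FILE)
    print(f"{files} files, {secrets} secrets")
```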

Secret Difficulty Breakdown

Easy (12 secrets)

Should be detected by any decent secret scanner:

  • Plain AWS access keys
  • GitHub Personal Access Tokens
  • Stripe API keys
  • Database passwords in plain text
  • JWT secrets
  • SSH private keys
  • OAuth secrets
  • Slack webhooks

Medium (10 secrets)

Requires some parsing or contextual understanding:

  • Base64 encoded AWS key
  • Hex-encoded tokens
  • Split strings concatenated at runtime
  • URL-encoded passwords
  • Multi-line private keys in YAML
  • Secrets with Unicode characters
  • Secrets in SQL/shell comments
  • Deprecated config formats
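To illustrate why the encoded cases in this tier need more than a plain regex, here is a minimal sketch of a pre-processing step that tries Base64, hex, and URL decoding on a candidate string before pattern matching. The function name `candidate_decodings` is hypothetical, and this is not how any particular scanner is implemented.

```python
import base64
import binascii
from urllib.parse import unquote


def candidate_decodings(text):
    """Return the raw text plus every Base64, hex, or URL decoding that succeeds."""
    candidates = [text]

    # Base64: validate=True rejects characters outside the Base64 alphabet.
    try:
        candidates.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass

    # Hex: bytes.fromhex raises ValueError on non-hex input.
    try:
        candidates.append(bytes.fromhex(text).decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        pass

    # URL encoding: only keep the result if decoding changed something.
    url_decoded = unquote(text)
    if url_decoded != text:
        candidates.append(url_decoded)

    return candidates
```

A scanner could then run its usual patterns over every returned candidate instead of only the raw string. The trade-off is more false positives, since short strings can decode "successfully" by accident.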

Hard (10 secrets)

Well hidden, may challenge even advanced tools:

  • ROT13 encoded secrets
  • Binary string representations
  • Character array joins
  • Reversed strings
  • Template string constructs
  • Secrets in regex patterns
  • XOR encrypted values
  • Escaped JSON within strings
  • Heredoc patterns
  • Intentional typos corrected programmatically
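Several of these techniques are trivially reversible once a tool knows to try them. The sketch below shows four of the transforms named above (ROT13, reversal, XOR, binary strings) on harmless dummy values. The helper names, and the single-byte key used for XOR, are assumptions for illustration; the benchmark files may use different variants.

```python
import codecs


def rot13(text):
    """ROT13 is its own inverse: applying it twice returns the input."""
    return codecs.decode(text, "rot_13")


def reverse(text):
    """Undo a reversed string."""
    return text[::-1]


def xor_bytes(data, key):
    """XOR every byte with a single-byte key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)


def from_binary_string(bits):
    """Decode whitespace-separated 8-bit groups into text."""
    return "".join(chr(int(group, 2)) for group in bits.split())


if __name__ == "__main__":
    print(rot13("uryyb"))                           # hello
    print(reverse("terces"))                        # secret
    print(xor_bytes(xor_bytes(b"demo", 0x2A), 0x2A))
    print(from_binary_string("01101000 01101001"))  # hi
```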

Usage

Run a secret detection tool against this directory, then compare its findings with the ground truth file (backend/benchmarks/by_category/secret_detection/secret_detection_benchmark_GROUND_TRUTH.json). With TP, FP, and FN denoting true positives, false positives, and false negatives, calculate:

  • Precision: TP / (TP + FP), the fraction of reported secrets that are real
  • Recall: TP / (TP + FN), the fraction of real secrets that were found
  • F1 Score: 2 × (Precision × Recall) / (Precision + Recall), the harmonic mean of the two
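The three formulas can be written as a small helper. This is a minimal sketch; the function name is illustrative, and returning 0.0 when a denominator is zero is a convention chosen here rather than something the benchmark specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical run: 24 of the 32 secrets found, plus 6 false alarms.
    p, r, f1 = precision_recall_f1(tp=24, fp=6, fn=8)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
```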

Expected Performance

| Tool Type | Expected Easy | Expected Medium | Expected Hard | Total Expected |
|---|---|---|---|---|
| Pattern-based (Gitleaks) | 12/12 (100%) | 6-8/10 (60-80%) | 2-4/10 (20-40%) | 20-24/32 |
| Entropy-based (TruffleHog) | 12/12 (100%) | 5-7/10 (50-70%) | 1-3/10 (10-30%) | 18-22/32 |
| LLM-based | 12/12 (100%) | 8-10/10 (80-100%) | 4-8/10 (40-80%) | 24-30/32 |
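To compare a real run against these expectations, detections have to be broken down by difficulty tier. A minimal sketch follows, assuming each ground-truth secret has an identifier and a difficulty label; both the input shape and the function name are hypothetical and may not match the actual ground truth file.

```python
from collections import Counter


def recall_by_difficulty(ground_truth, detected_ids):
    """Return {difficulty: (found, total)}.

    ground_truth: mapping of secret id -> difficulty label (assumed shape).
    detected_ids: set of secret ids the tool reported; unknown ids are ignored.
    """
    totals = Counter(ground_truth.values())
    found = Counter(
        difficulty
        for secret_id, difficulty in ground_truth.items()
        if secret_id in detected_ids
    )
    return {difficulty: (found[difficulty], totals[difficulty]) for difficulty in totals}


if __name__ == "__main__":
    truth = {1: "easy", 2: "easy", 3: "medium", 4: "hard"}
    for tier, (hit, total) in recall_by_difficulty(truth, {1, 2, 4}).items():
        print(f"{tier}: {hit}/{total}")
```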

Validation

Use the validation script to check tool performance:

python validate_ground_truth.py --tool-output results.json

This will calculate precision, recall, and F1 score against the ground truth.
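For readers who want to see what such a comparison involves, here is a rough sketch of the matching step. It is not the actual validate_ground_truth.py; it assumes, hypothetically, that both the ground truth and the tool output are lists of records with "file" and "line" keys, and that a finding counts as correct only when both match exactly.

```python
def confusion_counts(ground_truth, findings):
    """Return (tp, fp, fn) by matching records on (file, line).

    Both arguments are lists of dicts with "file" and "line" keys;
    this schema is an assumption made for illustration.
    """
    truth = {(record["file"], record["line"]) for record in ground_truth}
    found = {(record["file"], record["line"]) for record in findings}
    tp = len(truth & found)   # reported and real
    fp = len(found - truth)   # reported but not in the ground truth
    fn = len(truth - found)   # real but missed
    return tp, fp, fn


if __name__ == "__main__":
    truth = [{"file": ".env", "line": 1}, {"file": "src/app.py", "line": 10}]
    found = [{"file": ".env", "line": 1}, {"file": "README.md", "line": 3}]
    print(confusion_counts(truth, found))
```

Exact line matching is strict. A real validator may allow a small line tolerance or match on the secret value instead, which changes how duplicates and multi-line keys are counted.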