mirror of
https://github.com/FuzzingLabs/fuzzforge_ai.git
synced 2026-05-27 15:22:26 +02:00
feat: Add secret detection workflows and comprehensive benchmarking (#15)
Add three production-ready secret detection workflows with full benchmarking infrastructure: **New Workflows:** - gitleaks_detection: Pattern-based secret scanning (13/32 benchmark secrets) - trufflehog_detection: Entropy-based detection with verification (1/32 benchmark secrets) - llm_secret_detection: AI-powered semantic analysis (32/32 benchmark secrets - 100% recall) **Benchmarking Infrastructure:** - Ground truth dataset with 32 documented secrets (12 Easy, 10 Medium, 10 Hard) - Automated comparison tools for precision/recall testing - SARIF output format for all workflows - Performance metrics and tool comparison reports **Fixes:** - Set gitleaks default to no_git=True for uploaded directories - Update documentation with correct secret counts and workflow names - Temporarily deactivate AI agent command - Clean up deprecated test files and GitGuardian workflow **Testing:** All workflows verified on secret_detection_benchmark and vulnerable_app test projects. Workers healthy and system fully functional.
This commit is contained in:
@@ -0,0 +1,240 @@
|
||||
# Secret Detection Benchmarks
|
||||
|
||||
Comprehensive benchmarking suite comparing secret detection tools via complete workflow execution:
|
||||
- **Gitleaks** - Fast pattern-based detection
|
||||
- **TruffleHog** - Entropy analysis with verification
|
||||
- **LLM Detector** - AI-powered semantic analysis (gpt-4o-mini, gpt-5-mini)
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Run All Comparisons
|
||||
|
||||
```bash
|
||||
cd backend
|
||||
python benchmarks/by_category/secret_detection/compare_tools.py
|
||||
```
|
||||
|
||||
This will run all workflows on `test_projects/secret_detection_benchmark/` and generate comparison reports.
|
||||
|
||||
### Run Benchmark Tests
|
||||
|
||||
```bash
|
||||
# All benchmarks (Gitleaks, TruffleHog, LLM with 3 models)
|
||||
pytest benchmarks/by_category/secret_detection/bench_comparison.py --benchmark-only -v
|
||||
|
||||
# Specific tool only
|
||||
pytest benchmarks/by_category/secret_detection/bench_comparison.py::TestSecretDetectionComparison::test_gitleaks_workflow --benchmark-only -v
|
||||
|
||||
# Performance tests only
|
||||
pytest benchmarks/by_category/secret_detection/bench_comparison.py::TestSecretDetectionPerformance --benchmark-only -v
|
||||
```
|
||||
|
||||
## Ground Truth Dataset
|
||||
|
||||
**Controlled Benchmark** (`test_projects/secret_detection_benchmark/`)
|
||||
|
||||
**Exactly 32 documented secrets** for accurate precision/recall testing:
|
||||
- **12 Easy**: Standard patterns (AWS keys, GitHub PATs, Stripe keys, SSH keys)
|
||||
- **10 Medium**: Obfuscated (Base64, hex, concatenated, in comments, Unicode)
|
||||
- **10 Hard**: Well hidden (ROT13, binary, XOR, reversed, template strings, regex patterns)
|
||||
|
||||
All secrets documented in `secret_detection_benchmark_GROUND_TRUTH.json` with exact file paths and line numbers.
|
||||
|
||||
See `test_projects/secret_detection_benchmark/README.md` for details.
|
||||
|
||||
## Metrics Measured
|
||||
|
||||
### Accuracy Metrics
|
||||
- **Precision**: TP / (TP + FP) - How many detected secrets are real?
|
||||
- **Recall**: TP / (TP + FN) - How many real secrets were found?
|
||||
- **F1 Score**: Harmonic mean of precision and recall
|
||||
- **False Positive Rate**: FP / Total Detected
|
||||
|
||||
### Performance Metrics
|
||||
- **Execution Time**: Total time to scan all files
|
||||
- **Throughput**: Files/secrets scanned per second
|
||||
- **Memory Usage**: Peak memory during execution
|
||||
|
||||
### Thresholds (from `category_configs.py`)
|
||||
- Minimum Precision: 90%
|
||||
- Minimum Recall: 95%
|
||||
- Max Execution Time (small): 2.0s
|
||||
- Max False Positives: 5 per 100 secrets
|
||||
|
||||
## Tool Comparison
|
||||
|
||||
### Gitleaks
|
||||
**Strengths:**
|
||||
- Fastest execution
|
||||
- Git-aware (commit history scanning)
|
||||
- Low false positive rate
|
||||
- No API required
|
||||
- Works offline
|
||||
|
||||
**Weaknesses:**
|
||||
- Pattern-based only
|
||||
- May miss obfuscated secrets
|
||||
- Limited to known patterns
|
||||
|
||||
### TruffleHog
|
||||
**Strengths:**
|
||||
- Secret verification (validates if active)
|
||||
- High detection rate with entropy analysis
|
||||
- Multiple detectors (600+ secret types)
|
||||
- Catches high-entropy strings
|
||||
|
||||
**Weaknesses:**
|
||||
- Slower than Gitleaks
|
||||
- Higher false positive rate
|
||||
- Verification requires network calls
|
||||
|
||||
### LLM Detector
|
||||
**Strengths:**
|
||||
- Semantic understanding of context
|
||||
- Catches novel/custom secret patterns
|
||||
- Can reason about what "looks like" a secret
|
||||
- Multiple model options (GPT-4, Claude, etc.)
|
||||
- Understands code context
|
||||
|
||||
**Weaknesses:**
|
||||
- Slowest (API latency + LLM processing)
|
||||
- Most expensive (LLM API costs)
|
||||
- Requires A2A agent infrastructure
|
||||
- Accuracy varies by model
|
||||
- May miss well-disguised secrets
|
||||
|
||||
## Results Directory
|
||||
|
||||
After running comparisons, results are saved to:
|
||||
```
|
||||
benchmarks/by_category/secret_detection/results/
|
||||
├── comparison_report.md # Human-readable comparison with:
|
||||
│ # - Summary table with secrets/files/avg per file/time
|
||||
│ # - Agreement analysis (secrets found by N tools)
|
||||
│ # - Tool agreement matrix (overlap between pairs)
|
||||
│ # - Per-file detailed comparison table
|
||||
│ # - File type breakdown
|
||||
│ # - Files analyzed by each tool
|
||||
│ # - Overlap analysis and performance summary
|
||||
└── comparison_results.json # Machine-readable data with findings_by_file
|
||||
```
|
||||
|
||||
## Latest Benchmark Results
|
||||
|
||||
Run the benchmark to generate results:
|
||||
```bash
|
||||
cd backend
|
||||
python benchmarks/by_category/secret_detection/compare_tools.py
|
||||
```
|
||||
|
||||
Results are saved to `results/comparison_report.md` with:
|
||||
- Summary table (secrets found, files scanned, time)
|
||||
- Agreement analysis (how many tools found each secret)
|
||||
- Tool agreement matrix (overlap between tools)
|
||||
- Per-file detailed comparison
|
||||
- File type breakdown
|
||||
|
||||
## CI/CD Integration
|
||||
|
||||
Add to your CI pipeline:
|
||||
|
||||
```yaml
|
||||
# .github/workflows/benchmark-secrets.yml
|
||||
name: Secret Detection Benchmark
|
||||
|
||||
on:
|
||||
schedule:
|
||||
- cron: '0 0 * * 0' # Weekly
|
||||
workflow_dispatch:
|
||||
|
||||
jobs:
|
||||
benchmark:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v3
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: '3.11'
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
pip install -r backend/requirements.txt
|
||||
pip install pytest-benchmark
|
||||
|
||||
- name: Run benchmarks
|
||||
env:
|
||||
GITGUARDIAN_API_KEY: ${{ secrets.GITGUARDIAN_API_KEY }}
|
||||
run: |
|
||||
cd backend
|
||||
pytest benchmarks/by_category/secret_detection/bench_comparison.py \
|
||||
--benchmark-only \
|
||||
--benchmark-json=results.json \
|
||||
--gitguardian-api-key
|
||||
|
||||
- name: Upload results
|
||||
uses: actions/upload-artifact@v3
|
||||
with:
|
||||
name: benchmark-results
|
||||
path: backend/results.json
|
||||
```
|
||||
|
||||
## Adding New Tools
|
||||
|
||||
To benchmark a new secret detection tool:
|
||||
|
||||
1. Create module in `toolbox/modules/secret_detection/`
|
||||
2. Register in `__init__.py`
|
||||
3. Add to `compare_tools.py` in `run_all_tools()`
|
||||
4. Add test in `bench_comparison.py`
|
||||
|
||||
## Interpreting Results
|
||||
|
||||
### High Precision, Low Recall
|
||||
Tool is conservative - few false positives but misses secrets.
|
||||
**Use case**: Production environments where false positives are costly.
|
||||
|
||||
### Low Precision, High Recall
|
||||
Tool is aggressive - finds most secrets but many false positives.
|
||||
**Use case**: Initial scans where manual review is acceptable.
|
||||
|
||||
### Balanced (High F1)
|
||||
Tool has good balance of precision and recall.
|
||||
**Use case**: General purpose scanning.
|
||||
|
||||
### Fast Execution
|
||||
Suitable for CI/CD pipelines and pre-commit hooks.
|
||||
|
||||
### Slow but Accurate
|
||||
Better for comprehensive security audits.
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use multiple tools**: Each has strengths/weaknesses
|
||||
2. **Combine results**: Union of all findings for maximum coverage
|
||||
3. **Filter intelligently**: Remove known false positives
|
||||
4. **Verify findings**: Check if secrets are actually valid
|
||||
5. **Track over time**: Monitor precision/recall trends
|
||||
6. **Update regularly**: Patterns evolve, tools improve
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### GitGuardian Tests Skipped
|
||||
- Set `GITGUARDIAN_API_KEY` environment variable
|
||||
- Use `--gitguardian-api-key` flag
|
||||
|
||||
### LLM Tests Skipped
|
||||
- Ensure A2A agent is running
|
||||
- Check agent URL in config
|
||||
- Use `--llm-enabled` flag
|
||||
|
||||
### Low Recall
|
||||
- Check if ground truth is up to date
|
||||
- Verify tool is configured correctly
|
||||
- Review missed secrets manually
|
||||
|
||||
### High False Positives
|
||||
- Adjust tool sensitivity
|
||||
- Add exclusion patterns
|
||||
- Review false positive list
|
||||
@@ -0,0 +1,285 @@
|
||||
"""
|
||||
Secret Detection Tool Comparison Benchmark
|
||||
|
||||
Compares Gitleaks, TruffleHog, and LLM-based detection
|
||||
on the vulnerable_app ground truth dataset via workflow execution.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Any
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parents[3] / "sdk" / "src"))
|
||||
|
||||
from fuzzforge_sdk import FuzzForgeClient
|
||||
from benchmarks.category_configs import ModuleCategory, get_threshold
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def target_path():
|
||||
"""Path to vulnerable_app"""
|
||||
path = Path(__file__).parent.parent.parent.parent.parent / "test_projects" / "vulnerable_app"
|
||||
assert path.exists(), f"Target not found: {path}"
|
||||
return path
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def ground_truth(target_path):
|
||||
"""Load ground truth data"""
|
||||
metadata_file = target_path / "SECRETS_GROUND_TRUTH.json"
|
||||
assert metadata_file.exists(), f"Ground truth not found: {metadata_file}"
|
||||
|
||||
with open(metadata_file) as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sdk_client():
|
||||
"""FuzzForge SDK client"""
|
||||
client = FuzzForgeClient(base_url="http://localhost:8000")
|
||||
yield client
|
||||
client.close()
|
||||
|
||||
|
||||
def calculate_metrics(sarif_results: List[Dict], ground_truth: Dict[str, Any]) -> Dict[str, float]:
|
||||
"""Calculate precision, recall, and F1 score"""
|
||||
|
||||
# Extract expected secrets from ground truth
|
||||
expected_secrets = set()
|
||||
for file_info in ground_truth["files"]:
|
||||
if "secrets" in file_info:
|
||||
for secret in file_info["secrets"]:
|
||||
expected_secrets.add((file_info["filename"], secret["line"]))
|
||||
|
||||
# Extract detected secrets from SARIF
|
||||
detected_secrets = set()
|
||||
for result in sarif_results:
|
||||
locations = result.get("locations", [])
|
||||
for location in locations:
|
||||
physical_location = location.get("physicalLocation", {})
|
||||
artifact_location = physical_location.get("artifactLocation", {})
|
||||
region = physical_location.get("region", {})
|
||||
|
||||
uri = artifact_location.get("uri", "")
|
||||
line = region.get("startLine", 0)
|
||||
|
||||
if uri and line:
|
||||
file_path = Path(uri)
|
||||
filename = file_path.name
|
||||
detected_secrets.add((filename, line))
|
||||
# Also try with relative path
|
||||
if len(file_path.parts) > 1:
|
||||
rel_path = str(Path(*file_path.parts[-2:]))
|
||||
detected_secrets.add((rel_path, line))
|
||||
|
||||
# Calculate metrics
|
||||
true_positives = len(expected_secrets & detected_secrets)
|
||||
false_positives = len(detected_secrets - expected_secrets)
|
||||
false_negatives = len(expected_secrets - detected_secrets)
|
||||
|
||||
precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
|
||||
recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
|
||||
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
|
||||
|
||||
return {
|
||||
"precision": precision,
|
||||
"recall": recall,
|
||||
"f1": f1,
|
||||
"true_positives": true_positives,
|
||||
"false_positives": false_positives,
|
||||
"false_negatives": false_negatives
|
||||
}
|
||||
|
||||
|
||||
class TestSecretDetectionComparison:
|
||||
"""Compare all secret detection tools"""
|
||||
|
||||
@pytest.mark.benchmark(group="secret_detection")
|
||||
def test_gitleaks_workflow(self, benchmark, sdk_client, target_path, ground_truth):
|
||||
"""Benchmark Gitleaks workflow accuracy and performance"""
|
||||
|
||||
def run_gitleaks():
|
||||
run = sdk_client.submit_workflow_with_upload(
|
||||
workflow_name="gitleaks_detection",
|
||||
target_path=str(target_path),
|
||||
parameters={
|
||||
"scan_mode": "detect",
|
||||
"no_git": True,
|
||||
"redact": False
|
||||
}
|
||||
)
|
||||
|
||||
result = sdk_client.wait_for_completion(run.run_id, timeout=300)
|
||||
assert result.status == "completed", f"Workflow failed: {result.status}"
|
||||
|
||||
findings = sdk_client.get_run_findings(run.run_id)
|
||||
assert findings and findings.sarif, "No findings returned"
|
||||
|
||||
return findings
|
||||
|
||||
findings = benchmark(run_gitleaks)
|
||||
|
||||
# Extract SARIF results
|
||||
sarif_results = []
|
||||
for run_data in findings.sarif.get("runs", []):
|
||||
sarif_results.extend(run_data.get("results", []))
|
||||
|
||||
# Calculate metrics
|
||||
metrics = calculate_metrics(sarif_results, ground_truth)
|
||||
|
||||
# Log results
|
||||
print(f"\n=== Gitleaks Workflow Results ===")
|
||||
print(f"Precision: {metrics['precision']:.2%}")
|
||||
print(f"Recall: {metrics['recall']:.2%}")
|
||||
print(f"F1 Score: {metrics['f1']:.2%}")
|
||||
print(f"True Positives: {metrics['true_positives']}")
|
||||
print(f"False Positives: {metrics['false_positives']}")
|
||||
print(f"False Negatives: {metrics['false_negatives']}")
|
||||
print(f"Findings Count: {len(sarif_results)}")
|
||||
|
||||
# Assert meets thresholds
|
||||
min_precision = get_threshold(ModuleCategory.SECRET_DETECTION, "min_precision")
|
||||
min_recall = get_threshold(ModuleCategory.SECRET_DETECTION, "min_recall")
|
||||
|
||||
assert metrics['precision'] >= min_precision, \
|
||||
f"Precision {metrics['precision']:.2%} below threshold {min_precision:.2%}"
|
||||
assert metrics['recall'] >= min_recall, \
|
||||
f"Recall {metrics['recall']:.2%} below threshold {min_recall:.2%}"
|
||||
|
||||
@pytest.mark.benchmark(group="secret_detection")
|
||||
def test_trufflehog_workflow(self, benchmark, sdk_client, target_path, ground_truth):
|
||||
"""Benchmark TruffleHog workflow accuracy and performance"""
|
||||
|
||||
def run_trufflehog():
|
||||
run = sdk_client.submit_workflow_with_upload(
|
||||
workflow_name="trufflehog_detection",
|
||||
target_path=str(target_path),
|
||||
parameters={
|
||||
"verify": False,
|
||||
"max_depth": 10
|
||||
}
|
||||
)
|
||||
|
||||
result = sdk_client.wait_for_completion(run.run_id, timeout=300)
|
||||
assert result.status == "completed", f"Workflow failed: {result.status}"
|
||||
|
||||
findings = sdk_client.get_run_findings(run.run_id)
|
||||
assert findings and findings.sarif, "No findings returned"
|
||||
|
||||
return findings
|
||||
|
||||
findings = benchmark(run_trufflehog)
|
||||
|
||||
sarif_results = []
|
||||
for run_data in findings.sarif.get("runs", []):
|
||||
sarif_results.extend(run_data.get("results", []))
|
||||
|
||||
metrics = calculate_metrics(sarif_results, ground_truth)
|
||||
|
||||
print(f"\n=== TruffleHog Workflow Results ===")
|
||||
print(f"Precision: {metrics['precision']:.2%}")
|
||||
print(f"Recall: {metrics['recall']:.2%}")
|
||||
print(f"F1 Score: {metrics['f1']:.2%}")
|
||||
print(f"True Positives: {metrics['true_positives']}")
|
||||
print(f"False Positives: {metrics['false_positives']}")
|
||||
print(f"False Negatives: {metrics['false_negatives']}")
|
||||
print(f"Findings Count: {len(sarif_results)}")
|
||||
|
||||
min_precision = get_threshold(ModuleCategory.SECRET_DETECTION, "min_precision")
|
||||
min_recall = get_threshold(ModuleCategory.SECRET_DETECTION, "min_recall")
|
||||
|
||||
assert metrics['precision'] >= min_precision
|
||||
assert metrics['recall'] >= min_recall
|
||||
|
||||
@pytest.mark.benchmark(group="secret_detection")
|
||||
@pytest.mark.parametrize("model", [
|
||||
"gpt-4o-mini",
|
||||
"gpt-4o",
|
||||
"claude-3-5-sonnet-20241022"
|
||||
])
|
||||
def test_llm_workflow(self, benchmark, sdk_client, target_path, ground_truth, model):
|
||||
"""Benchmark LLM workflow with different models"""
|
||||
|
||||
def run_llm():
|
||||
provider = "openai" if "gpt" in model else "anthropic"
|
||||
|
||||
run = sdk_client.submit_workflow_with_upload(
|
||||
workflow_name="llm_secret_detection",
|
||||
target_path=str(target_path),
|
||||
parameters={
|
||||
"agent_url": "http://fuzzforge-task-agent:8000/a2a/litellm_agent",
|
||||
"llm_model": model,
|
||||
"llm_provider": provider,
|
||||
"max_files": 20,
|
||||
"timeout": 60
|
||||
}
|
||||
)
|
||||
|
||||
result = sdk_client.wait_for_completion(run.run_id, timeout=300)
|
||||
assert result.status == "completed", f"Workflow failed: {result.status}"
|
||||
|
||||
findings = sdk_client.get_run_findings(run.run_id)
|
||||
assert findings and findings.sarif, "No findings returned"
|
||||
|
||||
return findings
|
||||
|
||||
findings = benchmark(run_llm)
|
||||
|
||||
sarif_results = []
|
||||
for run_data in findings.sarif.get("runs", []):
|
||||
sarif_results.extend(run_data.get("results", []))
|
||||
|
||||
metrics = calculate_metrics(sarif_results, ground_truth)
|
||||
|
||||
print(f"\n=== LLM ({model}) Workflow Results ===")
|
||||
print(f"Precision: {metrics['precision']:.2%}")
|
||||
print(f"Recall: {metrics['recall']:.2%}")
|
||||
print(f"F1 Score: {metrics['f1']:.2%}")
|
||||
print(f"True Positives: {metrics['true_positives']}")
|
||||
print(f"False Positives: {metrics['false_positives']}")
|
||||
print(f"False Negatives: {metrics['false_negatives']}")
|
||||
print(f"Findings Count: {len(sarif_results)}")
|
||||
|
||||
|
||||
class TestSecretDetectionPerformance:
|
||||
"""Performance benchmarks for each tool"""
|
||||
|
||||
@pytest.mark.benchmark(group="secret_detection")
|
||||
def test_gitleaks_performance(self, benchmark, sdk_client, target_path):
|
||||
"""Benchmark Gitleaks workflow execution speed"""
|
||||
|
||||
def run():
|
||||
run = sdk_client.submit_workflow_with_upload(
|
||||
workflow_name="gitleaks_detection",
|
||||
target_path=str(target_path),
|
||||
parameters={"scan_mode": "detect", "no_git": True}
|
||||
)
|
||||
result = sdk_client.wait_for_completion(run.run_id, timeout=300)
|
||||
return result
|
||||
|
||||
result = benchmark(run)
|
||||
|
||||
max_time = get_threshold(ModuleCategory.SECRET_DETECTION, "max_execution_time_small")
|
||||
# Note: Workflow execution time includes orchestration overhead
|
||||
# so we allow 2x the module threshold
|
||||
assert result.execution_time < max_time * 2
|
||||
|
||||
@pytest.mark.benchmark(group="secret_detection")
|
||||
def test_trufflehog_performance(self, benchmark, sdk_client, target_path):
|
||||
"""Benchmark TruffleHog workflow execution speed"""
|
||||
|
||||
def run():
|
||||
run = sdk_client.submit_workflow_with_upload(
|
||||
workflow_name="trufflehog_detection",
|
||||
target_path=str(target_path),
|
||||
parameters={"verify": False}
|
||||
)
|
||||
result = sdk_client.wait_for_completion(run.run_id, timeout=300)
|
||||
return result
|
||||
|
||||
result = benchmark(run)
|
||||
|
||||
max_time = get_threshold(ModuleCategory.SECRET_DETECTION, "max_execution_time_small")
|
||||
assert result.execution_time < max_time * 2
|
||||
@@ -0,0 +1,547 @@
|
||||
"""
|
||||
Secret Detection Tools Comparison Report Generator
|
||||
|
||||
Generates comparison reports showing strengths/weaknesses of each tool.
|
||||
Uses workflow execution via SDK to test complete pipeline.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Any, Optional
|
||||
from dataclasses import dataclass, asdict
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parents[3] / "sdk" / "src"))
|
||||
|
||||
from fuzzforge_sdk import FuzzForgeClient
|
||||
|
||||
|
||||
@dataclass
|
||||
class ToolResult:
|
||||
"""Results from running a tool"""
|
||||
tool_name: str
|
||||
execution_time: float
|
||||
findings_count: int
|
||||
findings_by_file: Dict[str, List[int]] # file_path -> [line_numbers]
|
||||
unique_files: int
|
||||
unique_locations: int # unique (file, line) pairs
|
||||
secret_density: float # average secrets per file
|
||||
file_types: Dict[str, int] # file extension -> count of files with secrets
|
||||
|
||||
|
||||
class SecretDetectionComparison:
|
||||
"""Compare secret detection tools"""
|
||||
|
||||
def __init__(self, target_path: Path, api_url: str = "http://localhost:8000"):
|
||||
self.target_path = target_path
|
||||
self.client = FuzzForgeClient(base_url=api_url)
|
||||
|
||||
async def run_workflow(self, workflow_name: str, tool_name: str, config: Dict[str, Any] = None) -> Optional[ToolResult]:
|
||||
"""Run a workflow and extract findings"""
|
||||
print(f"\nRunning {tool_name} workflow...")
|
||||
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
# Start workflow
|
||||
run = self.client.submit_workflow_with_upload(
|
||||
workflow_name=workflow_name,
|
||||
target_path=str(self.target_path),
|
||||
parameters=config or {}
|
||||
)
|
||||
|
||||
print(f" Started run: {run.run_id}")
|
||||
|
||||
# Wait for completion (up to 30 minutes for slow LLMs)
|
||||
print(f" Waiting for completion...")
|
||||
result = self.client.wait_for_completion(run.run_id, timeout=1800)
|
||||
|
||||
execution_time = time.time() - start_time
|
||||
|
||||
if result.status != "COMPLETED":
|
||||
print(f"❌ {tool_name} workflow failed: {result.status}")
|
||||
return None
|
||||
|
||||
# Get findings from SARIF
|
||||
findings = self.client.get_run_findings(run.run_id)
|
||||
|
||||
if not findings or not findings.sarif:
|
||||
print(f"⚠️ {tool_name} produced no findings")
|
||||
return None
|
||||
|
||||
# Extract results from SARIF and group by file
|
||||
findings_by_file = {}
|
||||
unique_locations = set()
|
||||
|
||||
for run_data in findings.sarif.get("runs", []):
|
||||
for result in run_data.get("results", []):
|
||||
locations = result.get("locations", [])
|
||||
for location in locations:
|
||||
physical_location = location.get("physicalLocation", {})
|
||||
artifact_location = physical_location.get("artifactLocation", {})
|
||||
region = physical_location.get("region", {})
|
||||
|
||||
uri = artifact_location.get("uri", "")
|
||||
line = region.get("startLine", 0)
|
||||
|
||||
if uri and line:
|
||||
if uri not in findings_by_file:
|
||||
findings_by_file[uri] = []
|
||||
findings_by_file[uri].append(line)
|
||||
unique_locations.add((uri, line))
|
||||
|
||||
# Sort line numbers for each file
|
||||
for file_path in findings_by_file:
|
||||
findings_by_file[file_path] = sorted(set(findings_by_file[file_path]))
|
||||
|
||||
# Calculate file type distribution
|
||||
file_types = {}
|
||||
for file_path in findings_by_file:
|
||||
ext = Path(file_path).suffix or Path(file_path).name # Use full name for files like .env
|
||||
if ext.startswith('.'):
|
||||
file_types[ext] = file_types.get(ext, 0) + 1
|
||||
else:
|
||||
file_types['[no extension]'] = file_types.get('[no extension]', 0) + 1
|
||||
|
||||
# Calculate secret density
|
||||
secret_density = len(unique_locations) / len(findings_by_file) if findings_by_file else 0
|
||||
|
||||
print(f" ✓ Found {len(unique_locations)} secrets in {len(findings_by_file)} files (avg {secret_density:.1f} per file)")
|
||||
|
||||
return ToolResult(
|
||||
tool_name=tool_name,
|
||||
execution_time=execution_time,
|
||||
findings_count=len(unique_locations),
|
||||
findings_by_file=findings_by_file,
|
||||
unique_files=len(findings_by_file),
|
||||
unique_locations=len(unique_locations),
|
||||
secret_density=secret_density,
|
||||
file_types=file_types
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ {tool_name} error: {e}")
|
||||
return None
|
||||
|
||||
|
||||
async def run_all_tools(self, llm_models: List[str] = None) -> List[ToolResult]:
|
||||
"""Run all available tools"""
|
||||
results = []
|
||||
|
||||
if llm_models is None:
|
||||
llm_models = ["gpt-4o-mini"]
|
||||
|
||||
# Gitleaks
|
||||
result = await self.run_workflow("gitleaks_detection", "Gitleaks", {
|
||||
"scan_mode": "detect",
|
||||
"no_git": True,
|
||||
"redact": False
|
||||
})
|
||||
if result:
|
||||
results.append(result)
|
||||
|
||||
# TruffleHog
|
||||
result = await self.run_workflow("trufflehog_detection", "TruffleHog", {
|
||||
"verify": False,
|
||||
"max_depth": 10
|
||||
})
|
||||
if result:
|
||||
results.append(result)
|
||||
|
||||
# LLM Detector with multiple models
|
||||
for model in llm_models:
|
||||
tool_name = f"LLM ({model})"
|
||||
result = await self.run_workflow("llm_secret_detection", tool_name, {
|
||||
"agent_url": "http://fuzzforge-task-agent:8000/a2a/litellm_agent",
|
||||
"llm_model": model,
|
||||
"llm_provider": "openai" if "gpt" in model else "anthropic",
|
||||
"max_files": 20,
|
||||
"timeout": 60,
|
||||
"file_patterns": [
|
||||
"*.py", "*.js", "*.ts", "*.java", "*.go", "*.env", "*.yaml", "*.yml",
|
||||
"*.json", "*.xml", "*.ini", "*.sql", "*.properties", "*.sh", "*.bat",
|
||||
"*.config", "*.conf", "*.toml", "*id_rsa*", "*.txt"
|
||||
]
|
||||
})
|
||||
if result:
|
||||
results.append(result)
|
||||
|
||||
return results
|
||||
|
||||
def _calculate_agreement_matrix(self, results: List[ToolResult]) -> Dict[str, Dict[str, int]]:
|
||||
"""Calculate overlap matrix showing common secrets between tool pairs"""
|
||||
matrix = {}
|
||||
|
||||
for i, result1 in enumerate(results):
|
||||
matrix[result1.tool_name] = {}
|
||||
# Convert to set of (file, line) tuples
|
||||
secrets1 = set()
|
||||
for file_path, lines in result1.findings_by_file.items():
|
||||
for line in lines:
|
||||
secrets1.add((file_path, line))
|
||||
|
||||
for result2 in results:
|
||||
secrets2 = set()
|
||||
for file_path, lines in result2.findings_by_file.items():
|
||||
for line in lines:
|
||||
secrets2.add((file_path, line))
|
||||
|
||||
# Count common secrets
|
||||
common = len(secrets1 & secrets2)
|
||||
matrix[result1.tool_name][result2.tool_name] = common
|
||||
|
||||
return matrix
|
||||
|
||||
def _get_per_file_comparison(self, results: List[ToolResult]) -> Dict[str, Dict[str, int]]:
|
||||
"""Get per-file breakdown of findings across all tools"""
|
||||
all_files = set()
|
||||
for result in results:
|
||||
all_files.update(result.findings_by_file.keys())
|
||||
|
||||
comparison = {}
|
||||
for file_path in sorted(all_files):
|
||||
comparison[file_path] = {}
|
||||
for result in results:
|
||||
comparison[file_path][result.tool_name] = len(result.findings_by_file.get(file_path, []))
|
||||
|
||||
return comparison
|
||||
|
||||
def _get_agreement_stats(self, results: List[ToolResult]) -> Dict[int, int]:
|
||||
"""Calculate how many secrets are found by 1, 2, 3, or all tools"""
|
||||
# Collect all unique (file, line) pairs across all tools
|
||||
all_secrets = {} # (file, line) -> list of tools that found it
|
||||
|
||||
for result in results:
|
||||
for file_path, lines in result.findings_by_file.items():
|
||||
for line in lines:
|
||||
key = (file_path, line)
|
||||
if key not in all_secrets:
|
||||
all_secrets[key] = []
|
||||
all_secrets[key].append(result.tool_name)
|
||||
|
||||
# Count by number of tools
|
||||
agreement_counts = {}
|
||||
for secret, tools in all_secrets.items():
|
||||
count = len(set(tools)) # Unique tools
|
||||
agreement_counts[count] = agreement_counts.get(count, 0) + 1
|
||||
|
||||
return agreement_counts
|
||||
|
||||
def generate_markdown_report(self, results: List[ToolResult]) -> str:
|
||||
"""Generate markdown comparison report"""
|
||||
report = []
|
||||
report.append("# Secret Detection Tools Comparison\n")
|
||||
report.append(f"**Target**: {self.target_path.name}")
|
||||
report.append(f"**Tools**: {', '.join([r.tool_name for r in results])}\n")
|
||||
|
||||
# Summary table with extended metrics
|
||||
report.append("\n## Summary\n")
|
||||
report.append("| Tool | Secrets | Files | Avg/File | Time (s) |")
|
||||
report.append("|------|---------|-------|----------|----------|")
|
||||
|
||||
for result in results:
|
||||
report.append(
|
||||
f"| {result.tool_name} | "
|
||||
f"{result.findings_count} | "
|
||||
f"{result.unique_files} | "
|
||||
f"{result.secret_density:.1f} | "
|
||||
f"{result.execution_time:.2f} |"
|
||||
)
|
||||
|
||||
# Agreement Analysis
|
||||
agreement_stats = self._get_agreement_stats(results)
|
||||
report.append("\n## Agreement Analysis\n")
|
||||
report.append("Secrets found by different numbers of tools:\n")
|
||||
for num_tools in sorted(agreement_stats.keys(), reverse=True):
|
||||
count = agreement_stats[num_tools]
|
||||
if num_tools == len(results):
|
||||
report.append(f"- **All {num_tools} tools agree**: {count} secrets")
|
||||
elif num_tools == 1:
|
||||
report.append(f"- **Only 1 tool found**: {count} secrets")
|
||||
else:
|
||||
report.append(f"- **{num_tools} tools agree**: {count} secrets")
|
||||
|
||||
# Agreement Matrix
|
||||
agreement_matrix = self._calculate_agreement_matrix(results)
|
||||
report.append("\n## Tool Agreement Matrix\n")
|
||||
report.append("Number of common secrets found by tool pairs:\n")
|
||||
|
||||
# Header row
|
||||
header = "| Tool |"
|
||||
separator = "|------|"
|
||||
for result in results:
|
||||
short_name = result.tool_name.replace("LLM (", "").replace(")", "")
|
||||
header += f" {short_name} |"
|
||||
separator += "------|"
|
||||
report.append(header)
|
||||
report.append(separator)
|
||||
|
||||
# Data rows
|
||||
for result in results:
|
||||
short_name = result.tool_name.replace("LLM (", "").replace(")", "")
|
||||
row = f"| {short_name} |"
|
||||
for result2 in results:
|
||||
count = agreement_matrix[result.tool_name][result2.tool_name]
|
||||
row += f" {count} |"
|
||||
report.append(row)
|
||||
|
||||
# Per-File Comparison
|
||||
per_file = self._get_per_file_comparison(results)
|
||||
report.append("\n## Per-File Detailed Comparison\n")
|
||||
report.append("Secrets found per file by each tool:\n")
|
||||
|
||||
# Header
|
||||
header = "| File |"
|
||||
separator = "|------|"
|
||||
for result in results:
|
||||
short_name = result.tool_name.replace("LLM (", "").replace(")", "")
|
||||
header += f" {short_name} |"
|
||||
separator += "------|"
|
||||
header += " Total |"
|
||||
separator += "------|"
|
||||
report.append(header)
|
||||
report.append(separator)
|
||||
|
||||
# Show top 15 files by total findings
|
||||
file_totals = [(f, sum(counts.values())) for f, counts in per_file.items()]
|
||||
file_totals.sort(key=lambda x: x[1], reverse=True)
|
||||
|
||||
for file_path, total in file_totals[:15]:
|
||||
row = f"| `{file_path}` |"
|
||||
for result in results:
|
||||
count = per_file[file_path].get(result.tool_name, 0)
|
||||
row += f" {count} |"
|
||||
row += f" **{total}** |"
|
||||
report.append(row)
|
||||
|
||||
if len(file_totals) > 15:
|
||||
report.append(f"| ... and {len(file_totals) - 15} more files | ... | ... | ... | ... | ... |")
|
||||
|
||||
# File Type Breakdown
|
||||
report.append("\n## File Type Breakdown\n")
|
||||
all_extensions = set()
|
||||
for result in results:
|
||||
all_extensions.update(result.file_types.keys())
|
||||
|
||||
if all_extensions:
|
||||
header = "| Type |"
|
||||
separator = "|------|"
|
||||
for result in results:
|
||||
short_name = result.tool_name.replace("LLM (", "").replace(")", "")
|
||||
header += f" {short_name} |"
|
||||
separator += "------|"
|
||||
report.append(header)
|
||||
report.append(separator)
|
||||
|
||||
for ext in sorted(all_extensions):
|
||||
row = f"| `{ext}` |"
|
||||
for result in results:
|
||||
count = result.file_types.get(ext, 0)
|
||||
row += f" {count} files |"
|
||||
report.append(row)
|
||||
|
||||
# File analysis
|
||||
report.append("\n## Files Analyzed\n")
|
||||
|
||||
# Collect all unique files across all tools
|
||||
all_files = set()
|
||||
for result in results:
|
||||
all_files.update(result.findings_by_file.keys())
|
||||
|
||||
report.append(f"**Total unique files with secrets**: {len(all_files)}\n")
|
||||
|
||||
for result in results:
|
||||
report.append(f"\n### {result.tool_name}\n")
|
||||
report.append(f"Found secrets in **{result.unique_files} files**:\n")
|
||||
|
||||
# Sort files by number of findings (descending)
|
||||
sorted_files = sorted(
|
||||
result.findings_by_file.items(),
|
||||
key=lambda x: len(x[1]),
|
||||
reverse=True
|
||||
)
|
||||
|
||||
# Show top 10 files
|
||||
for file_path, lines in sorted_files[:10]:
|
||||
report.append(f"- `{file_path}`: {len(lines)} secrets (lines: {', '.join(map(str, lines[:5]))}{'...' if len(lines) > 5 else ''})")
|
||||
|
||||
if len(sorted_files) > 10:
|
||||
report.append(f"- ... and {len(sorted_files) - 10} more files")
|
||||
|
||||
# Overlap analysis
|
||||
if len(results) >= 2:
|
||||
report.append("\n## Overlap Analysis\n")
|
||||
|
||||
# Find common files
|
||||
file_sets = [set(r.findings_by_file.keys()) for r in results]
|
||||
common_files = set.intersection(*file_sets) if file_sets else set()
|
||||
|
||||
if common_files:
|
||||
report.append(f"\n**Files found by all tools** ({len(common_files)}):\n")
|
||||
for file_path in sorted(common_files)[:10]:
|
||||
report.append(f"- `{file_path}`")
|
||||
else:
|
||||
report.append("\n**No files were found by all tools**\n")
|
||||
|
||||
# Find tool-specific files
|
||||
for i, result in enumerate(results):
|
||||
unique_to_tool = set(result.findings_by_file.keys())
|
||||
for j, other_result in enumerate(results):
|
||||
if i != j:
|
||||
unique_to_tool -= set(other_result.findings_by_file.keys())
|
||||
|
||||
if unique_to_tool:
|
||||
report.append(f"\n**Unique to {result.tool_name}** ({len(unique_to_tool)} files):\n")
|
||||
for file_path in sorted(unique_to_tool)[:5]:
|
||||
report.append(f"- `{file_path}`")
|
||||
if len(unique_to_tool) > 5:
|
||||
report.append(f"- ... and {len(unique_to_tool) - 5} more")
|
||||
|
||||
# Ground Truth Analysis (if available)
|
||||
ground_truth_path = Path(__file__).parent / "secret_detection_benchmark_GROUND_TRUTH.json"
|
||||
if ground_truth_path.exists():
|
||||
report.append("\n## Ground Truth Analysis\n")
|
||||
try:
|
||||
with open(ground_truth_path) as f:
|
||||
gt_data = json.load(f)
|
||||
|
||||
gt_total = gt_data.get("total_secrets", 30)
|
||||
report.append(f"**Expected secrets**: {gt_total} (documented in ground truth)\n")
|
||||
|
||||
# Build ground truth set of (file, line) tuples
|
||||
gt_secrets = set()
|
||||
for secret in gt_data.get("secrets", []):
|
||||
gt_secrets.add((secret["file"], secret["line"]))
|
||||
|
||||
report.append("### Tool Performance vs Ground Truth\n")
|
||||
report.append("| Tool | Found | Expected | Recall | Extra Findings |")
|
||||
report.append("|------|-------|----------|--------|----------------|")
|
||||
|
||||
for result in results:
|
||||
# Build tool findings set
|
||||
tool_secrets = set()
|
||||
for file_path, lines in result.findings_by_file.items():
|
||||
for line in lines:
|
||||
tool_secrets.add((file_path, line))
|
||||
|
||||
# Calculate metrics
|
||||
true_positives = len(gt_secrets & tool_secrets)
|
||||
recall = (true_positives / gt_total * 100) if gt_total > 0 else 0
|
||||
extra = len(tool_secrets - gt_secrets)
|
||||
|
||||
report.append(
|
||||
f"| {result.tool_name} | "
|
||||
f"{result.findings_count} | "
|
||||
f"{gt_total} | "
|
||||
f"{recall:.1f}% | "
|
||||
f"{extra} |"
|
||||
)
|
||||
|
||||
# Analyze LLM extra findings
|
||||
llm_results = [r for r in results if "LLM" in r.tool_name]
|
||||
if llm_results:
|
||||
report.append("\n### LLM Extra Findings Explanation\n")
|
||||
report.append("LLMs may find more than 30 secrets because they detect:\n")
|
||||
report.append("- **Split secret components**: Each part of `DB_PASS_PART1 + PART2 + PART3` counted separately")
|
||||
report.append("- **Join operations**: Lines like `''.join(AWS_SECRET_CHARS)` flagged as additional exposure")
|
||||
report.append("- **Decoding functions**: Code that reveals secrets (e.g., `base64.b64decode()`, `codecs.decode()`)")
|
||||
report.append("- **Comment identifiers**: Lines marking secret locations without plaintext values")
|
||||
report.append("\nThese are *technically correct* detections of secret exposure points, not false positives.")
|
||||
report.append("The ground truth documents 30 'primary' secrets, but the codebase has additional derivative exposures.\n")
|
||||
|
||||
except Exception as e:
|
||||
report.append(f"*Could not load ground truth: {e}*\n")
|
||||
|
||||
# Performance summary
|
||||
if results:
|
||||
report.append("\n## Performance Summary\n")
|
||||
most_findings = max(results, key=lambda r: r.findings_count)
|
||||
most_files = max(results, key=lambda r: r.unique_files)
|
||||
fastest = min(results, key=lambda r: r.execution_time)
|
||||
|
||||
report.append(f"- **Most secrets found**: {most_findings.tool_name} ({most_findings.findings_count} secrets)")
|
||||
report.append(f"- **Most files covered**: {most_files.tool_name} ({most_files.unique_files} files)")
|
||||
report.append(f"- **Fastest**: {fastest.tool_name} ({fastest.execution_time:.2f}s)")
|
||||
|
||||
return "\n".join(report)
|
||||
|
||||
def save_json_report(self, results: List[ToolResult], output_path: Path):
|
||||
"""Save results as JSON"""
|
||||
data = {
|
||||
"target_path": str(self.target_path),
|
||||
"results": [asdict(r) for r in results]
|
||||
}
|
||||
|
||||
with open(output_path, 'w') as f:
|
||||
json.dump(data, f, indent=2)
|
||||
|
||||
print(f"\n✅ JSON report saved to: {output_path}")
|
||||
|
||||
def cleanup(self):
|
||||
"""Cleanup SDK client"""
|
||||
self.client.close()
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run comparison and generate reports"""
|
||||
# Get target path (secret_detection_benchmark)
|
||||
target_path = Path(__file__).parent.parent.parent.parent.parent / "test_projects" / "secret_detection_benchmark"
|
||||
|
||||
if not target_path.exists():
|
||||
print(f"❌ Target not found at: {target_path}")
|
||||
return 1
|
||||
|
||||
print("=" * 80)
|
||||
print("Secret Detection Tools Comparison")
|
||||
print("=" * 80)
|
||||
print(f"Target: {target_path}")
|
||||
|
||||
# LLM models to test
|
||||
llm_models = [
|
||||
"gpt-4o-mini",
|
||||
"gpt-5-mini"
|
||||
]
|
||||
print(f"LLM models: {', '.join(llm_models)}\n")
|
||||
|
||||
# Run comparison
|
||||
comparison = SecretDetectionComparison(target_path)
|
||||
|
||||
try:
|
||||
results = await comparison.run_all_tools(llm_models=llm_models)
|
||||
|
||||
if not results:
|
||||
print("❌ No tools ran successfully")
|
||||
return 1
|
||||
|
||||
# Generate reports
|
||||
print("\n" + "=" * 80)
|
||||
markdown_report = comparison.generate_markdown_report(results)
|
||||
print(markdown_report)
|
||||
|
||||
# Save reports
|
||||
output_dir = Path(__file__).parent / "results"
|
||||
output_dir.mkdir(exist_ok=True)
|
||||
|
||||
markdown_path = output_dir / "comparison_report.md"
|
||||
with open(markdown_path, 'w') as f:
|
||||
f.write(markdown_report)
|
||||
print(f"\n✅ Markdown report saved to: {markdown_path}")
|
||||
|
||||
json_path = output_dir / "comparison_results.json"
|
||||
comparison.save_json_report(results, json_path)
|
||||
|
||||
print("\n" + "=" * 80)
|
||||
print("✅ Comparison complete!")
|
||||
print("=" * 80)
|
||||
|
||||
return 0
|
||||
|
||||
finally:
|
||||
comparison.cleanup()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
exit_code = asyncio.run(main())
|
||||
sys.exit(exit_code)
|
||||
Reference in New Issue
Block a user