FuzzForge Benchmark Suite
Performance benchmarking infrastructure organized by module category.
Directory Structure
benchmarks/
├── conftest.py # Benchmark fixtures
├── category_configs.py # Category-specific thresholds
├── by_category/ # Benchmarks organized by category
│ ├── fuzzer/
│ │ ├── bench_cargo_fuzz.py
│ │ └── bench_atheris.py
│ ├── scanner/
│ │ └── bench_file_scanner.py
│ ├── secret_detection/
│ │ ├── bench_gitleaks.py
│ │ └── bench_trufflehog.py
│ └── analyzer/
│ └── bench_security_analyzer.py
├── fixtures/ # Benchmark test data
│ ├── small/ # ~1K LOC
│ ├── medium/ # ~10K LOC
│ └── large/ # ~100K LOC
└── results/ # Benchmark results (JSON)
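The benchmark fixtures referenced in conftest.py load test data from the fixtures/ directory. A minimal sketch of what such a fixture could look like, assuming a parametrized fixture over the three sizes (the layout and fixture body are illustrative; only the test_workspace name appears elsewhere in this suite):
# conftest.py -- illustrative sketch only; the real fixtures may differ
from pathlib import Path
import pytest

FIXTURES_DIR = Path(__file__).parent / "fixtures"

@pytest.fixture(params=["small", "medium", "large"])
def test_workspace(request):
    """Yield the path to one sized fixture set (~1K / ~10K / ~100K LOC)."""
    workspace = FIXTURES_DIR / request.param
    if not workspace.exists():
        pytest.skip(f"fixture set {request.param!r} is not present")
    return workspace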
Module Categories
Fuzzer
Expected Metrics: execs/sec, coverage_rate, time_to_crash, memory_usage
Performance Thresholds:
- Min 1000 execs/sec
- Max 10s execution time for small projects
- Max 2GB memory
Scanner
Expected Metrics: files/sec, LOC/sec, findings_count
Performance Thresholds:
- Min 100 files/sec
- Min 10K LOC/sec
- Max 512MB memory
Secret Detection
Expected Metrics: patterns/sec, precision, recall, F1
Performance Thresholds:
- Min 90% precision
- Min 95% recall
- Max 5 false positives per 100 secrets
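As a worked example of these thresholds, precision, recall, and F1 derive directly from true/false positive and false negative counts (the counts below are made-up illustration values):
# Illustrative quality check for a secret-detection run (example counts only)
true_positives = 95    # real secrets that were flagged
false_positives = 4    # non-secrets that were flagged
false_negatives = 5    # real secrets that were missed

precision = true_positives / (true_positives + false_positives)  # ~0.96
recall = true_positives / (true_positives + false_negatives)     # 0.95
f1 = 2 * precision * recall / (precision + recall)               # ~0.95

assert precision >= 0.90      # Min 90% precision
assert recall >= 0.95         # Min 95% recall
assert false_positives <= 5   # Max 5 false positives per 100 secrets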
Analyzer
Expected Metrics: analysis_depth, files/sec, accuracy
Performance Thresholds:
- Min 10 files/sec (deep analysis)
- Min 85% accuracy
- Max 2GB memory
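These thresholds are centralized in category_configs.py and looked up via get_threshold (see the example under "Adding New Benchmarks" below). A minimal sketch of how that module could be laid out; apart from max_execution_time_small, which is used later in this document, the key names are assumptions:
# category_configs.py -- sketch; most key names here are illustrative
from enum import Enum

class ModuleCategory(Enum):
    FUZZER = "fuzzer"
    SCANNER = "scanner"
    SECRET_DETECTION = "secret_detection"
    ANALYZER = "analyzer"

THRESHOLDS = {
    ModuleCategory.FUZZER: {
        "min_execs_per_sec": 1000,
        "max_execution_time_small": 10.0,  # seconds, small fixtures
        "max_memory_mb": 2048,
    },
    ModuleCategory.SCANNER: {
        "min_files_per_sec": 100,
        "min_loc_per_sec": 10_000,
        "max_memory_mb": 512,
    },
    # ... secret_detection and analyzer thresholds follow the same pattern
}

def get_threshold(category: ModuleCategory, name: str):
    """Return a single threshold value for the given category."""
    return THRESHOLDS[category][name]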
Running Benchmarks
All Benchmarks
cd backend
pytest benchmarks/ --benchmark-only -v
Specific Category
pytest benchmarks/by_category/fuzzer/ --benchmark-only -v
With Comparison
# Run and save baseline
pytest benchmarks/ --benchmark-only --benchmark-save=baseline
# Compare against baseline
pytest benchmarks/ --benchmark-only --benchmark-compare=baseline
Generate Histogram
pytest benchmarks/ --benchmark-only --benchmark-histogram=histogram
Benchmark Results
Results are saved as JSON and include:
- Mean execution time
- Standard deviation
- Min/Max values
- Iterations per second
- Memory usage
Example output:
------------------------ benchmark: fuzzer --------------------------
Name                         Mean     StdDev    Ops/Sec
bench_cargo_fuzz[discovery]  0.0012s  0.0001s   833.33
bench_cargo_fuzz[execution]  0.1250s  0.0050s     8.00
bench_cargo_fuzz[memory]     0.0100s  0.0005s   100.00
----------------------------------------------------------------------
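To inspect a saved run programmatically, the JSON produced by pytest-benchmark can be read directly; a small sketch (the file path is illustrative, the field names follow pytest-benchmark's JSON output):
# Summarize one saved results file (path is illustrative)
import json
from pathlib import Path

data = json.loads(Path("benchmarks/results/fuzzer_baseline.json").read_text())
for bench in data["benchmarks"]:
    stats = bench["stats"]
    print(f"{bench['name']:40s} mean={stats['mean']:.4f}s "
          f"stddev={stats['stddev']:.4f}s ops={stats['ops']:.2f}")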
CI/CD Integration
Benchmarks run:
- Nightly: Full benchmark suite, track trends
- On PR: When benchmarks/ or modules/ changed
- Manual: Via workflow_dispatch
Regression Detection
Benchmarks automatically fail if:
- Performance degrades >10%
- Memory usage exceeds thresholds
- Throughput drops below minimum
See .github/workflows/benchmark.yml for configuration.
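Locally, the same kind of gate can be reproduced with pytest-benchmark's comparison-failure option, for example failing when the mean regresses by more than 10% against a saved baseline:
# Fail if mean time regresses >10% versus the saved baseline
pytest benchmarks/ --benchmark-only --benchmark-compare=baseline --benchmark-compare-fail=mean:10%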
Adding New Benchmarks
1. Create benchmark file in category directory
# benchmarks/by_category/fuzzer/bench_new_fuzzer.py
import pytest
from benchmarks.category_configs import ModuleCategory, get_threshold

@pytest.mark.benchmark(group="fuzzer")
def test_execution_performance(benchmark, new_fuzzer, config, test_workspace):
    """Benchmark execution speed (new_fuzzer, config, and test_workspace are fixtures)."""
    result = benchmark(new_fuzzer.execute, config, test_workspace)
    # Validate against the category threshold
    threshold = get_threshold(ModuleCategory.FUZZER, "max_execution_time_small")
    assert result.execution_time < threshold
2. Update category_configs.py if needed
Add new thresholds or metrics for your module.
3. Run locally
pytest benchmarks/by_category/fuzzer/bench_new_fuzzer.py --benchmark-only -v
Best Practices
- Use mocking for external dependencies (network, disk I/O)
- Use a fixed number of iterations for consistent results
- Add warm-up runs for JIT-compiled code
- Choose category-specific metrics aligned with the module's purpose
- Use realistic fixtures that represent actual use cases
- Profile memory with tracemalloc (see the sketch after this list)
- Compare apples to apples within the same category
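A minimal way to wire tracemalloc into a memory check, as mentioned above (the helper name and the fixtures are hypothetical; the 512MB limit is the scanner threshold from earlier in this document):
# Illustrative peak-memory check using tracemalloc
import tracemalloc

def measure_peak_memory_mb(func, *args, **kwargs):
    """Run func and return its peak Python heap usage in megabytes."""
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)

def test_scanner_memory(file_scanner, test_workspace):  # fixture names are hypothetical
    peak_mb = measure_peak_memory_mb(file_scanner.scan, test_workspace)
    assert peak_mb <= 512  # scanner category: max 512MB memory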
Interpreting Results
Good Performance
- ✅ Execution time below threshold
- ✅ Memory usage within limits
- ✅ Throughput meets minimum
- ✅ <5% variance across runs
Performance Issues
- ⚠️ Execution time 10-20% over threshold
- ❌ Execution time >20% over threshold
- ❌ Memory leaks (increasing over iterations)
- ❌ High variance (>10%) indicates instability
Tracking Performance Over Time
Benchmark results are stored as artifacts with:
- Commit SHA
- Timestamp
- Environment details (Python version, OS)
- Full metrics
Use these to track long-term performance trends and detect gradual degradation.
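A simple way to tabulate those trends is to read the metadata pytest-benchmark embeds in each JSON file; a sketch, assuming the artifacts have been collected into a single directory (field names follow pytest-benchmark's output format; adjust if the stored artifacts differ):
# Summarize mean times across collected benchmark artifacts (directory is illustrative)
import json
from pathlib import Path

rows = []
for path in sorted(Path("benchmark-artifacts").glob("*.json")):
    data = json.loads(path.read_text())
    commit = data.get("commit_info", {}).get("id", "unknown")[:8]
    when = data.get("datetime", "unknown")
    for bench in data["benchmarks"]:
        rows.append((when, commit, bench["name"], bench["stats"]["mean"]))

for when, commit, name, mean in sorted(rows):
    print(f"{when}  {commit}  {name:40s}  mean={mean:.4f}s")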