# FuzzForge Benchmark Suite

Performance benchmarking infrastructure organized by module category.

## Directory Structure

```
benchmarks/
├── conftest.py              # Benchmark fixtures
├── category_configs.py      # Category-specific thresholds
├── by_category/             # Benchmarks organized by category
│   ├── fuzzer/
│   │   ├── bench_cargo_fuzz.py
│   │   └── bench_atheris.py
│   ├── scanner/
│   │   └── bench_file_scanner.py
│   ├── secret_detection/
│   │   ├── bench_gitleaks.py
│   │   └── bench_trufflehog.py
│   └── analyzer/
│       └── bench_security_analyzer.py
├── fixtures/                # Benchmark test data
│   ├── small/               # ~1K LOC
│   ├── medium/              # ~10K LOC
│   └── large/               # ~100K LOC
└── results/                 # Benchmark results (JSON)
```

## Module Categories

### Fuzzer

**Expected Metrics**: execs/sec, coverage_rate, time_to_crash, memory_usage

**Performance Thresholds**:
- Min 1000 execs/sec
- Max 10s for small projects
- Max 2GB memory

### Scanner

**Expected Metrics**: files/sec, LOC/sec, findings_count

**Performance Thresholds**:
- Min 100 files/sec
- Min 10K LOC/sec
- Max 512MB memory

### Secret Detection

**Expected Metrics**: patterns/sec, precision, recall, F1

**Performance Thresholds**:
- Min 90% precision
- Min 95% recall
- Max 5 false positives per 100 secrets

### Analyzer

**Expected Metrics**: analysis_depth, files/sec, accuracy

**Performance Thresholds**:
- Min 10 files/sec (deep analysis)
- Min 85% accuracy
- Max 2GB memory

## Running Benchmarks

### All Benchmarks

```bash
cd backend
pytest benchmarks/ --benchmark-only -v
```

### Specific Category

```bash
pytest benchmarks/by_category/fuzzer/ --benchmark-only -v
```

### With Comparison

```bash
# Run and save baseline
pytest benchmarks/ --benchmark-only --benchmark-save=baseline

# Compare against baseline
pytest benchmarks/ --benchmark-only --benchmark-compare=baseline
```

### Generate Histogram

```bash
pytest benchmarks/ --benchmark-only --benchmark-histogram=histogram
```

## Benchmark Results

Results are saved as JSON and include:

- Mean execution time
- Standard deviation
- Min/Max values
- Iterations per second
- Memory usage

Example output:

```
------------------------ benchmark: fuzzer --------------------------
Name                          Mean       StdDev     Ops/Sec
bench_cargo_fuzz[discovery]   0.0012s    0.0001s    833.33
bench_cargo_fuzz[execution]   0.1250s    0.0050s    8.00
bench_cargo_fuzz[memory]      0.0100s    0.0005s    100.00
---------------------------------------------------------------------
```

## CI/CD Integration

Benchmarks run:

- **Nightly**: Full benchmark suite, track trends
- **On PR**: When files under `benchmarks/` or `modules/` change
- **Manual**: Via `workflow_dispatch`

### Regression Detection

Benchmarks automatically fail if:

- Performance degrades >10%
- Memory usage exceeds thresholds
- Throughput drops below minimum

See `.github/workflows/benchmark.yml` for configuration.

## Adding New Benchmarks

### 1. Create a benchmark file in the category directory

```python
# benchmarks/by_category/fuzzer/bench_new_fuzzer.py
import pytest

from benchmarks.category_configs import ModuleCategory, get_threshold


@pytest.mark.benchmark(group="fuzzer")
def test_execution_performance(benchmark, new_fuzzer, config, test_workspace):
    """Benchmark execution speed."""
    # new_fuzzer, config, and test_workspace are fixtures (e.g. from conftest.py)
    result = benchmark(new_fuzzer.execute, config, test_workspace)

    # Validate against the category threshold
    threshold = get_threshold(ModuleCategory.FUZZER, "max_execution_time_small")
    assert result.execution_time < threshold
```

### 2. Update `category_configs.py` if needed

Add new thresholds or metrics for your module.
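The `get_threshold()` call in the example above implies that `category_configs.py` maps each `ModuleCategory` to a set of named thresholds. The real file is not reproduced here; the sketch below is a minimal illustration of that shape, reusing the values from the Module Categories section. Every key name except `max_execution_time_small` (used in the example above) is an assumption.

```python
# Illustrative sketch only; the actual category_configs.py may differ.
from enum import Enum


class ModuleCategory(str, Enum):
    FUZZER = "fuzzer"
    SCANNER = "scanner"
    SECRET_DETECTION = "secret_detection"
    ANALYZER = "analyzer"


# Threshold values mirror the documented category thresholds; key names
# (other than max_execution_time_small) are placeholders.
THRESHOLDS: dict[ModuleCategory, dict[str, float]] = {
    ModuleCategory.FUZZER: {
        "min_execs_per_sec": 1000,
        "max_execution_time_small": 10.0,  # seconds, small fixtures
        "max_memory_mb": 2048,
    },
    ModuleCategory.SCANNER: {
        "min_files_per_sec": 100,
        "min_loc_per_sec": 10_000,
        "max_memory_mb": 512,
    },
    ModuleCategory.SECRET_DETECTION: {
        "min_precision": 0.90,
        "min_recall": 0.95,
        "max_false_positives_per_100": 5,
    },
    ModuleCategory.ANALYZER: {
        "min_files_per_sec": 10,  # deep analysis
        "min_accuracy": 0.85,
        "max_memory_mb": 2048,
    },
}


def get_threshold(category: ModuleCategory, name: str) -> float:
    """Look up a single named threshold for a category."""
    return THRESHOLDS[category][name]
```

With a layout like this, adding a metric to an existing category is a one-line dictionary change, and a new category only needs a new enum member plus a threshold entry.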
### 3. Run locally

```bash
pytest benchmarks/by_category/fuzzer/bench_new_fuzzer.py --benchmark-only -v
```

## Best Practices

1. **Use mocking** for external dependencies (network, disk I/O)
2. **Fixed iterations** for consistent benchmarking
3. **Warm-up runs** for JIT-compiled code
4. **Category-specific metrics** aligned with module purpose
5. **Realistic fixtures** that represent actual use cases
6. **Memory profiling** using tracemalloc
7. **Compare apples to apples** within the same category

## Interpreting Results

### Good Performance

- ✅ Execution time below threshold
- ✅ Memory usage within limits
- ✅ Throughput meets minimum
- ✅ <5% variance across runs

### Performance Issues

- ⚠️ Execution time 10-20% over threshold
- ❌ Execution time >20% over threshold
- ❌ Memory leaks (increasing over iterations)
- ❌ High variance (>10%) indicates instability

## Tracking Performance Over Time

Benchmark results are stored as artifacts with:

- Commit SHA
- Timestamp
- Environment details (Python version, OS)
- Full metrics

Use these to track long-term performance trends and detect gradual degradation.
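The sketch below shows one possible way to turn those artifacts into a trend report: it loads saved result files and flags any test whose mean time grew by more than the 10% regression budget between consecutive runs. It assumes pytest-benchmark's default JSON layout (a top-level `benchmarks` list with per-test `stats`, plus `commit_info`); the `benchmarks/results/` path and the script name are illustrative.

```python
# trend_report.py -- illustrative sketch, not part of the suite.
# Assumes pytest-benchmark's JSON output format; adjust key names if the
# saved artifacts use a custom schema.
import json
from pathlib import Path

RESULTS_DIR = Path("benchmarks/results")  # assumed location of saved JSON
REGRESSION_FACTOR = 1.10                  # >10% slower counts as a regression


def load_runs(results_dir: Path) -> list[dict]:
    """Load all saved benchmark runs, oldest first by filename."""
    runs = []
    for path in sorted(results_dir.glob("*.json")):
        data = json.loads(path.read_text())
        runs.append(
            {
                "file": path.name,
                "commit": data.get("commit_info", {}).get("id", "unknown"),
                "datetime": data.get("datetime", ""),
                # test name -> mean seconds
                "means": {b["name"]: b["stats"]["mean"] for b in data["benchmarks"]},
            }
        )
    return runs


def report(runs: list[dict]) -> None:
    """Print per-test means and flag >10% slowdowns between consecutive runs."""
    for prev, curr in zip(runs, runs[1:]):
        print(f"{prev['commit'][:8]} -> {curr['commit'][:8]}")
        for name, mean in curr["means"].items():
            baseline = prev["means"].get(name)
            if baseline is None:
                continue  # new test, nothing to compare against
            flag = "REGRESSION" if mean > baseline * REGRESSION_FACTOR else "ok"
            print(f"  {name}: {baseline:.4f}s -> {mean:.4f}s [{flag}]")


if __name__ == "__main__":
    report(load_runs(RESULTS_DIR))
```

Run against the accumulated nightly artifacts, a report like this makes gradual degradation visible even when no single run trips the CI regression check.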