FuzzForge Benchmark Suite

Performance benchmarking infrastructure organized by module category.

Directory Structure

benchmarks/
├── conftest.py              # Benchmark fixtures
├── category_configs.py      # Category-specific thresholds
├── by_category/             # Benchmarks organized by category
│   ├── fuzzer/
│   │   ├── bench_cargo_fuzz.py
│   │   └── bench_atheris.py
│   ├── scanner/
│   │   └── bench_file_scanner.py
│   ├── secret_detection/
│   │   ├── bench_gitleaks.py
│   │   └── bench_trufflehog.py
│   └── analyzer/
│       └── bench_security_analyzer.py
├── fixtures/                # Benchmark test data
│   ├── small/               # ~1K LOC
│   ├── medium/              # ~10K LOC
│   └── large/               # ~100K LOC
└── results/                 # Benchmark results (JSON)
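
The small/medium/large fixture projects are typically exposed to individual benchmarks through fixtures defined in conftest.py. A hypothetical sketch (the real fixture names and parameters live in the repository):

# Hypothetical sketch of a conftest.py fixture; actual names may differ.
from pathlib import Path

import pytest

FIXTURES_DIR = Path(__file__).parent / "fixtures"

@pytest.fixture(params=["small", "medium", "large"])
def test_workspace(request):
    """Return the path to one of the benchmark fixture projects."""
    return FIXTURES_DIR / request.param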

Module Categories

Fuzzer

Expected Metrics: execs/sec, coverage_rate, time_to_crash, memory_usage

Performance Thresholds:

  • Min 1000 execs/sec
  • Max 10s for small projects
  • Max 2GB memory

Scanner

Expected Metrics: files/sec, LOC/sec, findings_count

Performance Thresholds:

  • Min 100 files/sec
  • Min 10K LOC/sec
  • Max 512MB memory

Secret Detection

Expected Metrics: patterns/sec, precision, recall, F1

Performance Thresholds:

  • Min 90% precision
  • Min 95% recall
  • Max 5 false positives per 100 secrets
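
As a concrete illustration, precision, recall, and F1 can be computed from true positive, false positive, and false negative counts against a labeled fixture. A minimal sketch (the helper name is illustrative):

# Illustrative helper for a secret-detection benchmark.
def detection_metrics(true_positives: int, false_positives: int, false_negatives: int):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 95 planted secrets found, 3 false alarms, 5 missed
precision, recall, f1 = detection_metrics(95, 3, 5)
assert precision >= 0.90 and recall >= 0.95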

Analyzer

Expected Metrics: analysis_depth, files/sec, accuracy

Performance Thresholds:

  • Min 10 files/sec (deep analysis)
  • Min 85% accuracy
  • Max 2GB memory
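
These thresholds are encoded in category_configs.py and looked up by benchmarks via get_threshold. A minimal sketch of how that module might be laid out (the actual enum members, key names, and values are defined in the repository):

# Hypothetical sketch of category_configs.py; real keys and values may differ.
from enum import Enum

class ModuleCategory(Enum):
    FUZZER = "fuzzer"
    SCANNER = "scanner"
    SECRET_DETECTION = "secret_detection"
    ANALYZER = "analyzer"

# Thresholds keyed by category; names mirror the values listed above.
THRESHOLDS = {
    ModuleCategory.FUZZER: {
        "min_execs_per_sec": 1000,
        "max_execution_time_small": 10.0,  # seconds, small fixture
        "max_memory_mb": 2048,
    },
    ModuleCategory.SCANNER: {
        "min_files_per_sec": 100,
        "min_loc_per_sec": 10_000,
        "max_memory_mb": 512,
    },
    ModuleCategory.SECRET_DETECTION: {
        "min_precision": 0.90,
        "min_recall": 0.95,
        "max_false_positives_per_100": 5,
    },
    ModuleCategory.ANALYZER: {
        "min_files_per_sec_deep": 10,
        "min_accuracy": 0.85,
        "max_memory_mb": 2048,
    },
}

def get_threshold(category: ModuleCategory, name: str) -> float:
    """Look up a single threshold value for a category."""
    return THRESHOLDS[category][name]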

Running Benchmarks

All Benchmarks

cd backend
pytest benchmarks/ --benchmark-only -v

Specific Category

pytest benchmarks/by_category/fuzzer/ --benchmark-only -v

With Comparison

# Run and save baseline
pytest benchmarks/ --benchmark-only --benchmark-save=baseline

# Compare against baseline
pytest benchmarks/ --benchmark-only --benchmark-compare=baseline

Generate Histogram

pytest benchmarks/ --benchmark-only --benchmark-histogram=histogram

Benchmark Results

Results are saved as JSON and include:

  • Mean execution time
  • Standard deviation
  • Min/Max values
  • Iterations per second
  • Memory usage

Example output:

------------------------ benchmark: fuzzer --------------------------
Name                                Mean      StdDev    Ops/Sec
bench_cargo_fuzz[discovery]        0.0012s   0.0001s   833.33
bench_cargo_fuzz[execution]        0.1250s   0.0050s     8.00
bench_cargo_fuzz[memory]           0.0100s   0.0005s   100.00
---------------------------------------------------------------------
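
The JSON files can also be post-processed directly. A small sketch, assuming a run saved with --benchmark-json=results/fuzzer.json and pytest-benchmark's standard layout (per-test statistics under the "benchmarks" key):

# Sketch: summarize a pytest-benchmark JSON results file.
import json

with open("results/fuzzer.json") as f:  # path is an assumption
    data = json.load(f)

for bench in data["benchmarks"]:
    stats = bench["stats"]
    print(f'{bench["name"]}: mean={stats["mean"]:.4f}s '
          f'stddev={stats["stddev"]:.4f}s ops={stats["ops"]:.2f}')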

CI/CD Integration

Benchmarks run:

  • Nightly: Full benchmark suite, track trends
  • On PR: When benchmarks/ or modules/ changed
  • Manual: Via workflow_dispatch

Regression Detection

Benchmarks automatically fail if:

  • Performance degrades >10%
  • Memory usage exceeds thresholds
  • Throughput drops below minimum

See .github/workflows/benchmark.yml for configuration.
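
The same threshold can be enforced locally with pytest-benchmark's compare-fail option, for example failing the run when the mean regresses by more than 10% against a saved baseline:

pytest benchmarks/ --benchmark-only --benchmark-compare=baseline --benchmark-compare-fail=mean:10%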

Adding New Benchmarks

1. Create benchmark file in category directory

# benchmarks/by_category/fuzzer/bench_new_fuzzer.py

import pytest
from benchmarks.category_configs import ModuleCategory, get_threshold

@pytest.mark.benchmark(group="fuzzer")
def test_execution_performance(benchmark, new_fuzzer, config, test_workspace):
    """Benchmark execution speed"""
    # new_fuzzer, config, and test_workspace are fixtures expected from conftest.py
    result = benchmark(new_fuzzer.execute, config, test_workspace)

    # Validate against the category threshold
    threshold = get_threshold(ModuleCategory.FUZZER, "max_execution_time_small")
    assert result.execution_time < threshold

2. Update category_configs.py if needed

Add new thresholds or metrics for your module.
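
Continuing the hypothetical layout sketched earlier, adding a new fuzzer threshold could be a single extra entry:

# Hypothetical addition to category_configs.py (key name and value are illustrative)
THRESHOLDS[ModuleCategory.FUZZER]["min_coverage_rate"] = 0.50  # 50% edge coverage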

3. Run locally

pytest benchmarks/by_category/fuzzer/bench_new_fuzzer.py --benchmark-only -v

Best Practices

  1. Use mocking for external dependencies (network, disk I/O)
  2. Fixed iterations for consistent benchmarking
  3. Warm-up runs for JIT-compiled code
  4. Category-specific metrics aligned with module purpose
  5. Realistic fixtures that represent actual use cases
  6. Memory profiling using tracemalloc (see the sketch after this list)
  7. Compare apples to apples within the same category
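
A minimal sketch of memory profiling with tracemalloc inside a fuzzer benchmark (fixture names follow the earlier example and are assumptions):

# Sketch: measure peak memory of a single execution with tracemalloc.
import tracemalloc

def test_memory_usage(benchmark, new_fuzzer, config, test_workspace):
    def run_with_tracing():
        tracemalloc.start()
        new_fuzzer.execute(config, test_workspace)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return peak

    peak_bytes = benchmark(run_with_tracing)
    assert peak_bytes < 2 * 1024 ** 3  # stay under the 2 GB fuzzer limit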

Interpreting Results

Good Performance

  • Execution time below threshold
  • Memory usage within limits
  • Throughput meets minimum
  • <5% variance across runs

Performance Issues

  • Warning: execution time 10-20% over threshold
  • Failure: execution time >20% over threshold
  • Memory leaks (usage increasing over iterations)
  • High variance (>10% across runs) indicates instability

Tracking Performance Over Time

Benchmark results are stored as artifacts with:

  • Commit SHA
  • Timestamp
  • Environment details (Python version, OS)
  • Full metrics

Use these to track long-term performance trends and detect gradual degradation.
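
If pytest-benchmark's JSON output is used for the artifacts, this metadata is recorded alongside the statistics and can be extracted for trend tracking. A sketch assuming that layout:

# Sketch: pull run metadata from a saved pytest-benchmark results file.
import json

with open("results/fuzzer.json") as f:  # path is an assumption
    data = json.load(f)

print(data["commit_info"]["id"])               # commit SHA
print(data["datetime"])                        # timestamp of the run
print(data["machine_info"]["python_version"])  # environment details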