fuzzforge_ai/ARCHITECTURE.md
tduhamel42 60ca088ecf CI/CD Integration with Ephemeral Deployment Model (#14)
* feat: Complete migration from Prefect to Temporal

BREAKING CHANGE: Replaces Prefect workflow orchestration with Temporal

## Major Changes
- Replace Prefect with Temporal for workflow orchestration
- Implement vertical worker architecture (rust, android)
- Replace Docker registry with MinIO for unified storage
- Refactor activities to be co-located with workflows
- Update all API endpoints for Temporal compatibility

## Infrastructure
- New: docker-compose.temporal.yaml (Temporal + MinIO + workers)
- New: workers/ directory with rust and android vertical workers
- New: backend/src/temporal/ (manager, discovery)
- New: backend/src/storage/ (S3-cached storage with MinIO)
- New: backend/toolbox/common/ (shared storage activities)
- Deleted: docker-compose.yaml (old Prefect setup)
- Deleted: backend/src/core/prefect_manager.py
- Deleted: backend/src/services/prefect_stats_monitor.py
- Deleted: Docker registry and insecure-registries requirement

## Workflows
- Migrated: security_assessment workflow to Temporal
- New: rust_test workflow (example/test workflow)
- Deleted: secret_detection_scan (Prefect-based, to be reimplemented)
- Activities now co-located with workflows for independent testing

## API Changes
- Updated: backend/src/api/workflows.py (Temporal submission)
- Updated: backend/src/api/runs.py (Temporal status/results)
- Updated: backend/src/main.py (727 lines, TemporalManager integration)
- Updated: All 16 MCP tools to use TemporalManager

## Testing
- All services healthy (Temporal, PostgreSQL, MinIO, workers, backend)
- All API endpoints functional
- End-to-end workflow test passed (72 findings from vulnerable_app)
- MinIO storage integration working (target upload/download, results)
- Worker activity discovery working (6 activities registered)
- Tarball extraction working
- SARIF report generation working

## Documentation
- ARCHITECTURE.md: Complete Temporal architecture documentation
- QUICKSTART_TEMPORAL.md: Getting started guide
- MIGRATION_DECISION.md: Why we chose Temporal over Prefect
- IMPLEMENTATION_STATUS.md: Migration progress tracking
- workers/README.md: Worker development guide

## Dependencies
- Added: temporalio>=1.6.0
- Added: boto3>=1.34.0 (MinIO S3 client)
- Removed: prefect>=3.4.18

* feat: Add Python fuzzing vertical with Atheris integration

This commit implements a complete Python fuzzing workflow using Atheris:

## Python Worker (workers/python/)
- Dockerfile with Python 3.11, Atheris, and build tools
- Generic worker.py for dynamic workflow discovery
- requirements.txt with temporalio, boto3, atheris dependencies
- Added to docker-compose.temporal.yaml with dedicated cache volume

## AtherisFuzzer Module (backend/toolbox/modules/fuzzer/)
- Reusable module extending BaseModule
- Auto-discovers fuzz targets (fuzz_*.py, *_fuzz.py, fuzz_target.py)
- Recursive search to find targets in nested directories
- Dynamically loads TestOneInput() function
- Configurable max_iterations and timeout
- Real-time stats callback support for live monitoring
- Returns findings as ModuleFinding objects

## Atheris Fuzzing Workflow (backend/toolbox/workflows/atheris_fuzzing/)
- Temporal workflow for orchestrating fuzzing
- Downloads user code from MinIO
- Executes AtherisFuzzer module
- Uploads results to MinIO
- Cleans up cache after execution
- metadata.yaml with vertical: python for routing

## Test Project (test_projects/python_fuzz_waterfall/)
- Demonstrates stateful waterfall vulnerability
- main.py with check_secret() that leaks progress
- fuzz_target.py with Atheris TestOneInput() harness
- Complete README with usage instructions

## Backend Fixes
- Fixed parameter merging in REST API endpoints (workflows.py)
- Changed workflow parameter passing from positional args to kwargs (manager.py)
- Default parameters now properly merged with user parameters
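The merge described above can be sketched in a couple of lines (the function name is hypothetical; the actual logic lives in workflows.py):

```python
def merge_parameters(defaults: dict, user_params: dict) -> dict:
    """Merge workflow default_parameters with user-supplied parameters.

    User values take precedence over defaults; neither input is mutated.
    """
    return {**defaults, **user_params}

# A default max_iterations survives unless the user overrides it
merged = merge_parameters(
    {"max_iterations": 100_000, "timeout_seconds": 300},
    {"timeout_seconds": 600},
)
```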

## Testing
- Worker discovered AtherisFuzzingWorkflow
- Workflow executed end-to-end successfully
- Fuzz target auto-discovered in nested directories
- Atheris ran 100,000 iterations
- Results uploaded and cache cleaned

* chore: Complete Temporal migration with updated CLI/SDK/docs

This commit includes all remaining Temporal migration changes:

## CLI Updates (cli/)
- Updated workflow execution commands for Temporal
- Enhanced error handling and exceptions
- Updated dependencies in uv.lock

## SDK Updates (sdk/)
- Client methods updated for Temporal workflows
- Updated models for new workflow execution
- Updated dependencies in uv.lock

## Documentation Updates (docs/)
- Architecture documentation for Temporal
- Workflow concept documentation
- Resource management documentation (new)
- Debugging guide (new)
- Updated tutorials and how-to guides
- Troubleshooting updates

## README Updates
- Main README with Temporal instructions
- Backend README
- CLI README
- SDK README

## Other
- Updated IMPLEMENTATION_STATUS.md
- Removed old vulnerable_app.tar.gz

These changes complete the Temporal migration and ensure the
CLI/SDK work correctly with the new backend.

* fix: Use positional args instead of kwargs for Temporal workflows

The Temporal Python SDK's start_workflow() method doesn't accept
a 'kwargs' parameter. Workflows must receive parameters as positional
arguments via the 'args' parameter.

Changed to:
  args=workflow_args  # Positional arguments

This fixes the error:
  TypeError: Client.start_workflow() got an unexpected keyword argument 'kwargs'

Workflows now correctly receive parameters in order:
- security_assessment: [target_id, scanner_config, analyzer_config, reporter_config]
- atheris_fuzzing: [target_id, target_file, max_iterations, timeout_seconds]
- rust_test: [target_id, test_message]
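One way to flatten a parameter dict into the ordered positional list is sketched below (the parameter orders come from this commit message; the helper name and lookup table are illustrative, not the actual manager.py code):

```python
# Each workflow declares the order in which it expects its parameters.
WORKFLOW_ARG_ORDER = {
    "security_assessment": ["target_id", "scanner_config", "analyzer_config", "reporter_config"],
    "atheris_fuzzing": ["target_id", "target_file", "max_iterations", "timeout_seconds"],
    "rust_test": ["target_id", "test_message"],
}

def build_workflow_args(workflow_name: str, params: dict) -> list:
    """Flatten a parameter dict into the positional list Temporal expects."""
    order = WORKFLOW_ARG_ORDER[workflow_name]
    return [params[name] for name in order]

# The resulting list would then be passed via start_workflow(..., args=workflow_args)
workflow_args = build_workflow_args(
    "rust_test", {"test_message": "hello", "target_id": "abc123"}
)
```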

* fix: Filter metadata-only parameters from workflow arguments

SecurityAssessmentWorkflow was receiving 7 arguments instead of 2-5.
The issue was that target_path and volume_mode from default_parameters
were being passed to the workflow, when they should only be used by
the system for configuration.

Now filters out metadata-only parameters (target_path, volume_mode)
before passing arguments to workflow execution.
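The filtering step amounts to dropping a fixed set of system-level keys; a minimal sketch (names assumed from this commit, not the actual manager.py implementation):

```python
# Parameters consumed by the platform itself, never passed to workflows.
METADATA_ONLY_PARAMS = {"target_path", "volume_mode"}

def filter_workflow_params(params: dict) -> dict:
    """Drop metadata-only keys before handing parameters to the workflow."""
    return {k: v for k, v in params.items() if k not in METADATA_ONLY_PARAMS}

filtered = filter_workflow_params(
    {"target_id": "abc", "volume_mode": "ro", "target_path": "/tmp/x", "timeout": 60}
)
```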

* refactor: Remove Prefect leftovers and volume mounting legacy

Complete cleanup of Prefect migration artifacts:

Backend:
- Delete registry.py and workflow_discovery.py (Prefect-specific files)
- Remove Docker validation from setup.py (no longer needed)
- Remove ResourceLimits and VolumeMount models
- Remove target_path and volume_mode from WorkflowSubmission
- Remove supported_volume_modes from API and discovery
- Clean up metadata.yaml files (remove volume/path fields)
- Simplify parameter filtering in manager.py

SDK:
- Remove volume_mode parameter from client methods
- Remove ResourceLimits and VolumeMount models
- Remove Prefect error patterns from docker_logs.py
- Clean up WorkflowSubmission and WorkflowMetadata models

CLI:
- Remove Volume Modes display from workflow info

All removed features are Prefect-specific or Docker volume mounting
artifacts. Temporal workflows use MinIO storage exclusively.

* feat: Add comprehensive test suite and benchmark infrastructure

- Add 68 unit tests for fuzzer, scanner, and analyzer modules
- Implement pytest-based test infrastructure with fixtures
- Add 6 performance benchmarks with category-specific thresholds
- Configure GitHub Actions for automated testing and benchmarking
- Add test and benchmark documentation

Test coverage:
- AtherisFuzzer: 8 tests
- CargoFuzzer: 14 tests
- FileScanner: 22 tests
- SecurityAnalyzer: 24 tests

All tests passing (68/68)
All benchmarks passing (6/6)

* fix: Resolve all ruff linting violations across codebase

Fixed 27 ruff violations in 12 files:
- Removed unused imports (Depends, Dict, Any, Optional, etc.)
- Fixed undefined workflow_info variable in workflows.py
- Removed dead code with undefined variables in atheris_fuzzer.py
- Changed f-string to regular string where no placeholders used

All files now pass ruff checks for CI/CD compliance.

* fix: Configure CI for unit tests only

- Renamed docker-compose.temporal.yaml → docker-compose.yml for CI compatibility
- Commented out integration-tests job (no integration tests yet)
- Updated test-summary to only depend on lint and unit-tests

CI will now run successfully with 68 unit tests. Integration tests can be added later.

* feat: Add CI/CD integration with ephemeral deployment model

Implements comprehensive CI/CD support for FuzzForge with on-demand worker management:

**Worker Management (v0.7.0)**
- Add WorkerManager for automatic worker lifecycle control
- Auto-start workers from stopped state when workflows execute
- Auto-stop workers after workflow completion
- Health checks and startup timeout handling (90s default)

**CI/CD Features**
- `--fail-on` flag: Fail builds based on SARIF severity levels (error/warning/note/info)
- `--export-sarif` flag: Export findings in SARIF 2.1.0 format
- `--auto-start`/`--auto-stop` flags: Control worker lifecycle
- Exit code propagation: Returns 1 on blocking findings, 0 on success
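A sketch of how a `--fail-on` gate over a SARIF 2.1.0 report could work (illustrative only; function and variable names are assumptions, not the actual CLI code):

```python
# SARIF 2.1.0 result levels, most severe first.
SARIF_LEVELS = ["error", "warning", "note", "info"]

def blocking_findings(sarif: dict, fail_on: str) -> int:
    """Count SARIF results at or above the given severity level."""
    threshold = SARIF_LEVELS.index(fail_on)
    count = 0
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            level = result.get("level", "warning")  # SARIF's default level
            if level in SARIF_LEVELS and SARIF_LEVELS.index(level) <= threshold:
                count += 1
    return count

report = {"runs": [{"results": [{"level": "error"}, {"level": "note"}]}]}
# Exit 1 if any finding meets or exceeds the --fail-on threshold
exit_code = 1 if blocking_findings(report, "warning") else 0
```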

**Exit Code Fix**
- Add `except typer.Exit: raise` handlers at 3 critical locations
- Move worker cleanup to finally block for guaranteed execution
- Exit codes now propagate correctly even when build fails

**CI Scripts & Examples**
- ci-start.sh: Start FuzzForge services with health checks
- ci-stop.sh: Clean shutdown with volume preservation option
- GitHub Actions workflow example (security-scan.yml)
- GitLab CI pipeline example (.gitlab-ci.example.yml)
- docker-compose.ci.yml: CI-optimized compose file with profiles

**OSS-Fuzz Integration**
- New ossfuzz_campaign workflow for running OSS-Fuzz projects
- OSS-Fuzz worker with Docker-in-Docker support
- Configurable campaign duration and project selection

**Documentation**
- Comprehensive CI/CD integration guide (docs/how-to/cicd-integration.md)
- Updated architecture docs with worker lifecycle details
- Updated workspace isolation documentation
- CLI README with worker management examples

**SDK Enhancements**
- Add get_workflow_worker_info() endpoint
- Worker vertical metadata in workflow responses

**Testing**
- All workflows tested: security_assessment, atheris_fuzzing, secret_detection, cargo_fuzzing
- All monitoring commands tested: stats, crashes, status, finding
- Full CI pipeline simulation verified
- Exit codes verified for success/failure scenarios

Ephemeral CI/CD model: ~3-4GB RAM, ~60-90s startup, runs entirely in CI containers.

* fix: Resolve ruff linting violations in CI/CD code

- Remove unused variables (run_id, defaults, result)
- Remove unused imports
- Fix f-string without placeholders

All CI/CD integration files now pass ruff checks.
2025-10-14 10:13:45 +02:00

# FuzzForge AI Architecture
**Last Updated:** 2025-10-01
**Status:** Approved Architecture Plan
**Current Phase:** Migration from Prefect to Temporal with Vertical Workers
---
## Table of Contents
1. [Executive Summary](#executive-summary)
2. [Current Architecture (Prefect)](#current-architecture-prefect)
3. [Target Architecture (Temporal + Vertical Workers)](#target-architecture-temporal--vertical-workers)
4. [Vertical Worker Model](#vertical-worker-model)
5. [Storage Strategy (MinIO)](#storage-strategy-minio)
6. [Dynamic Workflow Loading](#dynamic-workflow-loading)
7. [Architecture Principles](#architecture-principles)
8. [Component Details](#component-details)
9. [Scaling Strategy](#scaling-strategy)
10. [File Lifecycle Management](#file-lifecycle-management)
11. [Future: Nomad Migration](#future-nomad-migration)
12. [Migration Timeline](#migration-timeline)
13. [Decision Log](#decision-log)
---
## Executive Summary
### The Decision
**Replace Prefect with Temporal** using a **vertical worker architecture** where each worker is pre-built with domain-specific security toolchains (Android, Rust, Web, iOS, Blockchain, etc.). Use **MinIO** for unified storage across dev and production environments.
### Why This Change?
| Aspect | Current (Prefect) | Target (Temporal + Verticals) |
|--------|-------------------|-------------------------------|
| **Services** | 6 (Server, Postgres, Redis, Registry, Docker-proxy, Worker) | 6 (Temporal, MinIO, MinIO-setup, 3+ vertical workers) |
| **Orchestration** | Prefect (complex) | Temporal (simpler, more reliable) |
| **Worker Model** | Ephemeral containers per workflow | Long-lived vertical workers with pre-built toolchains |
| **Storage** | Docker Registry + volume mounts | MinIO (S3-compatible) with caching |
| **Dynamic Workflows** | Build image per workflow | Mount workflow code as volume (no rebuild) |
| **Target Access** | Host volume mounts (/Users, /home) | Upload to MinIO, download to cache |
| **Memory Usage** | ~1.85GB | ~2.3GB (+24%, worth it for benefits) |
### Key Benefits
1. **Vertical Specialization:** Pre-built toolchains (Android: Frida, apktool; Rust: AFL++, cargo-fuzz)
2. **Zero Startup Overhead:** Long-lived workers (no 5s container spawn per workflow)
3. **Dynamic Workflows:** Add workflows without rebuilding images (mount as volume)
4. **Unified Storage:** MinIO works identically in dev and prod (no environment-specific code)
5. **Better Security:** No host filesystem mounts, isolated uploaded targets
6. **Automatic Cleanup:** MinIO lifecycle policies handle file expiration
7. **Marketing Advantage:** Sell "security verticals" not "generic orchestration" (safer Nomad BSL positioning)
8. **Scalability:** Clear path from single-host to multi-host to Nomad cluster
---
## Current Architecture (Prefect)
### Infrastructure Components
```
┌─────────────────────────────────────────────────────────┐
│ Docker Compose Stack (6 services) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Prefect │ │ Postgres │ │ Redis │ │
│ │ Server │ │ (metadata) │ │ (queue) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Registry │ │ Docker Proxy │ │ Prefect │ │
│ │ (images) │ │ (isolation) │ │ Worker │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
```
### Pain Points
- **Complexity:** 6 services to manage, configure, and monitor
- **Registry overhead:** Must push/pull images for every workflow deployment
- **Volume mounting complexity:** job_variables configuration per workflow
- **Dynamic workflows:** Requires rebuilding and pushing Docker images
- **Scalability:** Unclear how to scale beyond single host
- **Resource usage:** ~1.85GB baseline
---
## Target Architecture (Temporal + Vertical Workers)
### Infrastructure Overview
```
┌───────────────────────────────────────────────────────────────┐
│ FuzzForge Platform │
│ │
│ ┌──────────────────┐ ┌─────────────────────────┐ │
│ │ Temporal Server │◄────────│ MinIO (S3 Storage) │ │
│ │ - Workflows │ │ - Uploaded targets │ │
│ │ - State mgmt │ │ - Results (optional) │ │
│ │ - Task queues │ │ - Lifecycle policies │ │
│ └────────┬─────────┘ └─────────────────────────┘ │
│ │ │
│ │ (Task queue routing) │
│ │ │
│ ┌────────┴────────────────────────────────────────────────┐ │
│ │ Vertical Workers (Long-lived) │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌─────────────┐│ │
│ │ │ Android │ │ Rust/Native │ │ Web/JS ││ │
│ │ │ - apktool │ │ - AFL++ │ │ - Node.js ││ │
│ │ │ - Frida │ │ - cargo-fuzz │ │ - OWASP ZAP ││ │
│ │ │ - jadx │ │ - gdb │ │ - semgrep ││ │
│ │ │ - MobSF │ │ - valgrind │ │ - eslint ││ │
│ │ └───────────────┘ └───────────────┘ └─────────────┘│ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ iOS │ │ Blockchain │ │ │
│ │ │ - class-dump │ │ - mythril │ │ │
│ │ │ - Clutch │ │ - slither │ │ │
│ │ │ - Frida │ │ - echidna │ │ │
│ │ │ - Hopper │ │ - manticore │ │ │
│ │ └───────────────┘ └───────────────┘ │ │
│ │ │ │
│ │ All workers have: │ │
│ │ - /app/toolbox mounted (workflow code) │ │
│ │ - /cache for MinIO downloads │ │
│ │ - Dynamic workflow discovery at startup │ │
│ └──────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
```
### Service Breakdown
```yaml
services:
  temporal:        # Workflow orchestration + embedded SQLite (dev) or Postgres (prod)
  minio:           # S3-compatible storage for targets and results
  minio-setup:     # One-time: create buckets, set policies
  worker-android:  # Android security vertical (scales independently)
  worker-rust:     # Rust/native security vertical
  worker-web:      # Web security vertical
  # Additional verticals as needed: ios, blockchain, go, etc.

# Total: 6+ services (scales with verticals)
```
### Resource Usage
```
Temporal: ~500MB (includes embedded DB in dev)
MinIO: ~256MB (with CI_CD=true flag)
MinIO-setup: ~20MB (ephemeral, exits after setup)
Worker-android: ~512MB (varies by toolchain)
Worker-rust: ~512MB
Worker-web: ~512MB
─────────────────────────
Total: ~2.3GB (vs 1.85GB Prefect = +24%)
Note: +450MB overhead is worth it for:
- Unified dev/prod architecture
- No host filesystem mounts (security)
- Auto cleanup (lifecycle policies)
- Multi-host ready
```
---
## Vertical Worker Model
### Concept
Instead of generic workers that spawn workflow-specific containers, we have **specialized long-lived workers** pre-built with complete security toolchains for specific domains.
### Vertical Taxonomy
| Vertical | Tools Included | Use Cases | Workflows |
|----------|---------------|-----------|-----------|
| **android** | apktool, jadx, Frida, MobSF, androguard | APK analysis, reverse engineering, dynamic instrumentation | APK security assessment, malware analysis, repackaging detection |
| **rust** | AFL++, cargo-fuzz, gdb, valgrind, AddressSanitizer | Native fuzzing, memory safety | Cargo fuzzing campaigns, binary analysis |
| **web** | Node.js, OWASP ZAP, Burp Suite, semgrep, eslint | Web app security testing | XSS detection, SQL injection scanning, API fuzzing |
| **ios** | class-dump, Clutch, Frida, Hopper, ios-deploy | iOS app analysis | IPA analysis, jailbreak detection, runtime hooking |
| **blockchain** | mythril, slither, echidna, manticore, solc | Smart contract security | Solidity static analysis, property-based fuzzing |
| **go** | go-fuzz, staticcheck, gosec, dlv | Go security testing | Go fuzzing, static analysis |
### Vertical Worker Architecture
```dockerfile
# Example: workers/android/Dockerfile
FROM python:3.11-slim

# Install Android SDK and tools
RUN apt-get update && apt-get install -y \
        openjdk-17-jdk \
        android-sdk \
    && rm -rf /var/lib/apt/lists/*

# Install security tools
RUN pip install --no-cache-dir \
        apktool \
        androguard \
        frida-tools \
        pyaxmlparser

# Install MobSF dependencies
RUN apt-get update && apt-get install -y \
        libxml2-dev \
        libxslt-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Temporal Python SDK
RUN pip install --no-cache-dir \
        temporalio \
        boto3 \
        pydantic

# Copy worker entrypoint
COPY worker.py /app/
WORKDIR /app

# Worker will mount /app/toolbox and discover workflows at runtime
CMD ["python", "worker.py"]
```
### Dynamic Workflow Discovery
```python
# workers/android/worker.py
import asyncio
import inspect
import logging
import os
from pathlib import Path

import yaml
from temporalio.client import Client
from temporalio.worker import Worker

logger = logging.getLogger(__name__)


async def discover_workflows(vertical: str):
    """Discover workflows for this vertical from the mounted toolbox"""
    workflows = []
    toolbox = Path("/app/toolbox/workflows")

    for workflow_dir in toolbox.iterdir():
        if not workflow_dir.is_dir():
            continue

        metadata_file = workflow_dir / "metadata.yaml"
        if not metadata_file.exists():
            continue

        # Parse metadata
        with open(metadata_file) as f:
            metadata = yaml.safe_load(f)

        # Check if workflow is for this vertical
        if metadata.get("vertical") == vertical:
            # Dynamically import workflow module
            workflow_module = f"toolbox.workflows.{workflow_dir.name}.workflow"
            module = __import__(workflow_module, fromlist=[''])

            # Find @workflow.defn decorated classes
            for name, obj in inspect.getmembers(module, inspect.isclass):
                if hasattr(obj, '__temporal_workflow_definition'):
                    workflows.append(obj)
                    logger.info(f"Discovered workflow: {name} for vertical {vertical}")

    return workflows


async def main():
    vertical = os.getenv("WORKER_VERTICAL", "android")
    temporal_address = os.getenv("TEMPORAL_ADDRESS", "localhost:7233")

    # Discover workflows for this vertical
    workflows = await discover_workflows(vertical)
    if not workflows:
        logger.warning(f"No workflows found for vertical: {vertical}")
        return

    # Connect to Temporal
    client = await Client.connect(temporal_address)

    # Start worker with discovered workflows
    worker = Worker(
        client,
        task_queue=f"{vertical}-queue",
        workflows=workflows,
        activities=[
            get_target_activity,
            cleanup_cache_activity,
            # ... vertical-specific activities
        ]
    )
    logger.info(f"Worker started for vertical '{vertical}' with {len(workflows)} workflows")
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```
### Workflow Declaration
```yaml
# toolbox/workflows/android_apk_analysis/metadata.yaml
name: android_apk_analysis
version: 1.0.0
description: "Deep analysis of Android APK files"
vertical: android  # ← Routes to worker-android
dependencies:
  python:
    - androguard==4.1.0    # Additional Python deps (optional)
    - pyaxmlparser==0.3.28
```
```python
# toolbox/workflows/android_apk_analysis/workflow.py
from datetime import timedelta

from temporalio import workflow


@workflow.defn
class AndroidApkAnalysisWorkflow:
    """
    Comprehensive Android APK security analysis

    Runs in worker-android with apktool, Frida, jadx pre-installed
    """

    @workflow.run
    async def run(self, target_id: str) -> dict:
        # Activity 1: Download target from MinIO
        apk_path = await workflow.execute_activity(
            "get_target",
            target_id,
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Activity 2: Extract manifest (uses apktool - pre-installed)
        manifest = await workflow.execute_activity(
            "extract_manifest",
            apk_path,
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Activity 3: Static analysis (uses jadx - pre-installed)
        static_results = await workflow.execute_activity(
            "static_analysis",
            apk_path,
            start_to_close_timeout=timedelta(minutes=30)
        )

        # Activity 4: Frida instrumentation (uses Frida - pre-installed)
        dynamic_results = await workflow.execute_activity(
            "dynamic_analysis",
            apk_path,
            start_to_close_timeout=timedelta(hours=2)
        )

        # Activity 5: Cleanup local cache
        await workflow.execute_activity(
            "cleanup_cache",
            apk_path,
            start_to_close_timeout=timedelta(minutes=1)
        )

        return {
            "manifest": manifest,
            "static": static_results,
            "dynamic": dynamic_results
        }
```
---
## Storage Strategy (MinIO)
### Why MinIO?
**Goal:** Unified storage that works identically in dev and production, eliminating environment-specific code.
**Alternatives considered:**
1. **LocalVolumeStorage** (mount /Users, /home): Security risk, platform-specific, doesn't scale
2. **Different storage per environment**: Complex, error-prone, dual maintenance
3. **MinIO everywhere**: Lightweight (+256MB), S3-compatible, multi-host ready
### MinIO Configuration
```yaml
# docker-compose.yaml
services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"  # S3 API
      - "9001:9001"  # Web Console (http://localhost:9001)
    volumes:
      - minio_data:/data
    environment:
      MINIO_ROOT_USER: fuzzforge
      MINIO_ROOT_PASSWORD: fuzzforge123
      MINIO_CI_CD: "true"  # Reduces memory to 256MB (from 1GB)
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 5s
      timeout: 5s
      retries: 5

  # One-time setup: create buckets and set lifecycle policies
  minio-setup:
    image: minio/mc:latest
    depends_on:
      minio:
        condition: service_healthy
    entrypoint: >
      /bin/sh -c "
      mc alias set fuzzforge http://minio:9000 fuzzforge fuzzforge123;
      mc mb fuzzforge/targets --ignore-existing;
      mc mb fuzzforge/results --ignore-existing;
      mc ilm add fuzzforge/targets --expiry-days 7;
      mc anonymous set download fuzzforge/results;
      "
```
### Storage Backend Implementation
```python
# backend/src/storage/s3_cached.py
import logging
import os
from datetime import datetime
from pathlib import Path
from uuid import uuid4

import boto3

logger = logging.getLogger(__name__)


class S3CachedStorage:
    """
    S3-compatible storage with local caching.

    Works with MinIO (dev/prod) or AWS S3 (cloud).
    """

    def __init__(self):
        self.s3 = boto3.client(
            's3',
            endpoint_url=os.getenv('S3_ENDPOINT', 'http://minio:9000'),
            aws_access_key_id=os.getenv('S3_ACCESS_KEY', 'fuzzforge'),
            aws_secret_access_key=os.getenv('S3_SECRET_KEY', 'fuzzforge123')
        )
        self.bucket = os.getenv('S3_BUCKET', 'targets')
        self.cache_dir = Path(os.getenv('CACHE_DIR', '/cache'))
        self.cache_max_size = self._parse_size(os.getenv('CACHE_MAX_SIZE', '10GB'))
        self.cache_ttl = self._parse_duration(os.getenv('CACHE_TTL', '7d'))

    async def upload_target(self, file_path: Path, user_id: str) -> str:
        """Upload target to MinIO and return target ID"""
        target_id = str(uuid4())

        # Upload with metadata for lifecycle management
        self.s3.upload_file(
            str(file_path),
            self.bucket,
            f'{target_id}/target',
            ExtraArgs={
                'Metadata': {
                    'user_id': user_id,
                    'uploaded_at': datetime.now().isoformat(),
                    'filename': file_path.name
                }
            }
        )

        logger.info(f"Uploaded target {target_id} ({file_path.name})")
        return target_id

    async def get_target(self, target_id: str) -> Path:
        """
        Get target from cache or download from MinIO.
        Returns local path to cached file.
        """
        cache_path = self.cache_dir / target_id
        cached_file = cache_path / "target"

        # Check cache
        if cached_file.exists():
            # Update access time for LRU
            cached_file.touch()
            logger.info(f"Cache hit: {target_id}")
            return cached_file

        # Cache miss - download from MinIO
        logger.info(f"Cache miss: {target_id}, downloading from MinIO")
        cache_path.mkdir(parents=True, exist_ok=True)
        self.s3.download_file(
            self.bucket,
            f'{target_id}/target',
            str(cached_file)
        )
        return cached_file

    async def cleanup_cache(self):
        """LRU eviction when cache exceeds max size"""
        cache_files = []
        total_size = 0

        for cache_file in self.cache_dir.rglob('*'):
            if cache_file.is_file():
                stat = cache_file.stat()
                cache_files.append({
                    'path': cache_file,
                    'size': stat.st_size,
                    'atime': stat.st_atime
                })
                total_size += stat.st_size

        if total_size > self.cache_max_size:
            # Sort by access time (oldest first)
            cache_files.sort(key=lambda x: x['atime'])
            for file_info in cache_files:
                if total_size <= self.cache_max_size:
                    break
                file_info['path'].unlink()
                total_size -= file_info['size']
                logger.info(f"Evicted from cache: {file_info['path']}")
```
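The `_parse_size` and `_parse_duration` helpers referenced in `__init__` are not shown above; a minimal standalone sketch of what they could look like (assumed behavior, not the actual backend code):

```python
def parse_size(value: str) -> int:
    """Parse sizes like '10GB' or '256MB' into bytes."""
    units = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}
    value = value.strip().upper()
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(float(value[:-len(suffix)]) * factor)
    return int(value)  # plain byte count

def parse_duration(value: str) -> int:
    """Parse durations like '7d' or '12h' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    value = value.strip().lower()
    return int(float(value[:-1]) * units[value[-1]])
```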
### Performance Characteristics
| Operation | Direct Filesystem | MinIO (Local) | Impact |
|-----------|------------------|---------------|---------|
| Small file (<1MB) | ~1ms | ~5-10ms | Negligible for security workflows |
| Large file (>100MB) | ~200ms | ~220ms | ~10% overhead |
| Workflow duration | 5-60 minutes | 5-60 minutes + 2-4s upload | <1% overhead |
| Subsequent scans | Same | **Cached (0ms)** | Better than filesystem |
**Verdict:** 2-4 second upload overhead is **negligible** for workflows that run 5-60 minutes.
### Workspace Isolation
To support concurrent workflows safely, FuzzForge implements workspace isolation with three modes:
**1. Isolated Mode (Default)**
```python
# Each workflow run gets its own workspace
cache_path = f"/cache/{target_id}/{run_id}/workspace/"
```
- **Use for:** Fuzzing workflows that modify files (corpus, crashes)
- **Advantages:** Safe for concurrent execution, no file conflicts
- **Cleanup:** Entire run directory removed after workflow completes
**2. Shared Mode**
```python
# All runs share the same workspace
cache_path = f"/cache/{target_id}/workspace/"
```
- **Use for:** Read-only analysis workflows (security scanning, static analysis)
- **Advantages:** Efficient (downloads once), lower bandwidth/storage
- **Cleanup:** No cleanup (workspace persists for reuse)
**3. Copy-on-Write Mode**
```python
# Download once to shared location, copy per run
shared_cache = f"/cache/{target_id}/shared/workspace/"
run_cache = f"/cache/{target_id}/{run_id}/workspace/"
```
- **Use for:** Large targets that need isolation
- **Advantages:** Download once, isolated per-run execution
- **Cleanup:** Run-specific copies removed, shared cache persists
**Configuration:**
Workflows specify isolation mode in `metadata.yaml`:
```yaml
name: atheris_fuzzing
workspace_isolation: "isolated" # or "shared" or "copy-on-write"
```
Workers automatically handle download, extraction, and cleanup based on the mode.
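The three modes above boil down to how the cache path is resolved per run; a sketch of that resolution (function name hypothetical, paths taken from the examples above):

```python
from pathlib import PurePosixPath

def workspace_path(mode: str, target_id: str, run_id: str) -> str:
    """Resolve the cache path for a workflow run given its isolation mode."""
    base = PurePosixPath("/cache") / target_id
    if mode == "isolated":
        return str(base / run_id / "workspace")
    if mode == "shared":
        return str(base / "workspace")
    if mode == "copy-on-write":
        # download lands in shared/, then is copied to this per-run path
        return str(base / run_id / "workspace")
    raise ValueError(f"unknown workspace_isolation mode: {mode}")
```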
---
## Dynamic Workflow Loading
### The Problem
**Requirement:** Workflows must be dynamically added without modifying the codebase or rebuilding Docker images.
**Traditional approach (doesn't work):**
- Build Docker image per workflow with dependencies
- Push to registry
- Worker pulls and spawns container
- ❌ Requires rebuild for every workflow change
- ❌ Registry overhead
- ❌ Slow (5-10s startup per workflow)
**Our approach (works):**
- Workflow code mounted as volume into long-lived workers
- Workers scan `/app/toolbox/workflows` at startup
- Dynamically import and register workflows matching vertical
- ✅ No rebuild needed
- ✅ No registry
- ✅ Zero startup overhead
### Implementation
**1. Docker Compose volume mount:**
```yaml
worker-android:
  volumes:
    - ./toolbox:/app/toolbox:ro  # Mount workflow code as read-only
```
**2. Worker discovers workflows:**
```python
# Runs at worker startup
for workflow_dir in Path("/app/toolbox/workflows").iterdir():
    metadata = yaml.safe_load((workflow_dir / "metadata.yaml").read_text())

    # Only load workflows for this vertical
    if metadata.get("vertical") == os.getenv("WORKER_VERTICAL"):
        # Dynamically import workflow.py
        module = importlib.import_module(f"toolbox.workflows.{workflow_dir.name}.workflow")
        # Find @workflow.defn classes
        workflows.append(module.MyWorkflowClass)
```
**3. Developer adds workflow:**
```bash
# 1. Create workflow directory
mkdir -p toolbox/workflows/my_new_workflow
# 2. Write metadata
cat > toolbox/workflows/my_new_workflow/metadata.yaml <<EOF
vertical: android
EOF
# 3. Write workflow
cat > toolbox/workflows/my_new_workflow/workflow.py <<EOF
from temporalio import workflow
@workflow.defn
class MyNewWorkflow:
@workflow.run
async def run(self, target_id: str):
# Implementation
pass
EOF
# 4. Restart worker to pick up new workflow
docker-compose restart worker-android
# Done! No image building, no registry push
```
### Hot Reload (Optional Advanced Feature)
```python
# Worker watches /app/toolbox for file changes
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler


class WorkflowReloadHandler(FileSystemEventHandler):
    def on_modified(self, event):
        if event.src_path.endswith(".py"):
            logger.info(f"Detected workflow change: {event.src_path}")
            # Reload workflow without restarting worker
            reload_workflows()


observer = Observer()
observer.schedule(WorkflowReloadHandler(), "/app/toolbox/workflows", recursive=True)
observer.start()
```
---
## Architecture Principles
### 1. Vertical Specialization
**Principle:** Each worker is specialized for a security domain with pre-built toolchains.
**Benefits:**
- Faster workflow execution (tools already installed)
- Better resource utilization (long-lived workers)
- Clear marketing positioning (sell verticals, not orchestration)
- Easier development (known toolchain per vertical)
### 2. Unified Storage
**Principle:** Same storage backend (MinIO) in development and production.
**Benefits:**
- No environment-specific code
- Easier testing (dev = prod)
- Multi-host ready from day one
- Better security (no host mounts)
### 3. Dynamic Workflow Discovery
**Principle:** Workflows are discovered and loaded at runtime, not compile-time.
**Benefits:**
- Add workflows without rebuilding images
- No registry overhead
- Faster iteration for developers
- Supports user-contributed workflows
### 4. Environment-Driven Configuration
**Principle:** All configuration via environment variables, no hardcoded values.
**Required Variables:**
```bash
# Worker configuration
TEMPORAL_ADDRESS=temporal:7233
WORKER_VERTICAL=android
MAX_CONCURRENT_ACTIVITIES=5
# Storage configuration
S3_ENDPOINT=http://minio:9000
S3_ACCESS_KEY=fuzzforge
S3_SECRET_KEY=fuzzforge123
S3_BUCKET=targets
# Cache configuration
CACHE_DIR=/cache
CACHE_MAX_SIZE=10GB
CACHE_TTL=7d
```
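Values like `10GB` and `7d` are human-readable, so the worker has to normalize them on startup. A hypothetical helper for that parsing (the suffix rules and function names are assumptions, not the actual FuzzForge code):

```python
# Hypothetical parsers for the human-readable env values above; the supported
# suffixes are illustrative assumptions.
import os

def parse_size(value: str) -> int:
    """Convert '10GB' / '512MB' style strings to bytes."""
    units = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}
    for suffix, factor in units.items():
        if value.upper().endswith(suffix):
            return int(float(value[:-len(suffix)])) * factor
    return int(value)  # plain byte count

def parse_ttl(value: str) -> int:
    """Convert '7d' / '12h' / '30m' style strings to seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    if value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)  # plain second count

# Defaults match the fail-safe values described below
CACHE_MAX_SIZE = parse_size(os.getenv("CACHE_MAX_SIZE", "10GB"))
CACHE_TTL = parse_ttl(os.getenv("CACHE_TTL", "7d"))
```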
### 5. Fail-Safe Defaults
**Principle:** System works out-of-the-box with sensible defaults.
**Examples:**
- MinIO CI_CD mode (lightweight for dev)
- 7-day lifecycle policy (auto-cleanup)
- 10GB cache limit (prevent disk exhaustion)
- Embedded SQLite for Temporal (no Postgres in dev)
---
## Component Details
### Temporal Server
**Deployment Options:**
| Environment | Database | Memory | Notes |
|-------------|----------|--------|-------|
| Development | SQLite (embedded) | 500MB | Simple, no external DB |
| Production | PostgreSQL | 2GB | Clustered for HA |
**Configuration:**
```yaml
temporal:
  image: temporalio/auto-setup:latest
  ports:
    - "7233:7233"  # gRPC
    - "8233:8233"  # Web UI
  environment:
    - DB=sqlite  # or postgresql for prod
    - SQLITE_PRAGMA_journal_mode=WAL
```
### MinIO
**Resource Usage:**
- Memory: 256MB (CI_CD mode) to 1GB (production)
- CPU: Minimal (I/O bound)
- Disk: Depends on usage (recommend 100GB+)
**Configuration:**
```yaml
minio:
  image: minio/minio:latest
  environment:
    MINIO_CI_CD: "true"  # Lightweight mode
    MINIO_ROOT_USER: fuzzforge
    MINIO_ROOT_PASSWORD: fuzzforge123
```
**Web Console:** http://localhost:9001
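For reference, a hedged sketch of how a client might push a target into the `targets` bucket with boto3. The function name and `<target_id>/target` key layout are illustrative, and the credentials are the dev defaults from the config above, not necessarily what the actual FuzzForge client does.

```python
# Hypothetical upload helper -- endpoint, credentials, and key layout follow
# the dev defaults above, not necessarily the real FuzzForge client code.
def upload_target(path: str, target_id: str,
                  endpoint: str = "http://localhost:9000") -> str:
    """Upload a target tarball to the 'targets' bucket; return its object key."""
    import boto3  # deferred import so the sketch stands alone

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint,  # MinIO S3 API port (console is on 9001)
        aws_access_key_id="fuzzforge",
        aws_secret_access_key="fuzzforge123",
    )
    key = f"{target_id}/target"
    s3.upload_file(path, "targets", key)
    return key
```

Workers later `download_file` the same key into their local cache, so the upload path and the cache path share one storage convention.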
### Vertical Workers
**Base Requirements:**
- Python 3.11+
- Temporal Python SDK
- boto3 (S3 client)
- Domain-specific tools
**Scaling:**
```yaml
# Scale vertically (more concurrent activities per worker)
environment:
  MAX_CONCURRENT_ACTIVITIES: 10  # Default: 5
```
```bash
# Scale horizontally (more workers)
docker-compose up -d --scale worker-android=3
```
---
## Scaling Strategy
### Phase 1: Single Host (Now - 6 months)
**Configuration:**
```yaml
# 1 Temporal + 1 MinIO + 3-5 vertical workers
# Capacity: 15-50 concurrent workflows
# Cost: ~$430/month
```
**When to move to Phase 2:** Saturating single host (CPU >80%, memory >90%)
### Phase 2: Multi-Host (6-18 months)
**Configuration:**
```
Host 1: Temporal + MinIO
Host 2: 5× worker-android
Host 3: 5× worker-rust
Host 4: 5× worker-web
```
**Changes required:**
```yaml
# Point all workers to central Temporal/MinIO
environment:
  TEMPORAL_ADDRESS: temporal.prod.fuzzforge.ai:7233
  S3_ENDPOINT: http://minio.prod.fuzzforge.ai:9000
```
**Capacity:** 3× Phase 1 = 45-150 concurrent workflows
### Phase 3: Nomad Cluster (18+ months, if needed)
**Trigger Points:**
- Managing 10+ hosts manually
- Need auto-scaling based on queue depth
- Need multi-tenancy (customer namespaces)
**Migration effort:** 1-2 weeks (workers unchanged, just change deployment method)
---
## File Lifecycle Management
### Automatic Cleanup via MinIO Lifecycle Policies
```bash
# Set on bucket (done by minio-setup service)
mc ilm add fuzzforge/targets --expiry-days 7
# MinIO automatically deletes objects older than 7 days
```
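Where `mc` is not available, the same policy can be set programmatically. A hedged boto3 equivalent of the command above (the rule ID is illustrative and the credentials are the dev defaults; this is not the actual minio-setup service code):

```python
# Hypothetical boto3 equivalent of `mc ilm add fuzzforge/targets --expiry-days 7`.
# Rule ID and credentials are illustrative dev values.
def set_target_expiry(days: int = 7, endpoint: str = "http://localhost:9000") -> None:
    """Expire objects in the 'targets' bucket after `days` days."""
    import boto3  # deferred import so the sketch stands alone

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id="fuzzforge",
        aws_secret_access_key="fuzzforge123",
    )
    s3.put_bucket_lifecycle_configuration(
        Bucket="targets",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-targets",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},      # apply to every object
                "Expiration": {"Days": days},  # MinIO deletes after N days
            }]
        },
    )
```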
### Local Cache Eviction (LRU)
```python
import asyncio

# Worker background task (runs every 30 minutes)
async def cleanup_cache_task():
    while True:
        await storage.cleanup_cache()  # LRU eviction
        await asyncio.sleep(1800)      # 30 minutes
```
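The `storage.cleanup_cache()` call above can be sketched as a plain least-recently-used sweep over the cache directory. This is a minimal illustration, assuming mtime approximates recency and a flat byte limit; the real S3CachedStorage may track usage differently.

```python
# Minimal LRU eviction sketch for the worker cache directory. The mtime-based
# ordering and byte limit are assumptions, not the actual S3CachedStorage code.
import os
from pathlib import Path

def cleanup_cache(cache_dir: str, max_bytes: int) -> int:
    """Delete least-recently-used files until the cache fits under max_bytes.

    Returns the number of files evicted.
    """
    files = sorted(
        (p for p in Path(cache_dir).rglob("*") if p.is_file()),
        key=lambda p: p.stat().st_mtime,  # oldest (least recently used) first
    )
    total = sum(p.stat().st_size for p in files)
    evicted = 0
    for path in files:
        if total <= max_bytes:
            break
        size = path.stat().st_size
        path.unlink()
        total -= size
        evicted += 1
    return evicted
```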
### Manual Deletion (API)
```python
@app.delete("/api/targets/{target_id}")
async def delete_target(target_id: str):
    """Allow users to manually delete uploaded targets"""
    s3.delete_object(Bucket="targets", Key=f"{target_id}/target")
    return {"status": "deleted"}
```
### Retention Policies
| Object Type | Default TTL | Configurable | Notes |
|-------------|-------------|--------------|-------|
| Uploaded targets | 7 days | Yes (env var) | Auto-deleted by MinIO |
| Worker cache | LRU (10GB limit) | Yes | Evicted when cache full |
| Workflow results | 30 days (optional) | Yes | Can store in MinIO |
---
## Future: Nomad Migration
### When to Add Nomad?
**Trigger points:**
- Managing 10+ hosts manually becomes painful
- Need auto-scaling based on queue depth
- Need multi-tenancy with resource quotas
- Want sophisticated scheduling (bin-packing, affinity rules)
**Estimated timing:** 18-24 months
### Migration Complexity
**Effort:** 1-2 weeks
**What changes:**
- Deployment method (docker-compose → Nomad jobs)
- Orchestration layer (manual → Nomad scheduler)
**What stays the same:**
- Worker Docker images (unchanged)
- Workflows (unchanged)
- Temporal (unchanged)
- MinIO (unchanged)
- Storage backend (unchanged)
### Nomad Job Example
```hcl
job "fuzzforge-worker-android" {
  datacenters = ["dc1"]
  type        = "service"

  group "workers" {
    count = 5  # Auto-scales based on queue depth

    scaling {
      min = 1
      max = 20

      policy {
        evaluation_interval = "30s"

        check "queue_depth" {
          source = "prometheus"
          query  = "temporal_queue_depth{queue='android-queue'}"

          strategy "target-value" {
            target = 10  # Scale up if >10 tasks queued
          }
        }
      }
    }

    task "worker" {
      driver = "docker"

      config {
        image = "fuzzforge/worker-android:latest"
        volumes = [
          "/opt/fuzzforge/toolbox:/app/toolbox:ro"
        ]
      }

      env {
        TEMPORAL_ADDRESS = "temporal.service.consul:7233"
        WORKER_VERTICAL  = "android"
        S3_ENDPOINT      = "http://minio.service.consul:9000"
      }

      resources {
        cpu    = 500  # MHz
        memory = 512  # MB
      }
    }
  }
}
```
### Licensing Considerations
**Nomad BSL 1.1 Risk:** Depends on FuzzForge positioning
**Safe positioning (LOW risk):**
- ✅ Market as "Android/Rust/Web security verticals"
- ✅ Emphasize domain expertise, not orchestration
- ✅ Nomad is internal infrastructure
- ✅ Customers buy security services, not Nomad
**Risky positioning (MEDIUM risk):**
- ⚠️ Market as "generic workflow orchestration platform"
- ⚠️ Emphasize flexibility over domain expertise
- ⚠️ Could be seen as competing with HashiCorp
**Mitigation:**
- Keep marketing focused on security verticals
- Get legal review before Phase 3
- Alternative: Use Kubernetes (Apache 2.0, zero risk)
---
## Migration Timeline
### Phase 1: Foundation (Weeks 1-2)
- ✅ Create feature branch
- Set up Temporal docker-compose
- Add MinIO service
- Implement S3CachedStorage backend
- Create cleanup/lifecycle logic
### Phase 2: First Vertical Worker (Weeks 3-4)
- Design worker base template
- Create worker-rust with AFL++, cargo-fuzz
- Implement dynamic workflow discovery
- Test workflow loading from mounted volume
### Phase 3: Migrate Workflows (Weeks 5-6)
- Port security_assessment workflow to Temporal
- Update workflow metadata format
- Test end-to-end flow (upload → analyze → results)
- Verify cleanup/lifecycle
### Phase 4: Additional Verticals (Weeks 7-8)
- Create worker-android, worker-web
- Document vertical development guide
- Update CLI for MinIO uploads
- Update backend API for Temporal
### Phase 5: Testing & Docs (Weeks 9-10)
- Comprehensive testing
- Update README
- Migration guide for existing users
- Troubleshooting documentation
**Total: 10 weeks, rollback possible at any phase**
---
## Decision Log
### 2025-09-30: Initial Architecture Decision
- **Decision:** Migrate from Prefect to Temporal
- **Rationale:** Simpler infrastructure, better reliability, clear scaling path
### 2025-10-01: Vertical Worker Model
- **Decision:** Use long-lived vertical workers instead of ephemeral per-workflow containers
- **Rationale:**
- Zero startup overhead (5s saved per workflow)
- Pre-built toolchains (Android, Rust, Web, etc.)
- Dynamic workflows via mounted volumes (no image rebuild)
- Better marketing (sell verticals, not orchestration)
- Safer Nomad BSL positioning
### 2025-10-01: Unified MinIO Storage
- **Decision:** Use MinIO for both dev and production (no LocalVolumeStorage)
- **Rationale:**
- Unified codebase (no environment-specific code)
- Lightweight (256MB with CI_CD=true)
- Negligible overhead (2-4s for 250MB upload)
- Better security (no host filesystem mounts)
- Multi-host ready
- Automatic cleanup via lifecycle policies
### 2025-10-01: Dynamic Workflow Loading
- **Decision:** Mount workflow code as volume, discover at runtime
- **Rationale:**
- Add workflows without rebuilding images
- No registry overhead
- Supports user-contributed workflows
- Faster iteration for developers
---
**Document Version:** 2.0
**Last Updated:** 2025-10-01
**Next Review:** After Phase 1 implementation (2 weeks)