fuzzforge_ai/ARCHITECTURE.md
tduhamel42 60ca088ecf CI/CD Integration with Ephemeral Deployment Model (#14)
* feat: Complete migration from Prefect to Temporal

BREAKING CHANGE: Replaces Prefect workflow orchestration with Temporal

## Major Changes
- Replace Prefect with Temporal for workflow orchestration
- Implement vertical worker architecture (rust, android)
- Replace Docker registry with MinIO for unified storage
- Refactor activities to be co-located with workflows
- Update all API endpoints for Temporal compatibility

## Infrastructure
- New: docker-compose.temporal.yaml (Temporal + MinIO + workers)
- New: workers/ directory with rust and android vertical workers
- New: backend/src/temporal/ (manager, discovery)
- New: backend/src/storage/ (S3-cached storage with MinIO)
- New: backend/toolbox/common/ (shared storage activities)
- Deleted: docker-compose.yaml (old Prefect setup)
- Deleted: backend/src/core/prefect_manager.py
- Deleted: backend/src/services/prefect_stats_monitor.py
- Deleted: Docker registry and insecure-registries requirement

## Workflows
- Migrated: security_assessment workflow to Temporal
- New: rust_test workflow (example/test workflow)
- Deleted: secret_detection_scan (Prefect-based, to be reimplemented)
- Activities now co-located with workflows for independent testing

## API Changes
- Updated: backend/src/api/workflows.py (Temporal submission)
- Updated: backend/src/api/runs.py (Temporal status/results)
- Updated: backend/src/main.py (727 lines, TemporalManager integration)
- Updated: All 16 MCP tools to use TemporalManager

## Testing
- All services healthy (Temporal, PostgreSQL, MinIO, workers, backend)
- All API endpoints functional
- End-to-end workflow test passed (72 findings from vulnerable_app)
- MinIO storage integration working (target upload/download, results)
- Worker activity discovery working (6 activities registered)
- Tarball extraction working
- SARIF report generation working

## Documentation
- ARCHITECTURE.md: Complete Temporal architecture documentation
- QUICKSTART_TEMPORAL.md: Getting started guide
- MIGRATION_DECISION.md: Why we chose Temporal over Prefect
- IMPLEMENTATION_STATUS.md: Migration progress tracking
- workers/README.md: Worker development guide

## Dependencies
- Added: temporalio>=1.6.0
- Added: boto3>=1.34.0 (MinIO S3 client)
- Removed: prefect>=3.4.18

* feat: Add Python fuzzing vertical with Atheris integration

This commit implements a complete Python fuzzing workflow using Atheris:

## Python Worker (workers/python/)
- Dockerfile with Python 3.11, Atheris, and build tools
- Generic worker.py for dynamic workflow discovery
- requirements.txt with temporalio, boto3, atheris dependencies
- Added to docker-compose.temporal.yaml with dedicated cache volume

## AtherisFuzzer Module (backend/toolbox/modules/fuzzer/)
- Reusable module extending BaseModule
- Auto-discovers fuzz targets (fuzz_*.py, *_fuzz.py, fuzz_target.py)
- Recursive search to find targets in nested directories
- Dynamically loads TestOneInput() function
- Configurable max_iterations and timeout
- Real-time stats callback support for live monitoring
- Returns findings as ModuleFinding objects

## Atheris Fuzzing Workflow (backend/toolbox/workflows/atheris_fuzzing/)
- Temporal workflow for orchestrating fuzzing
- Downloads user code from MinIO
- Executes AtherisFuzzer module
- Uploads results to MinIO
- Cleans up cache after execution
- metadata.yaml with vertical: python for routing

## Test Project (test_projects/python_fuzz_waterfall/)
- Demonstrates stateful waterfall vulnerability
- main.py with check_secret() that leaks progress
- fuzz_target.py with Atheris TestOneInput() harness
- Complete README with usage instructions

## Backend Fixes
- Fixed parameter merging in REST API endpoints (workflows.py)
- Changed workflow parameter passing from positional args to kwargs (manager.py)
- Default parameters now properly merged with user parameters
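The merge described above can be sketched in a couple of lines (the function name is hypothetical; the actual logic lives in workflows.py):

```python
def merge_parameters(defaults: dict, user_params: dict) -> dict:
    """Merge workflow default_parameters with user-supplied parameters.

    User values take precedence over defaults; neither input is mutated.
    """
    return {**defaults, **user_params}

# A default max_iterations survives unless the user overrides it
merged = merge_parameters(
    {"max_iterations": 100_000, "timeout_seconds": 300},
    {"timeout_seconds": 600},
)
```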

## Testing
- Worker discovered AtherisFuzzingWorkflow
- Workflow executed end-to-end successfully
- Fuzz target auto-discovered in nested directories
- Atheris ran 100,000 iterations
- Results uploaded and cache cleaned

* chore: Complete Temporal migration with updated CLI/SDK/docs

This commit includes all remaining Temporal migration changes:

## CLI Updates (cli/)
- Updated workflow execution commands for Temporal
- Enhanced error handling and exceptions
- Updated dependencies in uv.lock

## SDK Updates (sdk/)
- Client methods updated for Temporal workflows
- Updated models for new workflow execution
- Updated dependencies in uv.lock

## Documentation Updates (docs/)
- Architecture documentation for Temporal
- Workflow concept documentation
- Resource management documentation (new)
- Debugging guide (new)
- Updated tutorials and how-to guides
- Troubleshooting updates

## README Updates
- Main README with Temporal instructions
- Backend README
- CLI README
- SDK README

## Other
- Updated IMPLEMENTATION_STATUS.md
- Removed old vulnerable_app.tar.gz

These changes complete the Temporal migration and ensure the
CLI/SDK work correctly with the new backend.

* fix: Use positional args instead of kwargs for Temporal workflows

The Temporal Python SDK's start_workflow() method doesn't accept
a 'kwargs' parameter. Workflows must receive parameters as positional
arguments via the 'args' parameter.

Changed to:
  args=workflow_args  # Positional arguments

This fixes the error:
  TypeError: Client.start_workflow() got an unexpected keyword argument 'kwargs'

Workflows now correctly receive parameters in order:
- security_assessment: [target_id, scanner_config, analyzer_config, reporter_config]
- atheris_fuzzing: [target_id, target_file, max_iterations, timeout_seconds]
- rust_test: [target_id, test_message]
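One way to flatten a parameter dict into the ordered positional list is sketched below (the parameter orders come from this commit message; the helper name and lookup table are illustrative, not the actual manager.py code):

```python
# Each workflow declares the order in which it expects its parameters.
WORKFLOW_ARG_ORDER = {
    "security_assessment": ["target_id", "scanner_config", "analyzer_config", "reporter_config"],
    "atheris_fuzzing": ["target_id", "target_file", "max_iterations", "timeout_seconds"],
    "rust_test": ["target_id", "test_message"],
}

def build_workflow_args(workflow_name: str, params: dict) -> list:
    """Flatten a parameter dict into the positional list Temporal expects."""
    order = WORKFLOW_ARG_ORDER[workflow_name]
    return [params[name] for name in order]

# The resulting list would then be passed via start_workflow(..., args=workflow_args)
workflow_args = build_workflow_args(
    "rust_test", {"test_message": "hello", "target_id": "abc123"}
)
```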

* fix: Filter metadata-only parameters from workflow arguments

SecurityAssessmentWorkflow was receiving 7 arguments instead of 2-5.
The issue was that target_path and volume_mode from default_parameters
were being passed to the workflow, when they should only be used by
the system for configuration.

Now filters out metadata-only parameters (target_path, volume_mode)
before passing arguments to workflow execution.
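The filtering step amounts to dropping a fixed set of system-level keys; a minimal sketch (names assumed from this commit, not the actual manager.py implementation):

```python
# Parameters consumed by the platform itself, never passed to workflows.
METADATA_ONLY_PARAMS = {"target_path", "volume_mode"}

def filter_workflow_params(params: dict) -> dict:
    """Drop metadata-only keys before handing parameters to the workflow."""
    return {k: v for k, v in params.items() if k not in METADATA_ONLY_PARAMS}

filtered = filter_workflow_params(
    {"target_id": "abc", "volume_mode": "ro", "target_path": "/tmp/x", "timeout": 60}
)
```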

* refactor: Remove Prefect leftovers and volume mounting legacy

Complete cleanup of Prefect migration artifacts:

Backend:
- Delete registry.py and workflow_discovery.py (Prefect-specific files)
- Remove Docker validation from setup.py (no longer needed)
- Remove ResourceLimits and VolumeMount models
- Remove target_path and volume_mode from WorkflowSubmission
- Remove supported_volume_modes from API and discovery
- Clean up metadata.yaml files (remove volume/path fields)
- Simplify parameter filtering in manager.py

SDK:
- Remove volume_mode parameter from client methods
- Remove ResourceLimits and VolumeMount models
- Remove Prefect error patterns from docker_logs.py
- Clean up WorkflowSubmission and WorkflowMetadata models

CLI:
- Remove Volume Modes display from workflow info

All removed features are Prefect-specific or Docker volume mounting
artifacts. Temporal workflows use MinIO storage exclusively.

* feat: Add comprehensive test suite and benchmark infrastructure

- Add 68 unit tests for fuzzer, scanner, and analyzer modules
- Implement pytest-based test infrastructure with fixtures
- Add 6 performance benchmarks with category-specific thresholds
- Configure GitHub Actions for automated testing and benchmarking
- Add test and benchmark documentation

Test coverage:
- AtherisFuzzer: 8 tests
- CargoFuzzer: 14 tests
- FileScanner: 22 tests
- SecurityAnalyzer: 24 tests

All tests passing (68/68)
All benchmarks passing (6/6)

* fix: Resolve all ruff linting violations across codebase

Fixed 27 ruff violations in 12 files:
- Removed unused imports (Depends, Dict, Any, Optional, etc.)
- Fixed undefined workflow_info variable in workflows.py
- Removed dead code with undefined variables in atheris_fuzzer.py
- Changed f-string to regular string where no placeholders used

All files now pass ruff checks for CI/CD compliance.

* fix: Configure CI for unit tests only

- Renamed docker-compose.temporal.yaml → docker-compose.yml for CI compatibility
- Commented out integration-tests job (no integration tests yet)
- Updated test-summary to only depend on lint and unit-tests

CI will now run successfully with 68 unit tests. Integration tests can be added later.

* feat: Add CI/CD integration with ephemeral deployment model

Implements comprehensive CI/CD support for FuzzForge with on-demand worker management:

**Worker Management (v0.7.0)**
- Add WorkerManager for automatic worker lifecycle control
- Auto-start workers from stopped state when workflows execute
- Auto-stop workers after workflow completion
- Health checks and startup timeout handling (90s default)

**CI/CD Features**
- `--fail-on` flag: Fail builds based on SARIF severity levels (error/warning/note/info)
- `--export-sarif` flag: Export findings in SARIF 2.1.0 format
- `--auto-start`/`--auto-stop` flags: Control worker lifecycle
- Exit code propagation: Returns 1 on blocking findings, 0 on success
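A sketch of how a `--fail-on` gate over a SARIF 2.1.0 report could work (illustrative only; function and variable names are assumptions, not the actual CLI code):

```python
# SARIF 2.1.0 result levels, most severe first.
SARIF_LEVELS = ["error", "warning", "note", "info"]

def blocking_findings(sarif: dict, fail_on: str) -> int:
    """Count SARIF results at or above the given severity level."""
    threshold = SARIF_LEVELS.index(fail_on)
    count = 0
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            level = result.get("level", "warning")  # SARIF's default level
            if level in SARIF_LEVELS and SARIF_LEVELS.index(level) <= threshold:
                count += 1
    return count

report = {"runs": [{"results": [{"level": "error"}, {"level": "note"}]}]}
# Exit 1 if any finding meets or exceeds the --fail-on threshold
exit_code = 1 if blocking_findings(report, "warning") else 0
```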

**Exit Code Fix**
- Add `except typer.Exit: raise` handlers at 3 critical locations
- Move worker cleanup to finally block for guaranteed execution
- Exit codes now propagate correctly even when build fails

**CI Scripts & Examples**
- ci-start.sh: Start FuzzForge services with health checks
- ci-stop.sh: Clean shutdown with volume preservation option
- GitHub Actions workflow example (security-scan.yml)
- GitLab CI pipeline example (.gitlab-ci.example.yml)
- docker-compose.ci.yml: CI-optimized compose file with profiles

**OSS-Fuzz Integration**
- New ossfuzz_campaign workflow for running OSS-Fuzz projects
- OSS-Fuzz worker with Docker-in-Docker support
- Configurable campaign duration and project selection

**Documentation**
- Comprehensive CI/CD integration guide (docs/how-to/cicd-integration.md)
- Updated architecture docs with worker lifecycle details
- Updated workspace isolation documentation
- CLI README with worker management examples

**SDK Enhancements**
- Add get_workflow_worker_info() endpoint
- Worker vertical metadata in workflow responses

**Testing**
- All workflows tested: security_assessment, atheris_fuzzing, secret_detection, cargo_fuzzing
- All monitoring commands tested: stats, crashes, status, finding
- Full CI pipeline simulation verified
- Exit codes verified for success/failure scenarios

Ephemeral CI/CD model: ~3-4GB RAM, ~60-90s startup, runs entirely in CI containers.

* fix: Resolve ruff linting violations in CI/CD code

- Remove unused variables (run_id, defaults, result)
- Remove unused imports
- Fix f-string without placeholders

All CI/CD integration files now pass ruff checks.
2025-10-14 10:13:45 +02:00

# FuzzForge AI Architecture
**Last Updated:** 2025-10-01
**Status:** Approved Architecture Plan
**Current Phase:** Migration from Prefect to Temporal with Vertical Workers
---
## Table of Contents
1. [Executive Summary](#executive-summary)
2. [Current Architecture (Prefect)](#current-architecture-prefect)
3. [Target Architecture (Temporal + Vertical Workers)](#target-architecture-temporal--vertical-workers)
4. [Vertical Worker Model](#vertical-worker-model)
5. [Storage Strategy (MinIO)](#storage-strategy-minio)
6. [Dynamic Workflow Loading](#dynamic-workflow-loading)
7. [Architecture Principles](#architecture-principles)
8. [Component Details](#component-details)
9. [Scaling Strategy](#scaling-strategy)
10. [File Lifecycle Management](#file-lifecycle-management)
11. [Future: Nomad Migration](#future-nomad-migration)
12. [Migration Timeline](#migration-timeline)
13. [Decision Log](#decision-log)
---
## Executive Summary
### The Decision
**Replace Prefect with Temporal** using a **vertical worker architecture** where each worker is pre-built with domain-specific security toolchains (Android, Rust, Web, iOS, Blockchain, etc.). Use **MinIO** for unified storage across dev and production environments.
### Why This Change?
| Aspect | Current (Prefect) | Target (Temporal + Verticals) |
|--------|-------------------|-------------------------------|
| **Services** | 6 (Server, Postgres, Redis, Registry, Docker-proxy, Worker) | 6 (Temporal, MinIO, MinIO-setup, 3+ vertical workers) |
| **Orchestration** | Prefect (complex) | Temporal (simpler, more reliable) |
| **Worker Model** | Ephemeral containers per workflow | Long-lived vertical workers with pre-built toolchains |
| **Storage** | Docker Registry + volume mounts | MinIO (S3-compatible) with caching |
| **Dynamic Workflows** | Build image per workflow | Mount workflow code as volume (no rebuild) |
| **Target Access** | Host volume mounts (/Users, /home) | Upload to MinIO, download to cache |
| **Memory Usage** | ~1.85GB | ~2.3GB (+24%, worth it for benefits) |
### Key Benefits
1. **Vertical Specialization:** Pre-built toolchains (Android: Frida, apktool; Rust: AFL++, cargo-fuzz)
2. **Zero Startup Overhead:** Long-lived workers (no 5s container spawn per workflow)
3. **Dynamic Workflows:** Add workflows without rebuilding images (mount as volume)
4. **Unified Storage:** MinIO works identically in dev and prod (no environment-specific code)
5. **Better Security:** No host filesystem mounts, isolated uploaded targets
6. **Automatic Cleanup:** MinIO lifecycle policies handle file expiration
7. **Marketing Advantage:** Sell "security verticals" not "generic orchestration" (safer Nomad BSL positioning)
8. **Scalability:** Clear path from single-host to multi-host to Nomad cluster
---
## Current Architecture (Prefect)
### Infrastructure Components
```
┌─────────────────────────────────────────────────────────┐
│ Docker Compose Stack (6 services) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Prefect │ │ Postgres │ │ Redis │ │
│ │ Server │ │ (metadata) │ │ (queue) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Registry │ │ Docker Proxy │ │ Prefect │ │
│ │ (images) │ │ (isolation) │ │ Worker │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
```
### Pain Points
- **Complexity:** 6 services to manage, configure, and monitor
- **Registry overhead:** Must push/pull images for every workflow deployment
- **Volume mounting complexity:** job_variables configuration per workflow
- **Dynamic workflows:** Requires rebuilding and pushing Docker images
- **Scalability:** Unclear how to scale beyond single host
- **Resource usage:** ~1.85GB baseline
---
## Target Architecture (Temporal + Vertical Workers)
### Infrastructure Overview
```
┌───────────────────────────────────────────────────────────────┐
│ FuzzForge Platform │
│ │
│ ┌──────────────────┐ ┌─────────────────────────┐ │
│ │ Temporal Server │◄────────│ MinIO (S3 Storage) │ │
│ │ - Workflows │ │ - Uploaded targets │ │
│ │ - State mgmt │ │ - Results (optional) │ │
│ │ - Task queues │ │ - Lifecycle policies │ │
│ └────────┬─────────┘ └─────────────────────────┘ │
│ │ │
│ │ (Task queue routing) │
│ │ │
│ ┌────────┴────────────────────────────────────────────────┐ │
│ │ Vertical Workers (Long-lived) │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌─────────────┐│ │
│ │ │ Android │ │ Rust/Native │ │ Web/JS ││ │
│ │ │ - apktool │ │ - AFL++ │ │ - Node.js ││ │
│ │ │ - Frida │ │ - cargo-fuzz │ │ - OWASP ZAP ││ │
│ │ │ - jadx │ │ - gdb │ │ - semgrep ││ │
│ │ │ - MobSF │ │ - valgrind │ │ - eslint ││ │
│ │ └───────────────┘ └───────────────┘ └─────────────┘│ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ iOS │ │ Blockchain │ │ │
│ │ │ - class-dump │ │ - mythril │ │ │
│ │ │ - Clutch │ │ - slither │ │ │
│ │ │ - Frida │ │ - echidna │ │ │
│ │ │ - Hopper │ │ - manticore │ │ │
│ │ └───────────────┘ └───────────────┘ │ │
│ │ │ │
│ │ All workers have: │ │
│ │ - /app/toolbox mounted (workflow code) │ │
│ │ - /cache for MinIO downloads │ │
│ │ - Dynamic workflow discovery at startup │ │
│ └──────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
```
### Service Breakdown
```yaml
services:
  temporal:        # Workflow orchestration + embedded SQLite (dev) or Postgres (prod)
  minio:           # S3-compatible storage for targets and results
  minio-setup:     # One-time: create buckets, set policies
  worker-android:  # Android security vertical (scales independently)
  worker-rust:     # Rust/native security vertical
  worker-web:      # Web security vertical
  # Additional verticals as needed: ios, blockchain, go, etc.

# Total: 6+ services (scales with verticals)
```
### Resource Usage
```
Temporal: ~500MB (includes embedded DB in dev)
MinIO: ~256MB (with CI_CD=true flag)
MinIO-setup: ~20MB (ephemeral, exits after setup)
Worker-android: ~512MB (varies by toolchain)
Worker-rust: ~512MB
Worker-web: ~512MB
─────────────────────────
Total: ~2.3GB (vs 1.85GB Prefect = +24%)
Note: +450MB overhead is worth it for:
- Unified dev/prod architecture
- No host filesystem mounts (security)
- Auto cleanup (lifecycle policies)
- Multi-host ready
```
---
## Vertical Worker Model
### Concept
Instead of generic workers that spawn workflow-specific containers, we have **specialized long-lived workers** pre-built with complete security toolchains for specific domains.
### Vertical Taxonomy
| Vertical | Tools Included | Use Cases | Workflows |
|----------|---------------|-----------|-----------|
| **android** | apktool, jadx, Frida, MobSF, androguard | APK analysis, reverse engineering, dynamic instrumentation | APK security assessment, malware analysis, repackaging detection |
| **rust** | AFL++, cargo-fuzz, gdb, valgrind, AddressSanitizer | Native fuzzing, memory safety | Cargo fuzzing campaigns, binary analysis |
| **web** | Node.js, OWASP ZAP, Burp Suite, semgrep, eslint | Web app security testing | XSS detection, SQL injection scanning, API fuzzing |
| **ios** | class-dump, Clutch, Frida, Hopper, ios-deploy | iOS app analysis | IPA analysis, jailbreak detection, runtime hooking |
| **blockchain** | mythril, slither, echidna, manticore, solc | Smart contract security | Solidity static analysis, property-based fuzzing |
| **go** | go-fuzz, staticcheck, gosec, dlv | Go security testing | Go fuzzing, static analysis |
### Vertical Worker Architecture
```dockerfile
# Example: workers/android/Dockerfile
FROM python:3.11-slim

# Install Android SDK and tools
RUN apt-get update && apt-get install -y \
        openjdk-17-jdk \
        android-sdk \
    && rm -rf /var/lib/apt/lists/*

# Install security tools
RUN pip install --no-cache-dir \
        apktool \
        androguard \
        frida-tools \
        pyaxmlparser

# Install MobSF dependencies
RUN apt-get update && apt-get install -y \
        libxml2-dev \
        libxslt-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Temporal Python SDK
RUN pip install --no-cache-dir \
        temporalio \
        boto3 \
        pydantic

# Copy worker entrypoint
COPY worker.py /app/
WORKDIR /app

# Worker will mount /app/toolbox and discover workflows at runtime
CMD ["python", "worker.py"]
```
### Dynamic Workflow Discovery
```python
# workers/android/worker.py
import asyncio
import inspect
import logging
import os
from pathlib import Path

import yaml
from temporalio.client import Client
from temporalio.worker import Worker

logger = logging.getLogger(__name__)


async def discover_workflows(vertical: str):
    """Discover workflows for this vertical from the mounted toolbox"""
    workflows = []
    toolbox = Path("/app/toolbox/workflows")

    for workflow_dir in toolbox.iterdir():
        if not workflow_dir.is_dir():
            continue

        metadata_file = workflow_dir / "metadata.yaml"
        if not metadata_file.exists():
            continue

        # Parse metadata
        with open(metadata_file) as f:
            metadata = yaml.safe_load(f)

        # Check if workflow is for this vertical
        if metadata.get("vertical") == vertical:
            # Dynamically import workflow module
            workflow_module = f"toolbox.workflows.{workflow_dir.name}.workflow"
            module = __import__(workflow_module, fromlist=[''])

            # Find @workflow.defn decorated classes
            for name, obj in inspect.getmembers(module, inspect.isclass):
                if hasattr(obj, '__temporal_workflow_definition'):
                    workflows.append(obj)
                    logger.info(f"Discovered workflow: {name} for vertical {vertical}")

    return workflows


async def main():
    vertical = os.getenv("WORKER_VERTICAL", "android")
    temporal_address = os.getenv("TEMPORAL_ADDRESS", "localhost:7233")

    # Discover workflows for this vertical
    workflows = await discover_workflows(vertical)
    if not workflows:
        logger.warning(f"No workflows found for vertical: {vertical}")
        return

    # Connect to Temporal
    client = await Client.connect(temporal_address)

    # Start worker with discovered workflows
    worker = Worker(
        client,
        task_queue=f"{vertical}-queue",
        workflows=workflows,
        activities=[
            get_target_activity,
            cleanup_cache_activity,
            # ... vertical-specific activities
        ]
    )
    logger.info(f"Worker started for vertical '{vertical}' with {len(workflows)} workflows")
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```
### Workflow Declaration
```yaml
# toolbox/workflows/android_apk_analysis/metadata.yaml
name: android_apk_analysis
version: 1.0.0
description: "Deep analysis of Android APK files"
vertical: android  # ← Routes to worker-android
dependencies:
  python:
    - androguard==4.1.0    # Additional Python deps (optional)
    - pyaxmlparser==0.3.28
```
```python
# toolbox/workflows/android_apk_analysis/workflow.py
from datetime import timedelta

from temporalio import workflow


@workflow.defn
class AndroidApkAnalysisWorkflow:
    """
    Comprehensive Android APK security analysis

    Runs in worker-android with apktool, Frida, jadx pre-installed
    """

    @workflow.run
    async def run(self, target_id: str) -> dict:
        # Activity 1: Download target from MinIO
        apk_path = await workflow.execute_activity(
            "get_target",
            target_id,
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Activity 2: Extract manifest (uses apktool - pre-installed)
        manifest = await workflow.execute_activity(
            "extract_manifest",
            apk_path,
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Activity 3: Static analysis (uses jadx - pre-installed)
        static_results = await workflow.execute_activity(
            "static_analysis",
            apk_path,
            start_to_close_timeout=timedelta(minutes=30)
        )

        # Activity 4: Frida instrumentation (uses Frida - pre-installed)
        dynamic_results = await workflow.execute_activity(
            "dynamic_analysis",
            apk_path,
            start_to_close_timeout=timedelta(hours=2)
        )

        # Activity 5: Cleanup local cache
        await workflow.execute_activity(
            "cleanup_cache",
            apk_path,
            start_to_close_timeout=timedelta(minutes=1)
        )

        return {
            "manifest": manifest,
            "static": static_results,
            "dynamic": dynamic_results
        }
```
---
## Storage Strategy (MinIO)
### Why MinIO?
**Goal:** Unified storage that works identically in dev and production, eliminating environment-specific code.
**Alternatives considered:**
1. **LocalVolumeStorage** (mount /Users, /home): Security risk, platform-specific, doesn't scale
2. **Different storage per environment**: Complex, error-prone, dual maintenance
3. **MinIO everywhere**: Lightweight (+256MB), S3-compatible, multi-host ready
### MinIO Configuration
```yaml
# docker-compose.yaml
services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"  # S3 API
      - "9001:9001"  # Web Console (http://localhost:9001)
    volumes:
      - minio_data:/data
    environment:
      MINIO_ROOT_USER: fuzzforge
      MINIO_ROOT_PASSWORD: fuzzforge123
      MINIO_CI_CD: "true"  # Reduces memory to 256MB (from 1GB)
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 5s
      timeout: 5s
      retries: 5

  # One-time setup: create buckets and set lifecycle policies
  minio-setup:
    image: minio/mc:latest
    depends_on:
      minio:
        condition: service_healthy
    entrypoint: >
      /bin/sh -c "
      mc alias set fuzzforge http://minio:9000 fuzzforge fuzzforge123;
      mc mb fuzzforge/targets --ignore-existing;
      mc mb fuzzforge/results --ignore-existing;
      mc ilm add fuzzforge/targets --expiry-days 7;
      mc anonymous set download fuzzforge/results;
      "
```
### Storage Backend Implementation
```python
# backend/src/storage/s3_cached.py
import logging
import os
from datetime import datetime
from pathlib import Path
from uuid import uuid4

import boto3

logger = logging.getLogger(__name__)


class S3CachedStorage:
    """
    S3-compatible storage with local caching.

    Works with MinIO (dev/prod) or AWS S3 (cloud).
    """

    def __init__(self):
        self.s3 = boto3.client(
            's3',
            endpoint_url=os.getenv('S3_ENDPOINT', 'http://minio:9000'),
            aws_access_key_id=os.getenv('S3_ACCESS_KEY', 'fuzzforge'),
            aws_secret_access_key=os.getenv('S3_SECRET_KEY', 'fuzzforge123')
        )
        self.bucket = os.getenv('S3_BUCKET', 'targets')
        self.cache_dir = Path(os.getenv('CACHE_DIR', '/cache'))
        self.cache_max_size = self._parse_size(os.getenv('CACHE_MAX_SIZE', '10GB'))
        self.cache_ttl = self._parse_duration(os.getenv('CACHE_TTL', '7d'))

    async def upload_target(self, file_path: Path, user_id: str) -> str:
        """Upload target to MinIO and return target ID"""
        target_id = str(uuid4())

        # Upload with metadata for lifecycle management
        self.s3.upload_file(
            str(file_path),
            self.bucket,
            f'{target_id}/target',
            ExtraArgs={
                'Metadata': {
                    'user_id': user_id,
                    'uploaded_at': datetime.now().isoformat(),
                    'filename': file_path.name
                }
            }
        )

        logger.info(f"Uploaded target {target_id} ({file_path.name})")
        return target_id

    async def get_target(self, target_id: str) -> Path:
        """
        Get target from cache or download from MinIO.
        Returns local path to cached file.
        """
        cache_path = self.cache_dir / target_id
        cached_file = cache_path / "target"

        # Check cache
        if cached_file.exists():
            # Update access time for LRU
            cached_file.touch()
            logger.info(f"Cache hit: {target_id}")
            return cached_file

        # Cache miss - download from MinIO
        logger.info(f"Cache miss: {target_id}, downloading from MinIO")
        cache_path.mkdir(parents=True, exist_ok=True)
        self.s3.download_file(
            self.bucket,
            f'{target_id}/target',
            str(cached_file)
        )
        return cached_file

    async def cleanup_cache(self):
        """LRU eviction when cache exceeds max size"""
        cache_files = []
        total_size = 0

        for cache_file in self.cache_dir.rglob('*'):
            if cache_file.is_file():
                stat = cache_file.stat()
                cache_files.append({
                    'path': cache_file,
                    'size': stat.st_size,
                    'atime': stat.st_atime
                })
                total_size += stat.st_size

        if total_size > self.cache_max_size:
            # Sort by access time (oldest first)
            cache_files.sort(key=lambda x: x['atime'])
            for file_info in cache_files:
                if total_size <= self.cache_max_size:
                    break
                file_info['path'].unlink()
                total_size -= file_info['size']
                logger.info(f"Evicted from cache: {file_info['path']}")
```
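The `_parse_size` and `_parse_duration` helpers referenced in `__init__` are not shown above; a minimal standalone sketch of what they could look like (assumed behavior, not the actual backend code):

```python
def parse_size(value: str) -> int:
    """Parse sizes like '10GB' or '256MB' into bytes."""
    units = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}
    value = value.strip().upper()
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(float(value[:-len(suffix)]) * factor)
    return int(value)  # plain byte count

def parse_duration(value: str) -> int:
    """Parse durations like '7d' or '12h' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    value = value.strip().lower()
    return int(float(value[:-1]) * units[value[-1]])
```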
### Performance Characteristics
| Operation | Direct Filesystem | MinIO (Local) | Impact |
|-----------|------------------|---------------|---------|
| Small file (<1MB) | ~1ms | ~5-10ms | Negligible for security workflows |
| Large file (>100MB) | ~200ms | ~220ms | ~10% overhead |
| Workflow duration | 5-60 minutes | 5-60 minutes + 2-4s upload | <1% overhead |
| Subsequent scans | Same | **Cached (0ms)** | Better than filesystem |
**Verdict:** 2-4 second upload overhead is **negligible** for workflows that run 5-60 minutes.
### Workspace Isolation
To support concurrent workflows safely, FuzzForge implements workspace isolation with three modes:
**1. Isolated Mode (Default)**
```python
# Each workflow run gets its own workspace
cache_path = f"/cache/{target_id}/{run_id}/workspace/"
```
- **Use for:** Fuzzing workflows that modify files (corpus, crashes)
- **Advantages:** Safe for concurrent execution, no file conflicts
- **Cleanup:** Entire run directory removed after workflow completes
**2. Shared Mode**
```python
# All runs share the same workspace
cache_path = f"/cache/{target_id}/workspace/"
```
- **Use for:** Read-only analysis workflows (security scanning, static analysis)
- **Advantages:** Efficient (downloads once), lower bandwidth/storage
- **Cleanup:** No cleanup (workspace persists for reuse)
**3. Copy-on-Write Mode**
```python
# Download once to shared location, copy per run
shared_cache = f"/cache/{target_id}/shared/workspace/"
run_cache = f"/cache/{target_id}/{run_id}/workspace/"
```
- **Use for:** Large targets that need isolation
- **Advantages:** Download once, isolated per-run execution
- **Cleanup:** Run-specific copies removed, shared cache persists
**Configuration:**
Workflows specify isolation mode in `metadata.yaml`:
```yaml
name: atheris_fuzzing
workspace_isolation: "isolated" # or "shared" or "copy-on-write"
```
Workers automatically handle download, extraction, and cleanup based on the mode.
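The three modes above boil down to how the cache path is resolved per run; a sketch of that resolution (function name hypothetical, paths taken from the examples above):

```python
from pathlib import PurePosixPath

def workspace_path(mode: str, target_id: str, run_id: str) -> str:
    """Resolve the cache path for a workflow run given its isolation mode."""
    base = PurePosixPath("/cache") / target_id
    if mode == "isolated":
        return str(base / run_id / "workspace")
    if mode == "shared":
        return str(base / "workspace")
    if mode == "copy-on-write":
        # download lands in shared/, then is copied to this per-run path
        return str(base / run_id / "workspace")
    raise ValueError(f"unknown workspace_isolation mode: {mode}")
```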
---
## Dynamic Workflow Loading
### The Problem
**Requirement:** Workflows must be dynamically added without modifying the codebase or rebuilding Docker images.
**Traditional approach (doesn't work):**
- Build Docker image per workflow with dependencies
- Push to registry
- Worker pulls and spawns container
- ❌ Requires rebuild for every workflow change
- ❌ Registry overhead
- ❌ Slow (5-10s startup per workflow)
**Our approach (works):**
- Workflow code mounted as volume into long-lived workers
- Workers scan `/app/toolbox/workflows` at startup
- Dynamically import and register workflows matching vertical
- ✅ No rebuild needed
- ✅ No registry
- ✅ Zero startup overhead
### Implementation
**1. Docker Compose volume mount:**
```yaml
worker-android:
  volumes:
    - ./toolbox:/app/toolbox:ro  # Mount workflow code as read-only
```
**2. Worker discovers workflows:**
```python
# Runs at worker startup
for workflow_dir in Path("/app/toolbox/workflows").iterdir():
    metadata = yaml.safe_load((workflow_dir / "metadata.yaml").read_text())

    # Only load workflows for this vertical
    if metadata.get("vertical") == os.getenv("WORKER_VERTICAL"):
        # Dynamically import workflow.py
        module = importlib.import_module(f"toolbox.workflows.{workflow_dir.name}.workflow")
        # Find @workflow.defn classes
        workflows.append(module.MyWorkflowClass)
```
**3. Developer adds workflow:**
```bash
# 1. Create workflow directory
mkdir -p toolbox/workflows/my_new_workflow
# 2. Write metadata
cat > toolbox/workflows/my_new_workflow/metadata.yaml <<EOF
vertical: android
EOF
# 3. Write workflow
cat > toolbox/workflows/my_new_workflow/workflow.py <<EOF
from temporalio import workflow
@workflow.defn
class MyNewWorkflow:
@workflow.run
async def run(self, target_id: str):
# Implementation
pass
EOF
# 4. Restart worker to pick up new workflow
docker-compose restart worker-android
# Done! No image building, no registry push
```
### Hot Reload (Optional Advanced Feature)
```python
# Worker watches /app/toolbox for file changes
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler


class WorkflowReloadHandler(FileSystemEventHandler):
    def on_modified(self, event):
        if event.src_path.endswith(".py"):
            logger.info(f"Detected workflow change: {event.src_path}")
            # Reload workflow without restarting worker
            reload_workflows()


observer = Observer()
observer.schedule(WorkflowReloadHandler(), "/app/toolbox/workflows", recursive=True)
observer.start()
```
---
## Architecture Principles
### 1. Vertical Specialization
**Principle:** Each worker is specialized for a security domain with pre-built toolchains.
**Benefits:**
- Faster workflow execution (tools already installed)
- Better resource utilization (long-lived workers)
- Clear marketing positioning (sell verticals, not orchestration)
- Easier development (known toolchain per vertical)
### 2. Unified Storage
**Principle:** Same storage backend (MinIO) in development and production.
**Benefits:**
- No environment-specific code
- Easier testing (dev = prod)
- Multi-host ready from day one
- Better security (no host mounts)
### 3. Dynamic Workflow Discovery
**Principle:** Workflows are discovered and loaded at runtime, not compile-time.
**Benefits:**
- Add workflows without rebuilding images
- No registry overhead
- Faster iteration for developers
- Supports user-contributed workflows
### 4. Environment-Driven Configuration
**Principle:** All configuration via environment variables, no hardcoded values.
**Required Variables:**
```bash
# Worker configuration
TEMPORAL_ADDRESS=temporal:7233
WORKER_VERTICAL=android
MAX_CONCURRENT_ACTIVITIES=5
# Storage configuration
S3_ENDPOINT=http://minio:9000
S3_ACCESS_KEY=fuzzforge
S3_SECRET_KEY=fuzzforge123
S3_BUCKET=targets
# Cache configuration
CACHE_DIR=/cache
CACHE_MAX_SIZE=10GB
CACHE_TTL=7d
```
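Values like `10GB` and `7d` are human-readable, so the worker has to normalize them on startup. A hypothetical helper for that parsing (the suffix rules and function names are assumptions, not the actual FuzzForge code):

```python
# Hypothetical parsers for the human-readable env values above; the supported
# suffixes are illustrative assumptions.
import os

def parse_size(value: str) -> int:
    """Convert '10GB' / '512MB' style strings to bytes."""
    units = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}
    for suffix, factor in units.items():
        if value.upper().endswith(suffix):
            return int(float(value[:-len(suffix)])) * factor
    return int(value)  # plain byte count

def parse_ttl(value: str) -> int:
    """Convert '7d' / '12h' / '30m' style strings to seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    if value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)  # plain second count

# Defaults match the fail-safe values described below
CACHE_MAX_SIZE = parse_size(os.getenv("CACHE_MAX_SIZE", "10GB"))
CACHE_TTL = parse_ttl(os.getenv("CACHE_TTL", "7d"))
```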
### 5. Fail-Safe Defaults
**Principle:** System works out-of-the-box with sensible defaults.
**Examples:**
- MinIO CI_CD mode (lightweight for dev)
- 7-day lifecycle policy (auto-cleanup)
- 10GB cache limit (prevent disk exhaustion)
- Embedded SQLite for Temporal (no Postgres in dev)
---
## Component Details
### Temporal Server
**Deployment Options:**
| Environment | Database | Memory | Notes |
|-------------|----------|--------|-------|
| Development | SQLite (embedded) | 500MB | Simple, no external DB |
| Production | PostgreSQL | 2GB | Clustered for HA |
**Configuration:**
```yaml
temporal:
  image: temporalio/auto-setup:latest
  ports:
    - "7233:7233"  # gRPC
    - "8233:8233"  # Web UI
  environment:
    - DB=sqlite  # or postgresql for prod
    - SQLITE_PRAGMA_journal_mode=WAL
```
### MinIO
**Resource Usage:**
- Memory: 256MB (CI_CD mode) to 1GB (production)
- CPU: Minimal (I/O bound)
- Disk: Depends on usage (recommend 100GB+)
**Configuration:**
```yaml
minio:
  image: minio/minio:latest
  environment:
    MINIO_CI_CD: "true"  # Lightweight mode
    MINIO_ROOT_USER: fuzzforge
    MINIO_ROOT_PASSWORD: fuzzforge123
```
**Web Console:** http://localhost:9001
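For reference, a hedged sketch of how a client might push a target into the `targets` bucket with boto3. The function name and `<target_id>/target` key layout are illustrative, and the credentials are the dev defaults from the config above, not necessarily what the actual FuzzForge client does.

```python
# Hypothetical upload helper -- endpoint, credentials, and key layout follow
# the dev defaults above, not necessarily the real FuzzForge client code.
def upload_target(path: str, target_id: str,
                  endpoint: str = "http://localhost:9000") -> str:
    """Upload a target tarball to the 'targets' bucket; return its object key."""
    import boto3  # deferred import so the sketch stands alone

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint,  # MinIO S3 API port (console is on 9001)
        aws_access_key_id="fuzzforge",
        aws_secret_access_key="fuzzforge123",
    )
    key = f"{target_id}/target"
    s3.upload_file(path, "targets", key)
    return key
```

Workers later `download_file` the same key into their local cache, so the upload path and the cache path share one storage convention.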
### Vertical Workers
**Base Requirements:**
- Python 3.11+
- Temporal Python SDK
- boto3 (S3 client)
- Domain-specific tools
**Scaling:**
```yaml
# Scale vertically (more concurrent activities per worker)
environment:
  MAX_CONCURRENT_ACTIVITIES: 10  # Default: 5
```
```bash
# Scale horizontally (more workers)
docker-compose up -d --scale worker-android=3
```
---
## Scaling Strategy
### Phase 1: Single Host (Now - 6 months)
**Configuration:**
```yaml
# 1 Temporal + 1 MinIO + 3-5 vertical workers
# Capacity: 15-50 concurrent workflows
# Cost: ~$430/month
```
**When to move to Phase 2:** Saturating single host (CPU >80%, memory >90%)
### Phase 2: Multi-Host (6-18 months)
**Configuration:**
```
Host 1: Temporal + MinIO
Host 2: 5× worker-android
Host 3: 5× worker-rust
Host 4: 5× worker-web
```
**Changes required:**
```yaml
# Point all workers to central Temporal/MinIO
environment:
  TEMPORAL_ADDRESS: temporal.prod.fuzzforge.ai:7233
  S3_ENDPOINT: http://minio.prod.fuzzforge.ai:9000
```
**Capacity:** 3× Phase 1 = 45-150 concurrent workflows
### Phase 3: Nomad Cluster (18+ months, if needed)
**Trigger Points:**
- Managing 10+ hosts manually
- Need auto-scaling based on queue depth
- Need multi-tenancy (customer namespaces)
**Migration effort:** 1-2 weeks (workers unchanged, just change deployment method)
---
## File Lifecycle Management
### Automatic Cleanup via MinIO Lifecycle Policies
```bash
# Set on bucket (done by minio-setup service)
mc ilm add fuzzforge/targets --expiry-days 7
# MinIO automatically deletes objects older than 7 days
```
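Where `mc` is not available, the same policy can be set programmatically. A hedged boto3 equivalent of the command above (the rule ID is illustrative and the credentials are the dev defaults; this is not the actual minio-setup service code):

```python
# Hypothetical boto3 equivalent of `mc ilm add fuzzforge/targets --expiry-days 7`.
# Rule ID and credentials are illustrative dev values.
def set_target_expiry(days: int = 7, endpoint: str = "http://localhost:9000") -> None:
    """Expire objects in the 'targets' bucket after `days` days."""
    import boto3  # deferred import so the sketch stands alone

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id="fuzzforge",
        aws_secret_access_key="fuzzforge123",
    )
    s3.put_bucket_lifecycle_configuration(
        Bucket="targets",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-targets",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},      # apply to every object
                "Expiration": {"Days": days},  # MinIO deletes after N days
            }]
        },
    )
```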
### Local Cache Eviction (LRU)
```python
import asyncio

# Worker background task (runs every 30 minutes)
async def cleanup_cache_task():
    while True:
        await storage.cleanup_cache()  # LRU eviction
        await asyncio.sleep(1800)      # 30 minutes
```
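The `storage.cleanup_cache()` call above can be sketched as a plain least-recently-used sweep over the cache directory. This is a minimal illustration, assuming mtime approximates recency and a flat byte limit; the real S3CachedStorage may track usage differently.

```python
# Minimal LRU eviction sketch for the worker cache directory. The mtime-based
# ordering and byte limit are assumptions, not the actual S3CachedStorage code.
import os
from pathlib import Path

def cleanup_cache(cache_dir: str, max_bytes: int) -> int:
    """Delete least-recently-used files until the cache fits under max_bytes.

    Returns the number of files evicted.
    """
    files = sorted(
        (p for p in Path(cache_dir).rglob("*") if p.is_file()),
        key=lambda p: p.stat().st_mtime,  # oldest (least recently used) first
    )
    total = sum(p.stat().st_size for p in files)
    evicted = 0
    for path in files:
        if total <= max_bytes:
            break
        size = path.stat().st_size
        path.unlink()
        total -= size
        evicted += 1
    return evicted
```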
### Manual Deletion (API)
```python
@app.delete("/api/targets/{target_id}")
async def delete_target(target_id: str):
    """Allow users to manually delete uploaded targets"""
    s3.delete_object(Bucket="targets", Key=f"{target_id}/target")
    return {"status": "deleted"}
```
### Retention Policies
| Object Type | Default TTL | Configurable | Notes |
|-------------|-------------|--------------|-------|
| Uploaded targets | 7 days | Yes (env var) | Auto-deleted by MinIO |
| Worker cache | LRU (10GB limit) | Yes | Evicted when cache full |
| Workflow results | 30 days (optional) | Yes | Can store in MinIO |
---
## Future: Nomad Migration
### When to Add Nomad?
**Trigger points:**
- Managing 10+ hosts manually becomes painful
- Need auto-scaling based on queue depth
- Need multi-tenancy with resource quotas
- Want sophisticated scheduling (bin-packing, affinity rules)
**Estimated timing:** 18-24 months
### Migration Complexity
**Effort:** 1-2 weeks
**What changes:**
- Deployment method (docker-compose → Nomad jobs)
- Orchestration layer (manual → Nomad scheduler)
**What stays the same:**
- Worker Docker images (unchanged)
- Workflows (unchanged)
- Temporal (unchanged)
- MinIO (unchanged)
- Storage backend (unchanged)
### Nomad Job Example
```hcl
job "fuzzforge-worker-android" {
  datacenters = ["dc1"]
  type        = "service"

  group "workers" {
    count = 5  # Auto-scales based on queue depth

    scaling {
      min = 1
      max = 20

      policy {
        evaluation_interval = "30s"

        check "queue_depth" {
          source = "prometheus"
          query  = "temporal_queue_depth{queue='android-queue'}"

          strategy "target-value" {
            target = 10  # Scale up if >10 tasks queued
          }
        }
      }
    }

    task "worker" {
      driver = "docker"

      config {
        image = "fuzzforge/worker-android:latest"
        volumes = [
          "/opt/fuzzforge/toolbox:/app/toolbox:ro"
        ]
      }

      env {
        TEMPORAL_ADDRESS = "temporal.service.consul:7233"
        WORKER_VERTICAL  = "android"
        S3_ENDPOINT      = "http://minio.service.consul:9000"
      }

      resources {
        cpu    = 500  # MHz
        memory = 512  # MB
      }
    }
  }
}
```
### Licensing Considerations
**Nomad BSL 1.1 Risk:** Depends on FuzzForge positioning
**Safe positioning (LOW risk):**
- ✅ Market as "Android/Rust/Web security verticals"
- ✅ Emphasize domain expertise, not orchestration
- ✅ Nomad is internal infrastructure
- ✅ Customers buy security services, not Nomad
**Risky positioning (MEDIUM risk):**
- ⚠️ Market as "generic workflow orchestration platform"
- ⚠️ Emphasize flexibility over domain expertise
- ⚠️ Could be seen as competing with HashiCorp
**Mitigation:**
- Keep marketing focused on security verticals
- Get legal review before Phase 3
- Alternative: Use Kubernetes (Apache 2.0, zero risk)
---
## Migration Timeline
### Phase 1: Foundation (Weeks 1-2)
- ✅ Create feature branch
- Set up Temporal docker-compose
- Add MinIO service
- Implement S3CachedStorage backend
- Create cleanup/lifecycle logic
### Phase 2: First Vertical Worker (Weeks 3-4)
- Design worker base template
- Create worker-rust with AFL++, cargo-fuzz
- Implement dynamic workflow discovery
- Test workflow loading from mounted volume
### Phase 3: Migrate Workflows (Weeks 5-6)
- Port security_assessment workflow to Temporal
- Update workflow metadata format
- Test end-to-end flow (upload → analyze → results)
- Verify cleanup/lifecycle
### Phase 4: Additional Verticals (Weeks 7-8)
- Create worker-android, worker-web
- Document vertical development guide
- Update CLI for MinIO uploads
- Update backend API for Temporal
### Phase 5: Testing & Docs (Weeks 9-10)
- Comprehensive testing
- Update README
- Migration guide for existing users
- Troubleshooting documentation
**Total: 10 weeks, rollback possible at any phase**
---
## Decision Log
### 2025-09-30: Initial Architecture Decision
- **Decision:** Migrate from Prefect to Temporal
- **Rationale:** Simpler infrastructure, better reliability, clear scaling path
### 2025-10-01: Vertical Worker Model
- **Decision:** Use long-lived vertical workers instead of ephemeral per-workflow containers
- **Rationale:**
- Zero startup overhead (5s saved per workflow)
- Pre-built toolchains (Android, Rust, Web, etc.)
- Dynamic workflows via mounted volumes (no image rebuild)
- Better marketing (sell verticals, not orchestration)
- Safer Nomad BSL positioning
### 2025-10-01: Unified MinIO Storage
- **Decision:** Use MinIO for both dev and production (no LocalVolumeStorage)
- **Rationale:**
- Unified codebase (no environment-specific code)
- Lightweight (256MB with CI_CD=true)
- Negligible overhead (2-4s for 250MB upload)
- Better security (no host filesystem mounts)
- Multi-host ready
- Automatic cleanup via lifecycle policies
### 2025-10-01: Dynamic Workflow Loading
- **Decision:** Mount workflow code as volume, discover at runtime
- **Rationale:**
- Add workflows without rebuilding images
- No registry overhead
- Supports user-contributed workflows
- Faster iteration for developers
---
**Document Version:** 2.0
**Last Updated:** 2025-10-01
**Next Review:** After Phase 1 implementation (2 weeks)