FuzzForge AI Architecture
Last Updated: 2025-10-14
Status: Production - Temporal with Vertical Workers
Table of Contents
- Executive Summary
- Current Architecture (Temporal + Vertical Workers)
- Vertical Worker Model
- Storage Strategy (MinIO)
- Dynamic Workflow Loading
- Architecture Principles
- Component Details
- Scaling Strategy
- File Lifecycle Management
- Future: Nomad Migration
Executive Summary
The Architecture
FuzzForge runs on Temporal orchestration with a vertical worker architecture: each worker is pre-built with domain-specific security toolchains (Android, Rust, Web, iOS, Blockchain, OSS-Fuzz, etc.). MinIO provides unified S3-compatible storage across dev and production environments.
Key Architecture Features
- Vertical Specialization: Pre-built toolchains (Android: Frida, apktool; Rust: AFL++, cargo-fuzz)
- Zero Startup Overhead: Long-lived workers (no container spawn per workflow)
- Dynamic Workflows: Add workflows without rebuilding images (mount as volume)
- Unified Storage: MinIO works identically in dev and prod
- Better Security: No host filesystem mounts, isolated uploaded targets
- Automatic Cleanup: MinIO lifecycle policies handle file expiration
- Scalability: Clear path from single-host to multi-host to Nomad cluster
Current Architecture (Temporal + Vertical Workers)
Infrastructure Overview
┌───────────────────────────────────────────────────────────────┐
│ FuzzForge Platform │
│ │
│ ┌──────────────────┐ ┌─────────────────────────┐ │
│ │ Temporal Server │◄────────│ MinIO (S3 Storage) │ │
│ │ - Workflows │ │ - Uploaded targets │ │
│ │ - State mgmt │ │ - Results (optional) │ │
│ │ - Task queues │ │ - Lifecycle policies │ │
│ └────────┬─────────┘ └─────────────────────────┘ │
│ │ │
│ │ (Task queue routing) │
│ │ │
│ ┌────────┴────────────────────────────────────────────────┐ │
│ │ Vertical Workers (Long-lived) │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌─────────────┐│ │
│ │ │ Android │ │ Rust/Native │ │ Web/JS ││ │
│ │ │ - apktool │ │ - AFL++ │ │ - Node.js ││ │
│ │ │ - Frida │ │ - cargo-fuzz │ │ - OWASP ZAP ││ │
│ │ │ - jadx │ │ - gdb │ │ - semgrep ││ │
│ │ │ - MobSF │ │ - valgrind │ │ - eslint ││ │
│ │ └───────────────┘ └───────────────┘ └─────────────┘│ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ iOS │ │ Blockchain │ │ │
│ │ │ - class-dump │ │ - mythril │ │ │
│ │ │ - Clutch │ │ - slither │ │ │
│ │ │ - Frida │ │ - echidna │ │ │
│ │ │ - Hopper │ │ - manticore │ │ │
│ │ └───────────────┘ └───────────────┘ │ │
│ │ │ │
│ │ All workers have: │ │
│ │ - /app/toolbox mounted (workflow code) │ │
│ │ - /cache for MinIO downloads │ │
│ │ - Dynamic workflow discovery at startup │ │
│ └──────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
Service Breakdown
services:
temporal: # Workflow orchestration + embedded SQLite (dev) or Postgres (prod)
minio: # S3-compatible storage for targets and results
minio-setup: # One-time: create buckets, set policies
worker-android: # Android security vertical (scales independently)
worker-rust: # Rust/native security vertical
worker-web: # Web security vertical
# Additional verticals as needed: ios, blockchain, go, etc.
Total: 6+ services (scales with verticals)
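A fuller sketch of one worker service, assuming the environment variables and toolbox mount described later in this document (service and volume names are illustrative):
# docker-compose.yaml (sketch)
worker-android:
  build: ./workers/android
  environment:
    TEMPORAL_ADDRESS: temporal:7233
    WORKER_VERTICAL: android          # Routes tasks from the android-queue
    S3_ENDPOINT: http://minio:9000
  volumes:
    - ./toolbox:/app/toolbox:ro       # Workflow code, discovered at startup
    - worker_android_cache:/cache     # Local cache for MinIO downloads
  depends_on:
    - temporal
    - minio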
Resource Usage
Temporal: ~500MB (includes embedded DB in dev)
MinIO: ~256MB (with CI_CD=true flag)
MinIO-setup: ~20MB (ephemeral, exits after setup)
Worker-android: ~512MB (varies by toolchain)
Worker-rust: ~512MB
Worker-web: ~512MB
─────────────────────────
Total: ~2.3GB
Note: the ~450MB of added overhead is worth it for:
- Unified dev/prod architecture
- No host filesystem mounts (security)
- Auto cleanup (lifecycle policies)
- Multi-host ready
Vertical Worker Model
Concept
Instead of generic workers that spawn workflow-specific containers, we have specialized long-lived workers pre-built with complete security toolchains for specific domains.
Vertical Taxonomy
| Vertical | Tools Included | Use Cases | Workflows |
|---|---|---|---|
| android | apktool, jadx, Frida, MobSF, androguard | APK analysis, reverse engineering, dynamic instrumentation | APK security assessment, malware analysis, repackaging detection |
| rust | AFL++, cargo-fuzz, gdb, valgrind, AddressSanitizer | Native fuzzing, memory safety | Cargo fuzzing campaigns, binary analysis |
| web | Node.js, OWASP ZAP, Burp Suite, semgrep, eslint | Web app security testing | XSS detection, SQL injection scanning, API fuzzing |
| ios | class-dump, Clutch, Frida, Hopper, ios-deploy | iOS app analysis | IPA analysis, jailbreak detection, runtime hooking |
| blockchain | mythril, slither, echidna, manticore, solc | Smart contract security | Solidity static analysis, property-based fuzzing |
| go | go-fuzz, staticcheck, gosec, dlv | Go security testing | Go fuzzing, static analysis |
Vertical Worker Architecture
# Example: workers/android/Dockerfile
FROM python:3.11-slim
# Install Android SDK and tools
RUN apt-get update && apt-get install -y \
    openjdk-17-jdk \
    android-sdk \
    apktool \
    && rm -rf /var/lib/apt/lists/*
# Install Python security tools (apktool is a Java tool, installed via apt above)
RUN pip install --no-cache-dir \
    androguard \
    frida-tools \
    pyaxmlparser
# Install MobSF dependencies
RUN apt-get update && apt-get install -y \
libxml2-dev \
libxslt-dev \
&& rm -rf /var/lib/apt/lists/*
# Install Temporal Python SDK
RUN pip install --no-cache-dir \
temporalio \
boto3 \
pydantic
# Copy worker entrypoint
COPY worker.py /app/
WORKDIR /app
# Worker will mount /app/toolbox and discover workflows at runtime
CMD ["python", "worker.py"]
Dynamic Workflow Discovery
# workers/android/worker.py
import asyncio
import importlib
import inspect
import logging
import os
from pathlib import Path

import yaml
from temporalio.client import Client
from temporalio.worker import Worker

logger = logging.getLogger(__name__)
async def discover_workflows(vertical: str):
"""Discover workflows for this vertical from mounted toolbox"""
workflows = []
toolbox = Path("/app/toolbox/workflows")
for workflow_dir in toolbox.iterdir():
if not workflow_dir.is_dir():
continue
metadata_file = workflow_dir / "metadata.yaml"
if not metadata_file.exists():
continue
# Parse metadata
with open(metadata_file) as f:
metadata = yaml.safe_load(f)
# Check if workflow is for this vertical
if metadata.get("vertical") == vertical:
# Dynamically import workflow module
module = importlib.import_module(f"toolbox.workflows.{workflow_dir.name}.workflow")
# Find @workflow.defn decorated classes
for name, obj in inspect.getmembers(module, inspect.isclass):
if hasattr(obj, '__temporal_workflow_definition'):
workflows.append(obj)
logger.info(f"Discovered workflow: {name} for vertical {vertical}")
return workflows
async def main():
vertical = os.getenv("WORKER_VERTICAL", "android")
temporal_address = os.getenv("TEMPORAL_ADDRESS", "localhost:7233")
# Discover workflows for this vertical
workflows = await discover_workflows(vertical)
if not workflows:
logger.warning(f"No workflows found for vertical: {vertical}")
return
# Connect to Temporal
client = await Client.connect(temporal_address)
# Start worker with discovered workflows
worker = Worker(
client,
task_queue=f"{vertical}-queue",
workflows=workflows,
activities=[
get_target_activity,
cleanup_cache_activity,
# ... vertical-specific activities
]
)
logger.info(f"Worker started for vertical '{vertical}' with {len(workflows)} workflows")
await worker.run()
if __name__ == "__main__":
asyncio.run(main())
Workflow Declaration
# toolbox/workflows/android_apk_analysis/metadata.yaml
name: android_apk_analysis
version: 1.0.0
description: "Deep analysis of Android APK files"
vertical: android # ← Routes to worker-android
dependencies:
python:
- androguard==4.1.0 # Additional Python deps (optional)
- pyaxmlparser==0.3.28
# toolbox/workflows/android_apk_analysis/workflow.py
from temporalio import workflow
from datetime import timedelta
@workflow.defn
class AndroidApkAnalysisWorkflow:
"""
Comprehensive Android APK security analysis
Runs in worker-android with apktool, Frida, jadx pre-installed
"""
@workflow.run
async def run(self, target_id: str) -> dict:
# Activity 1: Download target from MinIO
apk_path = await workflow.execute_activity(
"get_target",
target_id,
start_to_close_timeout=timedelta(minutes=5)
)
# Activity 2: Extract manifest (uses apktool - pre-installed)
manifest = await workflow.execute_activity(
"extract_manifest",
apk_path,
start_to_close_timeout=timedelta(minutes=5)
)
# Activity 3: Static analysis (uses jadx - pre-installed)
static_results = await workflow.execute_activity(
"static_analysis",
apk_path,
start_to_close_timeout=timedelta(minutes=30)
)
# Activity 4: Frida instrumentation (uses Frida - pre-installed)
dynamic_results = await workflow.execute_activity(
"dynamic_analysis",
apk_path,
start_to_close_timeout=timedelta(hours=2)
)
# Activity 5: Cleanup local cache
await workflow.execute_activity(
"cleanup_cache",
apk_path,
start_to_close_timeout=timedelta(minutes=1)
)
return {
"manifest": manifest,
"static": static_results,
"dynamic": dynamic_results
}
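The activities referenced above are registered by the worker, not defined in the workflow file. A minimal sketch of the get_target activity, assuming the S3CachedStorage backend described in the next section (module path and names are illustrative):
# workers/android/activities.py (sketch)
from temporalio import activity

from storage import S3CachedStorage  # hypothetical import path

storage = S3CachedStorage()

@activity.defn(name="get_target")
async def get_target_activity(target_id: str) -> str:
    """Download the uploaded target from MinIO, or serve it from the local cache."""
    path = await storage.get_target(target_id)
    activity.logger.info(f"Target {target_id} available at {path}")
    return str(path)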
Storage Strategy (MinIO)
Why MinIO?
Goal: Unified storage that works identically in dev and production, eliminating environment-specific code.
Alternatives considered:
- ❌ LocalVolumeStorage (mount /Users, /home): Security risk, platform-specific, doesn't scale
- ❌ Different storage per environment: Complex, error-prone, dual maintenance
- ✅ MinIO everywhere: Lightweight (+256MB), S3-compatible, multi-host ready
MinIO Configuration
# docker-compose.yaml
services:
minio:
image: minio/minio:latest
command: server /data --console-address ":9001"
ports:
- "9000:9000" # S3 API
- "9001:9001" # Web Console (http://localhost:9001)
volumes:
- minio_data:/data
environment:
MINIO_ROOT_USER: fuzzforge
MINIO_ROOT_PASSWORD: fuzzforge123
MINIO_CI_CD: "true" # Reduces memory to 256MB (from 1GB)
healthcheck:
test: ["CMD", "mc", "ready", "local"]
interval: 5s
timeout: 5s
retries: 5
# One-time setup: create buckets and set lifecycle policies
minio-setup:
image: minio/mc:latest
depends_on:
minio:
condition: service_healthy
entrypoint: >
/bin/sh -c "
mc alias set fuzzforge http://minio:9000 fuzzforge fuzzforge123;
mc mb fuzzforge/targets --ignore-existing;
mc mb fuzzforge/results --ignore-existing;
mc ilm add fuzzforge/targets --expiry-days 7;
mc anonymous set download fuzzforge/results;
"
Storage Backend Implementation
# backend/src/storage/s3_cached.py
import logging
import os
from datetime import datetime, timedelta
from pathlib import Path
from uuid import uuid4

import boto3

logger = logging.getLogger(__name__)
class S3CachedStorage:
"""
S3-compatible storage with local caching.
Works with MinIO (dev/prod) or AWS S3 (cloud).
"""
def __init__(self):
self.s3 = boto3.client(
's3',
endpoint_url=os.getenv('S3_ENDPOINT', 'http://minio:9000'),
aws_access_key_id=os.getenv('S3_ACCESS_KEY', 'fuzzforge'),
aws_secret_access_key=os.getenv('S3_SECRET_KEY', 'fuzzforge123')
)
self.bucket = os.getenv('S3_BUCKET', 'targets')
self.cache_dir = Path(os.getenv('CACHE_DIR', '/cache'))
self.cache_max_size = self._parse_size(os.getenv('CACHE_MAX_SIZE', '10GB'))
self.cache_ttl = self._parse_duration(os.getenv('CACHE_TTL', '7d'))
async def upload_target(self, file_path: Path, user_id: str) -> str:
"""Upload target to MinIO and return target ID"""
target_id = str(uuid4())
# Upload with metadata for lifecycle management
self.s3.upload_file(
str(file_path),
self.bucket,
f'{target_id}/target',
ExtraArgs={
'Metadata': {
'user_id': user_id,
'uploaded_at': datetime.now().isoformat(),
'filename': file_path.name
}
}
)
logger.info(f"Uploaded target {target_id} ({file_path.name})")
return target_id
async def get_target(self, target_id: str) -> Path:
"""
Get target from cache or download from MinIO.
Returns local path to cached file.
"""
cache_path = self.cache_dir / target_id
cached_file = cache_path / "target"
# Check cache
if cached_file.exists():
# Update access time for LRU
cached_file.touch()
logger.info(f"Cache hit: {target_id}")
return cached_file
# Cache miss - download from MinIO
logger.info(f"Cache miss: {target_id}, downloading from MinIO")
cache_path.mkdir(parents=True, exist_ok=True)
self.s3.download_file(
self.bucket,
f'{target_id}/target',
str(cached_file)
)
return cached_file
async def cleanup_cache(self):
"""LRU eviction when cache exceeds max size"""
cache_files = []
total_size = 0
for cache_file in self.cache_dir.rglob('*'):
if cache_file.is_file():
stat = cache_file.stat()
cache_files.append({
'path': cache_file,
'size': stat.st_size,
'atime': stat.st_atime
})
total_size += stat.st_size
if total_size > self.cache_max_size:
# Sort by access time (oldest first)
cache_files.sort(key=lambda x: x['atime'])
for file_info in cache_files:
if total_size <= self.cache_max_size:
break
file_info['path'].unlink()
total_size -= file_info['size']
logger.info(f"Evicted from cache: {file_info['path']}")
Performance Characteristics
| Operation | Direct Filesystem | MinIO (Local) | Impact |
|---|---|---|---|
| Small file (<1MB) | ~1ms | ~5-10ms | Negligible for security workflows |
| Large file (>100MB) | ~200ms | ~220ms | ~10% overhead |
| Workflow duration | 5-60 minutes | 5-60 minutes + 2-4s upload | <1% overhead |
| Subsequent scans | Same | Cached (0ms) | Better than filesystem |
Verdict: 2-4 second upload overhead is negligible for workflows that run 5-60 minutes.
Workspace Isolation
To support concurrent workflows safely, FuzzForge implements workspace isolation with three modes:
1. Isolated Mode (Default)
# Each workflow run gets its own workspace
cache_path = f"/cache/{target_id}/{run_id}/workspace/"
- Use for: Fuzzing workflows that modify files (corpus, crashes)
- Advantages: Safe for concurrent execution, no file conflicts
- Cleanup: Entire run directory removed after workflow completes
2. Shared Mode
# All runs share the same workspace
cache_path = f"/cache/{target_id}/workspace/"
- Use for: Read-only analysis workflows (security scanning, static analysis)
- Advantages: Efficient (downloads once), lower bandwidth/storage
- Cleanup: No cleanup (workspace persists for reuse)
3. Copy-on-Write Mode
# Download once to shared location, copy per run
shared_cache = f"/cache/{target_id}/shared/workspace/"
run_cache = f"/cache/{target_id}/{run_id}/workspace/"
- Use for: Large targets that need isolation
- Advantages: Download once, isolated per-run execution
- Cleanup: Run-specific copies removed, shared cache persists
Configuration:
Workflows specify isolation mode in metadata.yaml:
name: atheris_fuzzing
workspace_isolation: "isolated" # or "shared" or "copy-on-write"
Workers automatically handle download, extraction, and cleanup based on the mode.
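A sketch of how a worker might resolve the workspace path from the declared mode (the function name and copy step are assumptions; the path layouts match the three modes above):
import shutil
from pathlib import Path

def resolve_workspace(target_id: str, run_id: str, mode: str) -> Path:
    """Map a workflow's isolation mode to its local workspace path."""
    cache = Path("/cache") / target_id
    if mode == "shared":
        return cache / "workspace"
    if mode == "copy-on-write":
        shared = cache / "shared" / "workspace"
        run_ws = cache / run_id / "workspace"
        if not run_ws.exists():
            shutil.copytree(shared, run_ws)  # Copy the shared download once per run
        return run_ws
    return cache / run_id / "workspace"      # "isolated" (default)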
Dynamic Workflow Loading
The Problem
Requirement: Workflows must be dynamically added without modifying the codebase or rebuilding Docker images.
Traditional approach (doesn't work):
- Build Docker image per workflow with dependencies
- Push to registry
- Worker pulls and spawns container
- ❌ Requires rebuild for every workflow change
- ❌ Registry overhead
- ❌ Slow (5-10s startup per workflow)
Our approach (works):
- Workflow code mounted as volume into long-lived workers
- Workers scan /app/toolbox/workflows at startup
- Dynamically import and register workflows matching the vertical
- ✅ No rebuild needed
- ✅ No registry
- ✅ Zero startup overhead
Implementation
1. Docker Compose volume mount:
worker-android:
volumes:
- ./toolbox:/app/toolbox:ro # Mount workflow code as read-only
2. Worker discovers workflows:
# Runs at worker startup (condensed; see worker.py above for the full version)
for workflow_dir in Path("/app/toolbox/workflows").iterdir():
    metadata = yaml.safe_load((workflow_dir / "metadata.yaml").read_text())
    # Only load workflows for this vertical
    if metadata.get("vertical") == os.getenv("WORKER_VERTICAL"):
        # Dynamically import workflow.py
        module = importlib.import_module(f"toolbox.workflows.{workflow_dir.name}.workflow")
        # Register its @workflow.defn classes (placeholder class name)
        workflows.append(module.MyWorkflowClass)
3. Developer adds workflow:
# 1. Create workflow directory
mkdir -p toolbox/workflows/my_new_workflow
# 2. Write metadata
cat > toolbox/workflows/my_new_workflow/metadata.yaml <<EOF
vertical: android
EOF
# 3. Write workflow
cat > toolbox/workflows/my_new_workflow/workflow.py <<EOF
from temporalio import workflow
@workflow.defn
class MyNewWorkflow:
@workflow.run
async def run(self, target_id: str):
# Implementation
pass
EOF
# 4. Restart worker to pick up new workflow
docker-compose restart worker-android
# Done! No image building, no registry push
Hot Reload (Optional Advanced Feature)
# Worker watches /app/toolbox for file changes
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class WorkflowReloadHandler(FileSystemEventHandler):
    def on_modified(self, event):
        if event.src_path.endswith(".py"):
            logger.info(f"Detected workflow change: {event.src_path}")
            # Reload workflow without restarting worker
            reload_workflows()

observer = Observer()
observer.schedule(WorkflowReloadHandler(), "/app/toolbox/workflows", recursive=True)
observer.start()
Architecture Principles
1. Vertical Specialization
Principle: Each worker is specialized for a security domain with pre-built toolchains.
Benefits:
- Faster workflow execution (tools already installed)
- Better resource utilization (long-lived workers)
- Clear marketing positioning (sell verticals, not orchestration)
- Easier development (known toolchain per vertical)
2. Unified Storage
Principle: Same storage backend (MinIO) in development and production.
Benefits:
- No environment-specific code
- Easier testing (dev = prod)
- Multi-host ready from day one
- Better security (no host mounts)
3. Dynamic Workflow Discovery
Principle: Workflows are discovered and loaded at runtime, not compile-time.
Benefits:
- Add workflows without rebuilding images
- No registry overhead
- Faster iteration for developers
- Supports user-contributed workflows
4. Environment-Driven Configuration
Principle: All configuration via environment variables, no hardcoded values.
Required Variables:
# Worker configuration
TEMPORAL_ADDRESS=temporal:7233
WORKER_VERTICAL=android
MAX_CONCURRENT_ACTIVITIES=5
# Storage configuration
S3_ENDPOINT=http://minio:9000
S3_ACCESS_KEY=fuzzforge
S3_SECRET_KEY=fuzzforge123
S3_BUCKET=targets
# Cache configuration
CACHE_DIR=/cache
CACHE_MAX_SIZE=10GB
CACHE_TTL=7d
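A sketch of how a worker might load this configuration with the documented defaults (the dataclass shape is an assumption; the variable names and defaults are the ones listed above):
import os
from dataclasses import dataclass

@dataclass
class WorkerConfig:
    temporal_address: str = os.getenv("TEMPORAL_ADDRESS", "localhost:7233")
    vertical: str = os.getenv("WORKER_VERTICAL", "android")
    max_concurrent_activities: int = int(os.getenv("MAX_CONCURRENT_ACTIVITIES", "5"))
    s3_endpoint: str = os.getenv("S3_ENDPOINT", "http://minio:9000")
    cache_dir: str = os.getenv("CACHE_DIR", "/cache")
    cache_max_size: str = os.getenv("CACHE_MAX_SIZE", "10GB")
    cache_ttl: str = os.getenv("CACHE_TTL", "7d")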
5. Fail-Safe Defaults
Principle: System works out-of-the-box with sensible defaults.
Examples:
- MinIO CI_CD mode (lightweight for dev)
- 7-day lifecycle policy (auto-cleanup)
- 10GB cache limit (prevent disk exhaustion)
- Embedded SQLite for Temporal (no Postgres in dev)
Component Details
Temporal Server
Deployment Options:
| Environment | Database | Memory | Notes |
|---|---|---|---|
| Development | SQLite (embedded) | 500MB | Simple, no external DB |
| Production | PostgreSQL | 2GB | Clustered for HA |
Configuration:
temporal:
image: temporalio/auto-setup:latest
ports:
- "7233:7233" # gRPC
- "8233:8233" # Web UI
environment:
- DB=sqlite # or postgresql for prod
- SQLITE_PRAGMA_journal_mode=WAL
MinIO
Resource Usage:
- Memory: 256MB (CI_CD mode) to 1GB (production)
- CPU: Minimal (I/O bound)
- Disk: Depends on usage (recommend 100GB+)
Configuration:
minio:
image: minio/minio:latest
environment:
MINIO_CI_CD: "true" # Lightweight mode
MINIO_ROOT_USER: fuzzforge
MINIO_ROOT_PASSWORD: fuzzforge123
Web Console: http://localhost:9001
Vertical Workers
Base Requirements:
- Python 3.11+
- Temporal Python SDK
- boto3 (S3 client)
- Domain-specific tools
Scaling:
# Scale vertically (more concurrent activities per worker)
environment:
MAX_CONCURRENT_ACTIVITIES: 10 # Default: 5
# Scale horizontally (more workers)
docker-compose up -d --scale worker-android=3
Scaling Strategy
Phase 1: Single Host (Now - 6 months)
Configuration:
# 1 Temporal + 1 MinIO + 3-5 vertical workers
# Capacity: 15-50 concurrent workflows
# Cost: ~$430/month
When to move to Phase 2: Saturating single host (CPU >80%, memory >90%)
Phase 2: Multi-Host (6-18 months)
Configuration:
Host 1: Temporal + MinIO
Host 2: 5× worker-android
Host 3: 5× worker-rust
Host 4: 5× worker-web
Changes required:
# Point all workers to central Temporal/MinIO
environment:
TEMPORAL_ADDRESS: temporal.prod.fuzzforge.ai:7233
S3_ENDPOINT: http://minio.prod.fuzzforge.ai:9000
Capacity: 3× Phase 1 = 45-150 concurrent workflows
Phase 3: Nomad Cluster (18+ months, if needed)
Trigger Points:
- Managing 10+ hosts manually
- Need auto-scaling based on queue depth
- Need multi-tenancy (customer namespaces)
Migration effort: 1-2 weeks (workers unchanged, just change deployment method)
File Lifecycle Management
Automatic Cleanup via MinIO Lifecycle Policies
# Set on bucket (done by minio-setup service)
mc ilm add fuzzforge/targets --expiry-days 7
# MinIO automatically deletes objects older than 7 days
Local Cache Eviction (LRU)
# Worker background task (runs every 30 minutes)
async def cleanup_cache_task():
while True:
await storage.cleanup_cache() # LRU eviction
await asyncio.sleep(1800) # 30 minutes
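The janitor can share the worker's event loop; a sketch of the wiring inside main() (assuming the worker object from the discovery example above):
# Run the Temporal worker and the cache janitor concurrently
await asyncio.gather(worker.run(), cleanup_cache_task())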
Manual Deletion (API)
# Sketch: assumes a FastAPI app object and a module-level boto3 s3 client
@app.delete("/api/targets/{target_id}")
async def delete_target(target_id: str):
    """Allow users to manually delete uploaded targets"""
    s3.delete_object(Bucket='targets', Key=f'{target_id}/target')
    return {"status": "deleted"}
Retention Policies
| Object Type | Default TTL | Configurable | Notes |
|---|---|---|---|
| Uploaded targets | 7 days | Yes (env var) | Auto-deleted by MinIO |
| Worker cache | LRU (10GB limit) | Yes | Evicted when cache full |
| Workflow results | 30 days (optional) | Yes | Can store in MinIO |
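To make the target TTL configurable as the table indicates, the minio-setup entrypoint could read it from the environment (TARGET_TTL_DAYS is an assumed variable name):
# In the minio-setup entrypoint (sketch)
mc ilm add fuzzforge/targets --expiry-days "${TARGET_TTL_DAYS:-7}"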
Future: Nomad Migration
When to Add Nomad?
Trigger points:
- Managing 10+ hosts manually becomes painful
- Need auto-scaling based on queue depth
- Need multi-tenancy with resource quotas
- Want sophisticated scheduling (bin-packing, affinity rules)
Estimated timing: 18-24 months
Migration Complexity
Effort: 1-2 weeks
What changes:
- Deployment method (docker-compose → Nomad jobs)
- Orchestration layer (manual → Nomad scheduler)
What stays the same:
- Worker Docker images (unchanged)
- Workflows (unchanged)
- Temporal (unchanged)
- MinIO (unchanged)
- Storage backend (unchanged)
Nomad Job Example
job "fuzzforge-worker-android" {
datacenters = ["dc1"]
type = "service"
group "workers" {
count = 5 # Auto-scales based on queue depth
scaling {
min = 1
max = 20
policy {
evaluation_interval = "30s"
check "queue_depth" {
source = "prometheus"
query = "temporal_queue_depth{queue='android-queue'}"
strategy "target-value" {
target = 10 # Scale up if >10 tasks queued
}
}
}
}
task "worker" {
driver = "docker"
config {
image = "fuzzforge/worker-android:latest"
volumes = [
"/opt/fuzzforge/toolbox:/app/toolbox:ro"
]
}
env {
TEMPORAL_ADDRESS = "temporal.service.consul:7233"
WORKER_VERTICAL = "android"
S3_ENDPOINT = "http://minio.service.consul:9000"
}
resources {
cpu = 500 # MHz
memory = 512 # MB
}
}
}
}
Licensing Considerations
Nomad BSL 1.1 Risk: Depends on FuzzForge positioning
Safe positioning (LOW risk):
- ✅ Market as "Android/Rust/Web security verticals"
- ✅ Emphasize domain expertise, not orchestration
- ✅ Nomad is internal infrastructure
- ✅ Customers buy security services, not Nomad
Risky positioning (MEDIUM risk):
- ⚠️ Market as "generic workflow orchestration platform"
- ⚠️ Emphasize flexibility over domain expertise
- ⚠️ Could be seen as competing with HashiCorp
Mitigation:
- Keep marketing focused on security verticals
- Get legal review before Phase 3
- Alternative: Use Kubernetes (Apache 2.0, zero risk)
Migration Timeline
Phase 1: Foundation (Weeks 1-2)
- ✅ Create feature branch
- Set up Temporal docker-compose
- Add MinIO service
- Implement S3CachedStorage backend
- Create cleanup/lifecycle logic
Phase 2: First Vertical Worker (Weeks 3-4)
- Design worker base template
- Create worker-rust with AFL++, cargo-fuzz
- Implement dynamic workflow discovery
- Test workflow loading from mounted volume
Phase 3: Migrate Workflows (Weeks 5-6)
- Port security_assessment workflow to Temporal
- Update workflow metadata format
- Test end-to-end flow (upload → analyze → results)
- Verify cleanup/lifecycle
Phase 4: Additional Verticals (Weeks 7-8)
- Create worker-android, worker-web
- Document vertical development guide
- Update CLI for MinIO uploads
- Update backend API for Temporal
Phase 5: Testing & Docs (Weeks 9-10)
- Comprehensive testing
- Update README
- Migration guide for existing users
- Troubleshooting documentation
Total: 10 weeks, rollback possible at any phase
Decision Log
2025-09-30: Architecture Implementation
- Decision: Temporal with Vertical Workers
- Rationale: Simpler infrastructure, better reliability, clear scaling path
2025-10-01: Vertical Worker Model
- Decision: Use long-lived vertical workers instead of ephemeral per-workflow containers
- Rationale:
- Zero startup overhead (5s saved per workflow)
- Pre-built toolchains (Android, Rust, Web, etc.)
- Dynamic workflows via mounted volumes (no image rebuild)
- Better marketing (sell verticals, not orchestration)
- Safer Nomad BSL positioning
2025-10-01: Unified MinIO Storage
- Decision: Use MinIO for both dev and production (no LocalVolumeStorage)
- Rationale:
- Unified codebase (no environment-specific code)
- Lightweight (256MB with CI_CD=true)
- Negligible overhead (2-4s for 250MB upload)
- Better security (no host filesystem mounts)
- Multi-host ready
- Automatic cleanup via lifecycle policies
2025-10-01: Dynamic Workflow Loading
- Decision: Mount workflow code as volume, discover at runtime
- Rationale:
- Add workflows without rebuilding images
- No registry overhead
- Supports user-contributed workflows
- Faster iteration for developers
Document Version: 2.0
Last Updated: 2025-10-01
Next Review: After Phase 1 implementation (2 weeks)