fuzzforge_ai/MIGRATION_DECISION.md
Commit 60ca088ecf by tduhamel42: CI/CD Integration with Ephemeral Deployment Model (#14)
* feat: Complete migration from Prefect to Temporal

BREAKING CHANGE: Replaces Prefect workflow orchestration with Temporal

## Major Changes
- Replace Prefect with Temporal for workflow orchestration
- Implement vertical worker architecture (rust, android)
- Replace Docker registry with MinIO for unified storage
- Refactor activities to be co-located with workflows
- Update all API endpoints for Temporal compatibility

## Infrastructure
- New: docker-compose.temporal.yaml (Temporal + MinIO + workers)
- New: workers/ directory with rust and android vertical workers
- New: backend/src/temporal/ (manager, discovery)
- New: backend/src/storage/ (S3-cached storage with MinIO)
- New: backend/toolbox/common/ (shared storage activities)
- Deleted: docker-compose.yaml (old Prefect setup)
- Deleted: backend/src/core/prefect_manager.py
- Deleted: backend/src/services/prefect_stats_monitor.py
- Deleted: Docker registry and insecure-registries requirement

## Workflows
- Migrated: security_assessment workflow to Temporal
- New: rust_test workflow (example/test workflow)
- Deleted: secret_detection_scan (Prefect-based, to be reimplemented)
- Activities now co-located with workflows for independent testing

## API Changes
- Updated: backend/src/api/workflows.py (Temporal submission)
- Updated: backend/src/api/runs.py (Temporal status/results)
- Updated: backend/src/main.py (727 lines, TemporalManager integration)
- Updated: All 16 MCP tools to use TemporalManager

## Testing
- All services healthy (Temporal, PostgreSQL, MinIO, workers, backend)
- All API endpoints functional
- End-to-end workflow test passed (72 findings from vulnerable_app)
- MinIO storage integration working (target upload/download, results)
- Worker activity discovery working (6 activities registered)
- Tarball extraction working
- SARIF report generation working

## Documentation
- ARCHITECTURE.md: Complete Temporal architecture documentation
- QUICKSTART_TEMPORAL.md: Getting started guide
- MIGRATION_DECISION.md: Why we chose Temporal over Prefect
- IMPLEMENTATION_STATUS.md: Migration progress tracking
- workers/README.md: Worker development guide

## Dependencies
- Added: temporalio>=1.6.0
- Added: boto3>=1.34.0 (MinIO S3 client)
- Removed: prefect>=3.4.18
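
Because MinIO speaks the standard S3 API, the boto3 client works against it unchanged. A minimal sketch of a target upload, with hypothetical endpoint, bucket name, and credentials:

```python
import boto3

# Hypothetical endpoint and credentials; MinIO is S3-compatible, so the
# standard boto3 client works without modification.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

def upload_target(run_id: str, tarball_path: str) -> str:
    """Upload a target tarball and return its object key."""
    key = f"targets/{run_id}.tar.gz"
    s3.upload_file(tarball_path, "fuzzforge-targets", key)  # bucket name assumed
    return key
```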

* feat: Add Python fuzzing vertical with Atheris integration

This commit implements a complete Python fuzzing workflow using Atheris:

## Python Worker (workers/python/)
- Dockerfile with Python 3.11, Atheris, and build tools
- Generic worker.py for dynamic workflow discovery
- requirements.txt with temporalio, boto3, atheris dependencies
- Added to docker-compose.temporal.yaml with dedicated cache volume
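
The generic worker.py boils down to connecting a Temporal worker to the vertical's task queue with whatever workflows and activities it has discovered. A sketch of the wiring; the import paths are hypothetical, standing in for the dynamic discovery the real worker performs:

```python
import asyncio
from temporalio.client import Client
from temporalio.worker import Worker

# In the real worker.py these are discovered dynamically from the mounted
# workflow directory; fixed imports keep this sketch self-contained.
from toolbox.workflows.atheris_fuzzing.workflow import AtherisFuzzingWorkflow  # hypothetical path
from toolbox.common.storage import download_target, upload_results  # hypothetical activities

async def main() -> None:
    client = await Client.connect("temporal:7233")
    worker = Worker(
        client,
        task_queue="python",  # the vertical's name, used for routing
        workflows=[AtherisFuzzingWorkflow],
        activities=[download_target, upload_results],
    )
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())
```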

## AtherisFuzzer Module (backend/toolbox/modules/fuzzer/)
- Reusable module extending BaseModule
- Auto-discovers fuzz targets (fuzz_*.py, *_fuzz.py, fuzz_target.py)
- Recursive search to find targets in nested directories
- Dynamically loads TestOneInput() function
- Configurable max_iterations and timeout
- Real-time stats callback support for live monitoring
- Returns findings as ModuleFinding objects
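
The discovery step reduces to a recursive glob over the listed filename patterns. A minimal sketch:

```python
from pathlib import Path

# Patterns matched by the auto-discovery described above.
FUZZ_TARGET_PATTERNS = ("fuzz_*.py", "*_fuzz.py", "fuzz_target.py")

def discover_fuzz_targets(root: Path) -> list[Path]:
    """Recursively collect candidate fuzz harnesses under `root`."""
    found: set[Path] = set()
    for pattern in FUZZ_TARGET_PATTERNS:
        found.update(root.rglob(pattern))
    return sorted(found)
```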

## Atheris Fuzzing Workflow (backend/toolbox/workflows/atheris_fuzzing/)
- Temporal workflow for orchestrating fuzzing
- Downloads user code from MinIO
- Executes AtherisFuzzer module
- Uploads results to MinIO
- Cleans up cache after execution
- metadata.yaml with vertical: python for routing

## Test Project (test_projects/python_fuzz_waterfall/)
- Demonstrates stateful waterfall vulnerability
- main.py with check_secret() that leaks progress
- fuzz_target.py with Atheris TestOneInput() harness
- Complete README with usage instructions
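
For reference, a fuzz_target.py harness of the shape AtherisFuzzer discovers might look like this; the check_secret() signature is assumed from the project description:

```python
import sys
import atheris

with atheris.instrument_imports():
    import main  # the module under test from the test project

def TestOneInput(data: bytes) -> None:
    fdp = atheris.FuzzedDataProvider(data)
    # check_secret() signature assumed; the waterfall bug leaks progress
    # as more of the secret prefix matches.
    main.check_secret(fdp.ConsumeUnicodeNoSurrogates(32))

if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()
```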

## Backend Fixes
- Fixed parameter merging in REST API endpoints (workflows.py)
- Changed workflow parameter passing from positional args to kwargs (manager.py)
- Default parameters now properly merged with user parameters

## Testing
- Worker discovered AtherisFuzzingWorkflow
- Workflow executed end-to-end successfully
- Fuzz target auto-discovered in nested directories
- Atheris ran 100,000 iterations
- Results uploaded and cache cleaned

* chore: Complete Temporal migration with updated CLI/SDK/docs

This commit includes all remaining Temporal migration changes:

## CLI Updates (cli/)
- Updated workflow execution commands for Temporal
- Enhanced error handling and exceptions
- Updated dependencies in uv.lock

## SDK Updates (sdk/)
- Client methods updated for Temporal workflows
- Updated models for new workflow execution
- Updated dependencies in uv.lock

## Documentation Updates (docs/)
- Architecture documentation for Temporal
- Workflow concept documentation
- Resource management documentation (new)
- Debugging guide (new)
- Updated tutorials and how-to guides
- Troubleshooting updates

## README Updates
- Main README with Temporal instructions
- Backend README
- CLI README
- SDK README

## Other
- Updated IMPLEMENTATION_STATUS.md
- Removed old vulnerable_app.tar.gz

These changes complete the Temporal migration and ensure the
CLI/SDK work correctly with the new backend.

* fix: Use positional args instead of kwargs for Temporal workflows

The Temporal Python SDK's start_workflow() method doesn't accept
a 'kwargs' parameter. Workflows must receive parameters as positional
arguments via the 'args' parameter.

Changed from passing a 'kwargs' parameter to:
  args=workflow_args  # Positional arguments

This fixes the error:
  TypeError: Client.start_workflow() got an unexpected keyword argument 'kwargs'

Workflows now correctly receive parameters in order:
- security_assessment: [target_id, scanner_config, analyzer_config, reporter_config]
- atheris_fuzzing: [target_id, target_file, max_iterations, timeout_seconds]
- rust_test: [target_id, test_message]
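
In temporalio terms, the fix looks roughly like this; the workflow ID, queue name, and surrounding function are illustrative:

```python
from temporalio.client import Client

async def submit_security_assessment(params: dict) -> str:
    client = await Client.connect("temporal:7233")
    # Multi-argument workflows receive parameters through the `args`
    # sequence; Client.start_workflow() has no `kwargs` parameter.
    handle = await client.start_workflow(
        "SecurityAssessmentWorkflow",
        args=[
            params["target_id"],
            params.get("scanner_config", {}),
            params.get("analyzer_config", {}),
            params.get("reporter_config", {}),
        ],
        id=f"security-assessment-{params['target_id']}",
        task_queue="python",  # illustrative queue name
    )
    return handle.id
```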

* fix: Filter metadata-only parameters from workflow arguments

SecurityAssessmentWorkflow was receiving 7 arguments instead of 2-5.
The issue was that target_path and volume_mode from default_parameters
were being passed to the workflow, when they should only be used by
the system for configuration.

Now filters out metadata-only parameters (target_path, volume_mode)
before passing arguments to workflow execution.
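
A sketch of the filtering, with the metadata key set taken from the commit description:

```python
# Keys consumed by the system itself and never passed to workflow code.
METADATA_ONLY_PARAMS = {"target_path", "volume_mode"}

def build_workflow_params(defaults: dict, user_params: dict) -> dict:
    """Merge defaults with user parameters, then drop system-only keys."""
    merged = {**defaults, **user_params}
    return {k: v for k, v in merged.items() if k not in METADATA_ONLY_PARAMS}
```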

* refactor: Remove Prefect leftovers and volume mounting legacy

Complete cleanup of Prefect migration artifacts:

Backend:
- Delete registry.py and workflow_discovery.py (Prefect-specific files)
- Remove Docker validation from setup.py (no longer needed)
- Remove ResourceLimits and VolumeMount models
- Remove target_path and volume_mode from WorkflowSubmission
- Remove supported_volume_modes from API and discovery
- Clean up metadata.yaml files (remove volume/path fields)
- Simplify parameter filtering in manager.py

SDK:
- Remove volume_mode parameter from client methods
- Remove ResourceLimits and VolumeMount models
- Remove Prefect error patterns from docker_logs.py
- Clean up WorkflowSubmission and WorkflowMetadata models

CLI:
- Remove Volume Modes display from workflow info

All removed features are Prefect-specific or Docker volume mounting
artifacts. Temporal workflows use MinIO storage exclusively.

* feat: Add comprehensive test suite and benchmark infrastructure

- Add 68 unit tests for fuzzer, scanner, and analyzer modules
- Implement pytest-based test infrastructure with fixtures
- Add 6 performance benchmarks with category-specific thresholds
- Configure GitHub Actions for automated testing and benchmarking
- Add test and benchmark documentation

Test coverage:
- AtherisFuzzer: 8 tests
- CargoFuzzer: 14 tests
- FileScanner: 22 tests
- SecurityAnalyzer: 24 tests

All tests passing (68/68)
All benchmarks passing (6/6)

* fix: Resolve all ruff linting violations across codebase

Fixed 27 ruff violations in 12 files:
- Removed unused imports (Depends, Dict, Any, Optional, etc.)
- Fixed undefined workflow_info variable in workflows.py
- Removed dead code with undefined variables in atheris_fuzzer.py
- Changed f-string to regular string where no placeholders used

All files now pass ruff checks for CI/CD compliance.

* fix: Configure CI for unit tests only

- Renamed docker-compose.temporal.yaml → docker-compose.yml for CI compatibility
- Commented out integration-tests job (no integration tests yet)
- Updated test-summary to only depend on lint and unit-tests

CI will now run successfully with 68 unit tests. Integration tests can be added later.

* feat: Add CI/CD integration with ephemeral deployment model

Implements comprehensive CI/CD support for FuzzForge with on-demand worker management:

**Worker Management (v0.7.0)**
- Add WorkerManager for automatic worker lifecycle control
- Auto-start workers from stopped state when workflows execute
- Auto-stop workers after workflow completion
- Health checks and startup timeout handling (90s default)

**CI/CD Features**
- `--fail-on` flag: Fail builds based on SARIF severity levels (error/warning/note/info)
- `--export-sarif` flag: Export findings in SARIF 2.1.0 format
- `--auto-start`/`--auto-stop` flags: Control worker lifecycle
- Exit code propagation: Returns 1 on blocking findings, 0 on success
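
The severity gate can be expressed as a simple threshold over SARIF result levels. A sketch of the logic, not the CLI's actual implementation:

```python
import json

# Ordered from least to most severe, matching the --fail-on choices above.
LEVELS = ["info", "note", "warning", "error"]

def has_blocking_findings(sarif_path: str, fail_on: str) -> bool:
    """Return True if any result is at or above the --fail-on threshold."""
    threshold = LEVELS.index(fail_on)
    with open(sarif_path) as f:
        report = json.load(f)
    for run in report.get("runs", []):
        for result in run.get("results", []):
            level = result.get("level", "warning")  # SARIF's default for failures
            if level in LEVELS and LEVELS.index(level) >= threshold:
                return True
    return False
```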

**Exit Code Fix**
- Add `except typer.Exit: raise` handlers at 3 critical locations
- Move worker cleanup to finally block for guaranteed execution
- Exit codes now propagate correctly even when build fails
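
The pattern behind the fix: typer.Exit is an exception, so a broad `except Exception` handler would swallow it and the process would exit 0. A sketch, where start_worker(), run_scan(), and stop_worker() are hypothetical helpers:

```python
import typer

def scan_command(auto_stop: bool = True) -> None:
    worker_started = False
    try:
        worker_started = start_worker()    # hypothetical auto-start helper
        blocking = run_scan()              # hypothetical scan + --fail-on gate
        if blocking:
            raise typer.Exit(code=1)
    except typer.Exit:
        raise  # re-raise so the exit code survives the broad handler below
    except Exception as exc:
        typer.echo(f"error: {exc}", err=True)
        raise typer.Exit(code=1)
    finally:
        if worker_started and auto_stop:
            stop_worker()  # hypothetical cleanup; finally guarantees it runs
```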

**CI Scripts & Examples**
- ci-start.sh: Start FuzzForge services with health checks
- ci-stop.sh: Clean shutdown with volume preservation option
- GitHub Actions workflow example (security-scan.yml)
- GitLab CI pipeline example (.gitlab-ci.example.yml)
- docker-compose.ci.yml: CI-optimized compose file with profiles

**OSS-Fuzz Integration**
- New ossfuzz_campaign workflow for running OSS-Fuzz projects
- OSS-Fuzz worker with Docker-in-Docker support
- Configurable campaign duration and project selection

**Documentation**
- Comprehensive CI/CD integration guide (docs/how-to/cicd-integration.md)
- Updated architecture docs with worker lifecycle details
- Updated workspace isolation documentation
- CLI README with worker management examples

**SDK Enhancements**
- Add get_workflow_worker_info() endpoint
- Worker vertical metadata in workflow responses

**Testing**
- All workflows tested: security_assessment, atheris_fuzzing, secret_detection, cargo_fuzzing
- All monitoring commands tested: stats, crashes, status, finding
- Full CI pipeline simulation verified
- Exit codes verified for success/failure scenarios

Ephemeral CI/CD model: ~3-4GB RAM, ~60-90s startup, runs entirely in CI containers.

* fix: Resolve ruff linting violations in CI/CD code

- Remove unused variables (run_id, defaults, result)
- Remove unused imports
- Fix f-string without placeholders

All CI/CD integration files now pass ruff checks.

FuzzForge AI: Migration Decision Document

Date: 2025-10-01 (Updated)
Status: Architecture Revised - Ready for Implementation
Decision Makers: FuzzingLabs Team
Recommendation: Migrate to Temporal with Vertical Workers + MinIO


🔄 CRITICAL UPDATE (2025-10-01)

Initial analysis was incomplete. The original architecture document missed a critical requirement:

"Workflows are dynamic and have to be created without modifying the codebase"

What Changed

The original plan proposed "no registry needed" with long-lived workers, but failed to address how dynamic workflows with custom dependencies would work. This created a fundamental contradiction.

Revised Architecture

New approach: Vertical Workers + MinIO

| Aspect | Original Plan | Revised Plan |
|---|---|---|
| Workers | Generic long-lived | Vertical-specific (Android, Rust, Web, iOS, etc.) |
| Toolchains | Install per workflow | Pre-built per vertical |
| Workflows | Unclear | Mounted as volume (no rebuild) |
| Storage | LocalVolumeStorage (dev) / S3 (prod) | MinIO everywhere (unified) |
| Target Access | Host filesystem mounts | Upload to MinIO (secure) |
| Registry | Eliminated | Eliminated (workflows in volume, not images) |
| Services | 1 (Temporal only) | 6 (Temporal + MinIO + 3+ vertical workers) |
| Memory | "~4.5GB" | ~2.3GB (realistic calculation) |
Key Insights

  1. Dynamic workflows ARE compatible with long-lived workers via volume mounting
  2. Verticals solve the toolchain problem (pre-built, no per-workflow installs)
  3. MinIO is lightweight (256MB with CI_CD=true) and provides unified storage
  4. No registry overhead (workflow code mounted, not built into images)
  5. Better marketing (sell "security verticals", not "orchestration platform")

What This Means

  • Migration still recommended
  • Timeline extended to 10 weeks (from 8)
  • More services but better architecture
  • Addresses all original pain points
  • Supports dynamic workflows correctly

See ARCHITECTURE.md v2.0 for full details.


Table of Contents

  1. Executive Summary
  2. Current State Analysis
  3. Proposed Solution: Temporal Migration
  4. For & Against: Temporal vs Prefect
  5. For & Against: Long-Lived vs Ephemeral Workers
  6. Future Consideration: Nomad vs Kubernetes vs Docker Compose
  7. Benefits Summary
  8. Risks & Mitigations
  9. Cost Analysis
  10. Timeline & Effort
  11. Licensing Considerations
  12. Recommendation

Executive Summary

The Proposal

Migrate from Prefect to Temporal for workflow orchestration, simplifying infrastructure from 6 services to 1 while maintaining module architecture and preparing for future scale.

Why Consider This?

Current Prefect setup has grown complex with:

  • 6 services to manage (Prefect, Postgres, Redis, Registry, Docker-proxy, Worker)
  • Unclear scaling path for high-volume production
  • Registry overhead for module isolation
  • Complex volume mounting configuration

Key Decision Points

| Decision | Recommendation | Timeline |
|---|---|---|
| Replace Prefect? | Yes - with Temporal | Now (Weeks 1-8) |
| Worker Strategy? | Long-lived containers | Now (Weeks 3-4) |
| Storage Strategy? | Abstract layer (Local→S3) | Now (Week 3) |
| Add Nomad? | Later - when 10+ hosts | 18-24 months |
| Add Kubernetes? | No - unnecessary complexity | N/A |

Bottom Line

Recommended: Proceed with Temporal migration.

  • Effort: 8 weeks, Medium complexity
  • Risk: Low (rollback possible, modules unchanged)
  • Benefit: 83% infrastructure reduction, clear scaling path, better reliability

Current State Analysis

Prefect Architecture (Current)

Infrastructure:
├─ Prefect Server (orchestration)
├─ Postgres (metadata storage)
├─ Redis (task queue)
├─ Docker Registry (image sharing)
├─ Docker Proxy (container isolation)
└─ Prefect Worker (execution)

Total: 6 services

Strengths of Current Setup

| Aspect | Rating | Notes |
|---|---|---|
| Familiarity | High | Team knows Prefect well |
| Functionality | Good | Workflows execute successfully |
| Module System | Excellent | BaseModule interface is solid |
| Documentation | Good | Internal docs exist |

Pain Points

| Issue | Impact | Frequency | Severity |
|---|---|---|---|
| Infrastructure Complexity | Managing 6 services | Continuous | High |
| Registry Overhead | Push/pull for every deployment | Every change | Medium |
| Unclear Scaling | How to go multi-host? | Future planning | High |
| Resource Usage | ~8GB under load | Continuous | Medium |
| Volume Mounting | Complex job_variables config | Every workflow | Medium |

Why Change Now?

  1. Planning for Scale: Need clear path from 1 host → multi-host → cluster
  2. Infrastructure Debt: 6 services growing harder to maintain
  3. Better Options Available: Temporal provides simpler, more scalable solution
  4. Module System Stable: Can migrate orchestration without touching modules
  5. Right Time: Before production scale makes migration harder

Proposed Solution: Temporal Migration

Target Architecture

Infrastructure:
├─ Temporal Server (orchestration + storage)
└─ Worker Pools (3 types, auto-discover modules)

Total: 1 service (+ workers)

Migration Phases

Phase 1: Single Host (Weeks 1-8)

  • Replace Prefect with Temporal
  • Long-lived worker pools
  • LocalVolumeStorage (volume mounts)
  • Capacity: 15-50 concurrent workflows

Phase 2: Multi-Host (Months 6-18)

  • Same architecture, multiple hosts
  • Switch to S3CachedStorage
  • Capacity: 3× Phase 1

Phase 3: Nomad Cluster (Months 18+, if needed)

  • Add Nomad for advanced orchestration
  • Auto-scaling, multi-tenancy
  • Capacity: Unlimited horizontal scaling

For & Against: Temporal vs Prefect

Option A: Keep Prefect (Status Quo)

For (Arguments to Keep Prefect)

  1. No Migration Effort

    • Zero weeks of migration work
    • No learning curve
    • No risk of migration issues
  2. Team Familiarity

    • Team knows Prefect well
    • Existing operational runbooks
    • Established debugging patterns
  3. Working System

    • Current workflows function correctly
    • No immediate technical blocker
    • "If it ain't broke, don't fix it"
  4. Deferred Complexity

    • Can delay architecture decisions
    • Focus on feature development
    • Postpone infrastructure changes

Against (Arguments Against Keeping Prefect)

  1. Infrastructure Complexity

    • 6 services to manage and monitor
    • Complex dependencies (Postgres, Redis, Registry)
    • High operational overhead
  2. Scaling Uncertainty

    • Unclear how to scale beyond single host
    • Registry becomes bottleneck at scale
    • No clear multi-host story
  3. Resource Inefficiency

    • ~2GB idle, ~8GB under load
    • Registry storage overhead
    • Redundant service layers
  4. Technical Debt Accumulation

    • Complexity will only increase
    • Harder to migrate later (more workflows)
    • Missing modern features (durable execution)
  5. Prefect Ecosystem Concerns

    • Prefect 3.x changes from 2.x
    • Community split (Cloud vs self-hosted)
    • Uncertain long-term roadmap

Option B: Migrate to Temporal

For (Arguments to Migrate)

  1. Dramatic Simplification

    • 6 services → 1 service (83% reduction)
    • No registry needed (local images)
    • Simpler volume mounting
  2. Better Reliability

    • Durable execution (workflows survive crashes)
    • Built-in state persistence
    • Proven at massive scale (Netflix, Uber, Snap)
  3. Clear Scaling Path

    • Single host → Multi-host → Nomad cluster
    • Architecture designed for scale
    • Storage abstraction enables seamless transition
  4. Superior Workflow Engine

    • True durable execution vs task queue
    • Better state management
    • Handles long-running workflows (fuzzing campaigns)
    • Activity timeouts and retries built-in
  5. Operational Benefits

    • Better Web UI for debugging
    • Comprehensive workflow history
    • Query workflow state at any time
    • Simpler deployment (single service)
  6. Future-Proof Architecture

    • Easy Nomad migration path (18+ months)
    • Multi-tenancy ready (namespaces)
    • Auto-scaling capable
    • Industry momentum (growing adoption)
  7. Module Preservation

    • Zero changes to BaseModule interface
    • Module discovery unchanged
    • Workflows adapt easily (@flow → @workflow)
  8. Resource Efficiency

    • ~1GB idle, ~4.5GB under load
    • 44% reduction in resource usage
    • No registry storage overhead

Against (Arguments Against Migration)

  1. Migration Effort

    • 8 weeks of focused work
    • Team capacity diverted from features
    • Testing and validation required
  2. Learning Curve

    • New concepts (workflows vs activities)
    • Different debugging approach
    • Team training needed
  3. Migration Risk

    • Potential for workflow disruption
    • Bugs in migration code
    • Temporary performance issues
  4. Unknown Unknowns

    • May discover edge cases
    • Performance characteristics differ
    • Integration challenges possible
  5. Temporal Limitations

    • Less mature than Prefect in some areas
    • Smaller community (growing)
    • Fewer pre-built integrations

Scoring Matrix

| Criteria | Weight | Prefect | Temporal | Winner |
|---|---|---|---|---|
| Infrastructure Complexity | 25% | 3/10 | 9/10 | Temporal |
| Scalability | 20% | 4/10 | 9/10 | Temporal |
| Reliability | 20% | 7/10 | 10/10 | Temporal |
| Migration Effort | 15% | 10/10 | 4/10 | Prefect |
| Team Familiarity | 10% | 9/10 | 3/10 | Prefect |
| Resource Efficiency | 10% | 5/10 | 8/10 | Temporal |
| Total | 100% | 5.85/10 | 7.75/10 | Temporal |
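
Working the weighted sums: Prefect = 0.25×3 + 0.20×4 + 0.20×7 + 0.15×10 + 0.10×9 + 0.10×5 = 5.85; Temporal = 0.25×9 + 0.20×9 + 0.20×10 + 0.15×4 + 0.10×3 + 0.10×8 = 7.75.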

Conclusion: Temporal wins on technical merit despite migration costs.


For & Against: Long-Lived vs Ephemeral Workers

Context

Workers can spawn ephemeral containers per workflow (like Prefect) or run as long-lived containers processing multiple workflows.

Option A: Ephemeral Containers

For

  1. Complete Isolation

    • Each workflow in fresh container
    • No state leakage between workflows
    • Maximum security
  2. Automatic Cleanup

    • Containers destroyed after workflow
    • No resource leaks
    • Clean slate every time
  3. Matches Current Behavior

    • Similar to Prefect approach
    • Easier mental model
    • Less architecture change
  4. Simple Development

    • Test with docker run
    • No complex lifecycle management
    • Easy to debug

Against

  1. Performance Overhead

    • 5 second startup per container
    • At 450 workflows/hour: ~37.5 minutes of startup overhead wasted per hour (450 × 5s)
    • Unacceptable at production scale
  2. Resource Churn

    • Constant container creation/destruction
    • Docker daemon overhead
    • Network/volume setup repeated
  3. Scaling Limitations

    • Can't handle high-volume workloads
    • Startup overhead compounds
    • Poor resource utilization

Option B: Long-Lived Containers

For

  1. Zero Startup Overhead

    • Containers already running
    • Immediate workflow execution
    • Critical for high-volume production
  2. Resource Efficiency

    • Fixed 4.5GB RAM handles 15 concurrent workflows
    • vs ~76GB for ephemeral approach
    • 10-20× better resource utilization
  3. Predictable Performance

    • Consistent response times
    • No container startup jitter
    • Better SLA capability
  4. Horizontal Scaling

    • Add more workers linearly
    • Each worker handles N concurrent
    • Clear capacity planning
  5. Production-Ready

    • Proven pattern (Uber, Airbnb)
    • Handles thousands of workflows/day
    • Industry standard for scale

Against

  1. Volume Mounting Complexity

    • Must mount parent directories
    • Or implement S3 storage backend
    • More sophisticated configuration
  2. Shared Container State

    • Workers reused across workflows
    • Potential for subtle bugs
    • Requires careful module design
  3. Lifecycle Management

    • Must handle worker restarts
    • Graceful shutdown needed
    • More complex monitoring
  4. Memory Management

    • Workers accumulate memory over time
    • Need periodic restarts
    • Requires memory limits

Decision Matrix

| Scenario | Ephemeral | Long-Lived | Winner |
|---|---|---|---|
| Development | Simpler | ⚠️ Complex | Ephemeral |
| Low Volume (<10/hour) | Acceptable | Overkill | Ephemeral |
| Medium Volume (10-100/hour) | ⚠️ Wasteful | Efficient | Long-Lived |
| High Volume (>100/hour) | Unusable | Required | Long-Lived |
| Production Scale | No | Yes | Long-Lived |

Recommendation: Long-lived workers for production deployment.

Compromise: Can start with ephemeral for Phase 1 (proof of concept), migrate to long-lived for Phase 2 (production).


Future Consideration: Nomad vs Kubernetes vs Docker Compose

When to Consider Orchestration Beyond Docker Compose?

Trigger Points:

  • Managing 10+ hosts manually
  • Need multi-tenancy (customer isolation)
  • Require auto-scaling based on metrics
  • Want sophisticated scheduling (bin-packing, constraints)

Timeline Estimate: 18-24 months from now

Option A: Docker Compose

For

  1. Simplicity

    • Single YAML file
    • No cluster setup
    • Easy to understand and debug
  2. Zero Learning Curve

    • Team already knows Docker
    • Familiar commands
    • Abundant documentation
  3. Sufficient for 1-5 Hosts

    • Deploy same compose file to each host
    • Manual but manageable
    • Works for current scale
  4. Development Friendly

    • Same config dev and prod
    • Fast iteration cycle
    • Easy local testing
  5. No Lock-In

    • Easy to migrate to Nomad/K8s later
    • Workers portable by design
    • Clean exit strategy

Against

  1. Manual Coordination

    • No automatic scheduling
    • Manual load balancing
    • No health-based rescheduling
  2. Limited Scaling

    • Practical limit ~5-10 hosts
    • No auto-scaling
    • Manual capacity planning
  3. No Multi-Tenancy

    • Can't isolate customers
    • No resource quotas
    • Shared infrastructure
  4. Basic Monitoring

    • No cluster-wide metrics
    • Per-host monitoring only
    • Limited observability

Verdict: Perfect for Phase 1 (single host) and Phase 2 (3-5 hosts). Transition to Nomad/K8s at Phase 3.

Option B: Nomad

For

  1. Operational Simplicity

    • Single binary (vs K8s complexity)
    • Easy to install and maintain
    • Lower operational overhead
  2. Perfect Fit for Use Case

    • Batch workload focus
    • Resource management built-in
    • Namespace support for multi-tenancy
  3. Multi-Workload Support

    • Containers (Docker)
    • VMs (QEMU)
    • Bare processes
    • Java JARs
    • All in one scheduler
  4. Scheduling Intelligence

    • Bin-packing for efficiency
    • Constraint-based placement
    • Affinity/anti-affinity rules
    • Resource quotas per namespace
  5. Easy Migration from Docker Compose

    • Similar concepts
    • compose-to-nomad converter tool
    • Workers unchanged
    • 1-2 week migration
  6. HashiCorp Ecosystem

    • Integrates with Consul (service discovery)
    • Integrates with Vault (secrets)
    • Proven at scale (Cloudflare, CircleCI)
  7. Auto-Scaling

    • Built-in scaling policies
    • Prometheus integration
    • Queue-depth based scaling
    • Horizontal scaling automatic

Against

  1. Learning Curve

    • HCL syntax to learn
    • New concepts (allocations, deployments)
    • Consul integration complexity
  2. Smaller Ecosystem

    • Fewer tools than Kubernetes
    • Smaller community
    • Less third-party integrations
  3. Network Isolation

    • Less sophisticated than K8s
    • Requires Consul Connect for service mesh
    • Weaker network policies
  4. Maturity

    • Less mature than Kubernetes
    • Fewer production battle stories
    • Evolving feature set

Verdict: Excellent choice once you outgrow Docker Compose. Simpler than K8s, and a good fit for FuzzForge's scale.

Option C: Kubernetes

For

  1. Industry Standard

    • Largest ecosystem
    • Most third-party integrations
    • Abundant expertise available
  2. Feature Richness

    • Sophisticated networking (Network Policies)
    • Advanced scheduling
    • Rich operator ecosystem
    • Helm charts for everything
  3. Multi-Tenancy

    • Strong namespace isolation
    • RBAC fine-grained
    • Network policies
    • Pod Security Policies
  4. Massive Scale

    • Proven to 5,000+ nodes
    • Google-scale reliability
    • Battle-tested
  5. Cloud Integration

    • Native on all clouds (EKS, GKE, AKS)
    • Managed offerings reduce complexity
    • Auto-scaling (HPA, Cluster Autoscaler)

Against

  1. Operational Complexity

    • High learning curve
    • Complex to set up and maintain
    • Requires dedicated ops team
  2. Resource Overhead

    • Control plane resource usage
    • etcd cluster management
    • More moving parts
  3. Overkill for Use Case

    • FuzzForge is batch workload, not microservices
    • Don't need K8s networking complexity
    • Simpler alternatives sufficient
  4. Container-Only

    • Can't run VMs easily
    • Can't run bare processes
    • Nomad more flexible
  5. Cost

    • Higher operational cost
    • More infrastructure required
    • Steeper learning investment

Verdict: Overkill for FuzzForge. Choose only if planning 1,000+ hosts or need extensive ecosystem.

Comparison Matrix

| Feature | Docker Compose | Nomad | Kubernetes |
|---|---|---|---|
| Operational Complexity | ★☆☆☆☆ (Lowest) | ★★☆☆☆ (Low) | ★★★★☆ (High) |
| Learning Curve | ★☆☆☆☆ (Easy) | ★★★☆☆ (Medium) | ★★★★★ (Steep) |
| Setup Time | Minutes | 1 day | 1-2 weeks |
| Best For | 1-5 hosts | 10-500 hosts | 500+ hosts |
| Auto-Scaling | No | Yes | Yes |
| Multi-Tenancy | No | Yes (Namespaces) | Yes (Advanced) |
| Workload Types | Containers | Containers + VMs + Processes | Containers (mainly) |
| Service Mesh | No | ⚠️ Via Consul Connect | Istio/Linkerd |
| Ecosystem Size | Medium | Small | Huge |
| Resource Efficiency | High | High | Medium |
| FuzzForge Fit | Phase 1-2 | Phase 3+ | ⚠️ Unnecessary |

Recommendation Timeline

Months 0-6:   Docker Compose (Single Host)
               └─ Simplest, fastest to implement

Months 6-18:  Docker Compose (Multi-Host)
               └─ Scale to 3-5 hosts manually

Months 18+:   Nomad (if needed)
               └─ Add when 10+ hosts or auto-scaling required

Never:        Kubernetes
               └─ Unless scale exceeds 500+ hosts

Benefits Summary

Infrastructure Benefits

| Metric | Current (Prefect) | Future (Temporal) | Improvement |
|---|---|---|---|
| Services to Manage | 6 | 1 | 83% reduction |
| Idle Memory Usage | ~2GB | ~1GB | 50% reduction |
| Load Memory Usage | ~8GB | ~4.5GB | 44% reduction |
| Docker Registry | Required | Not needed | Eliminated |
| Configuration Files | 6 service configs | 1 config | 83% simpler |
| Deployment Complexity | High | Low | Significant |

Operational Benefits

  1. Simpler Monitoring

    • 1 service vs 6
    • Single Web UI (Temporal)
    • Fewer alerts to configure
  2. Easier Debugging

    • Complete workflow history in Temporal
    • Query workflow state at any time
    • Better error visibility
  3. Faster Deployments

    • No registry push/pull
    • Restart 1 service vs 6
    • Quicker iteration cycles
  4. Better Reliability

    • Durable execution (workflows survive crashes)
    • Automatic retries built-in
    • State persistence guaranteed
  5. Clear Scaling Path

    • Phase 1: Single host (now)
    • Phase 2: Multi-host (6-18 months)
    • Phase 3: Nomad cluster (18+ months)

Developer Experience Benefits

  1. Local Development

    • Simpler docker-compose
    • Faster startup (fewer services)
    • Easier to reason about
  2. Module Development

    • No changes to BaseModule
    • Same discovery mechanism
    • Same testing approach
  3. Workflow Development

    • Better debugging tools (Temporal Web UI)
    • Workflow history visualization
    • Easier to test retry logic
  4. Onboarding

    • 1 service to understand vs 6
    • Clearer architecture
    • Less to learn

Risks & Mitigations

Risk 1: Migration Introduces Bugs

Likelihood: Medium · Impact: High · Risk Score: 6/10

Mitigation:

  • Phased migration (one workflow at a time)
  • Parallel run (Prefect + Temporal) during transition
  • Comprehensive testing before cutover
  • Rollback plan documented

Risk 2: Performance Degradation

Likelihood: Low · Impact: Medium · Risk Score: 3/10

Mitigation:

  • Load testing before production
  • Monitor key metrics during migration
  • Temporal proven at higher scale than current
  • Easy to tune worker concurrency

Risk 3: Team Learning Curve

Likelihood: High · Impact: Low · Risk Score: 4/10

Mitigation:

  • Training sessions on Temporal concepts
  • Pair programming during migration
  • Comprehensive documentation
  • Temporal has excellent docs

Risk 4: Unknown Edge Cases

Likelihood: Medium · Impact: Medium · Risk Score: 5/10

Mitigation:

  • Thorough testing with real workflows
  • Gradual rollout (dev → staging → production)
  • Keep Prefect running initially
  • Community support available

Risk 5: Module System Incompatibility

Likelihood: Very Low · Impact: High · Risk Score: 2/10

Mitigation:

  • Module interface preserved (BaseModule unchanged)
  • Only orchestration changes
  • Modules are decoupled from Prefect
  • Test suite validates module behavior

Risk 6: Long-Lived Worker Stability

Likelihood: Low · Impact: Medium · Risk Score: 3/10

Mitigation:

  • Proper resource limits (memory, CPU)
  • Periodic worker restarts (daily)
  • Monitoring for memory leaks
  • Health checks and auto-restart

Overall Risk Assessment

Total Risk Score: 23/60 (38%) - Medium-Low Risk

Conclusion: Risks are manageable with proper planning and mitigation strategies.


Cost Analysis

Current Costs (Prefect)

Infrastructure:

Single Host (8GB RAM, 4 CPU):
  - Cloud VM: $80-120/month
  - Or bare metal amortized: ~$50/month

Services Running:
  - Prefect Server: ~500MB
  - Postgres: ~200MB
  - Redis: ~100MB
  - Registry: ~500MB
  - Docker Proxy: ~50MB
  - Worker: ~500MB
  - Workflows: ~6GB (peak)
  Total: ~8GB

Development Time:
  - Maintenance: ~2 hours/week
  - Debugging: ~3 hours/week
  - Deployments: ~1 hour/week
  Total: 6 hours/week = $600/month (at $25/hour)

Monthly Total: ~$700/month

Future Costs (Temporal)

Phase 1 - Single Host:

Single Host (6GB RAM, 4 CPU):
  - Cloud VM: $60-80/month
  - Or bare metal amortized: ~$40/month

Services Running:
  - Temporal: ~1GB
  - Workers: ~3.5GB
  - Workflows: ~1GB (peak)
  Total: ~5.5GB

Development Time:
  - Maintenance: ~1 hour/week
  - Debugging: ~2 hours/week
  - Deployments: ~0.5 hour/week
  Total: 3.5 hours/week = $350/month

Monthly Total: ~$430/month

Phase 2 - Multi-Host (3 hosts):

3 Hosts + S3 Storage:
  - Cloud VMs: $180-240/month
  - S3 storage (1TB): ~$23/month
  - S3 transfer (100GB): ~$9/month

Development Time:
  - Maintenance: ~2 hours/week
  - Monitoring: ~2 hours/week
  Total: 4 hours/week = $400/month

Monthly Total: ~$670/month (3× capacity)

Phase 3 - Nomad Cluster (10+ hosts):

Nomad Cluster:
  - 3 Nomad servers: $120/month
  - 10 worker hosts: $800/month
  - S3 storage (5TB): ~$115/month
  - Load balancer: ~$20/month

Development Time:
  - Nomad maintenance: ~3 hours/week
  - Monitoring: ~3 hours/week
  Total: 6 hours/week = $600/month

Monthly Total: ~$1,655/month (10× capacity)

Cost Comparison

| Phase | Hosts | Capacity | Monthly Cost | Cost per Workflow |
|---|---|---|---|---|
| Current (Prefect) | 1 | 10K/day | $700 | $0.0023 |
| Phase 1 (Temporal) | 1 | 10K/day | $430 | $0.0014 |
| Phase 2 (Temporal) | 3 | 30K/day | $670 | $0.0007 |
| Phase 3 (Nomad) | 10 | 100K/day | $1,655 | $0.0005 |
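
Cost per workflow here is monthly cost divided by monthly volume (daily capacity × 30 days); e.g. Phase 1: $430 ÷ 300,000 ≈ $0.0014.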

Savings:

  • Phase 1 vs Current: $270/month (39% reduction)
  • Better cost efficiency as scale increases

Timeline & Effort

Phase 1: Temporal Migration (8 Weeks)

Week 1-2: Foundation

  • Deploy Temporal server
  • Remove Prefect infrastructure
  • Implement storage abstraction layer
  • Effort: 60-80 hours

Week 3-4: Workers

  • Create long-lived worker pools
  • Implement module auto-discovery
  • Configure Docker Compose
  • Effort: 60-80 hours

Week 5-6: Workflows

  • Migrate workflows to Temporal
  • Convert @flow → @workflow.defn
  • Test all workflows
  • Effort: 60-80 hours
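
The @flow → @workflow.defn conversion planned above is mostly mechanical. A sketch of the target shape, with an assumed activity name:

```python
from datetime import timedelta
from temporalio import workflow

# Prefect (before):            Temporal (after):
#   @flow                        @workflow.defn class + @workflow.run method
#   async def assess(...): ...

@workflow.defn
class SecurityAssessmentWorkflow:
    @workflow.run
    async def run(self, target_id: str, scanner_config: dict) -> dict:
        # Module execution moves into activities; the workflow only orchestrates.
        return await workflow.execute_activity(
            "run_security_scan",  # assumed activity name
            args=[target_id, scanner_config],
            start_to_close_timeout=timedelta(minutes=30),
        )
```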

Week 7: Integration

  • Update backend API
  • End-to-end testing
  • Load testing
  • Effort: 40-60 hours

Week 8: Documentation & Cleanup

  • Update documentation
  • Remove old code
  • Training sessions
  • Effort: 30-40 hours

Total Effort: 250-340 hours (~2 engineers for 2 months)

Phase 2: Multi-Host (When Needed)

Effort: 40-60 hours

  • Set up S3 storage
  • Deploy to multiple hosts
  • Configure load balancing
  • Test and validate

Phase 3: Nomad (If Needed)

Effort: 80-120 hours

  • Install Nomad cluster
  • Convert jobs to Nomad
  • Set up auto-scaling
  • Production deployment

Licensing Considerations

Overview

Critical Context: FuzzForge is a generic platform where modules and workflows "could be anything" - not limited to fuzzing or security analysis. This significantly impacts the licensing assessment, particularly for Nomad's Business Source License.

Temporal Licensing: SAFE

License: MIT License

Status: Fully open source, zero restrictions

Commercial Use:

  • Use in production
  • Sell services built on Temporal
  • Modify source code
  • Redistribute
  • Sublicense
  • Private use

Conclusion: Temporal has no licensing concerns for any use case. You can build any type of platform (fuzzing, security, generic workflows, orchestration-as-a-service) without legal risk.

Reference: https://github.com/temporalio/temporal/blob/master/LICENSE


Nomad Licensing: ⚠️ REQUIRES CAREFUL EVALUATION

License: Business Source License 1.1 (BSL 1.1)

Status: Source-available but with restrictions

BSL 1.1 Key Terms

Change Date: 4 years after each version release
Change License: Mozilla Public License 2.0 (MPL 2.0)

After 4 years: Each version becomes fully open source under MPL 2.0

The Critical Restriction

Additional Use Grant:
You may make use of the Licensed Work, provided that you do not use
the Licensed Work for a Competitive Offering.

A "Competitive Offering" is a commercial product or service that is:
1. Substantially similar to the capabilities of the Licensed Work
2. Offered to third parties on a paid or free basis

What This Means for FuzzForge

The licensing risk depends on how FuzzForge is marketed and positioned:

LIKELY SAFE: Specific Use Case Platform

If FuzzForge is marketed as a specialized platform for specific domains:

Examples:

  • "FuzzForge - Security Analysis Platform"
  • "FuzzForge - Automated Fuzzing Service"
  • "FuzzForge - Code Analysis Tooling"
  • "FuzzForge - Vulnerability Assessment Platform"

Why Safe:

  • Nomad is used internally for infrastructure
  • Customer is buying fuzzing/security services, not orchestration
  • Platform's value is the domain expertise, not the scheduler
  • Not competing with HashiCorp's offerings

⚠️ GRAY AREA: Generic Workflow Platform

If FuzzForge pivots to emphasize generic workflow capabilities:

Examples:

  • ⚠️ "FuzzForge - Workflow Orchestration Platform"
  • ⚠️ "FuzzForge - Run any containerized workload"
  • ⚠️ "FuzzForge - Generic task scheduler"
  • ⚠️ Marketing that emphasizes "powered by Nomad"

Why Risky:

  • Could be seen as competing with Nomad Enterprise
  • Offering similar capabilities to HashiCorp's products
  • Customer might use it as Nomad replacement

CLEARLY VIOLATES: Orchestration-as-a-Service

If FuzzForge becomes primarily an orchestration product:

Examples:

  • "FuzzForge Orchestrator - Schedule any workload"
  • "Nomad-as-a-Service powered by FuzzForge"
  • "Generic container orchestration platform"
  • Reselling Nomad capabilities with thin wrapper

Why Violation:

  • Directly competing with HashiCorp Nomad offerings
  • "Substantially similar" to Nomad's capabilities
  • Commercial offering of orchestration

Real-World Precedents

HashiCorp has NOT (as of 2025) aggressively enforced BSL against companies using their tools internally. The restriction targets:

  • Cloud providers offering "managed Nomad" services
  • Companies building Nomad competitors
  • Vendors reselling HashiCorp functionality

NOT targeting:

  • Companies using Nomad for internal infrastructure
  • SaaS platforms that happen to use Nomad
  • Domain-specific platforms (like FuzzForge's security focus)

Decision Tree: Should I Use Nomad?

┌─────────────────────────────────────┐
│ Is orchestration your core product? │
└─────────────────────────────────────┘
                  │
         ┌────────┴────────┐
         │                 │
        YES               NO
         │                 │
    ┌────┴────┐       ┌────┴────┐
    │ DON'T   │       │ What's   │
    │ USE     │       │ your     │
    │ NOMAD   │       │ value    │
    │         │       │ prop?    │
    └─────────┘       └─────┬────┘
                            │
                ┌───────────┴───────────┐
                │                       │
          Domain Expertise      Orchestration Features
          (Fuzzing, Security)   (Scheduling, Auto-scale)
                │                       │
           ┌────┴────┐            ┌────┴────┐
           │ SAFE TO │            │ RISKY - │
           │ USE     │            │ CONSULT │
           │ NOMAD   │            │ LAWYER  │
           └─────────┘            └─────────┘

FuzzForge Current Position

Current Positioning: Domain-specific security/analysis platform
Nomad Usage: Internal infrastructure (not customer-facing)
Risk Level: LOW (likely safe)

However, the team has stated that "modules and workflows could be anything", which suggests potential future expansion beyond the security domain.

If FuzzForge pivots to generic platform:

  • Risk increases from LOW → MEDIUM
  • Need legal review before Phase 3 (Nomad migration)
  • Consider Kubernetes as alternative

Kubernetes Licensing: SAFE

License: Apache License 2.0

Status: Fully open source, zero restrictions

Commercial Use:

  • Use in production
  • Sell services built on Kubernetes
  • Modify source code
  • Offer managed Kubernetes (AWS EKS, GCP GKE do this)
  • Build competitive offerings

Conclusion: Kubernetes has no licensing concerns whatsoever, even for orchestration-as-a-service offerings.


Docker Licensing: SAFE

License: Apache License 2.0

Status: Fully open source

Note: Docker Desktop has separate commercial licensing requirements for organizations >250 employees or >$10M revenue, but Docker Engine (which FuzzForge uses) remains free for all uses.


Licensing Recommendation Matrix

| Component | License | FuzzForge Risk | Recommendation |
|---|---|---|---|
| Temporal | MIT | None | Use freely |
| Docker Engine | Apache 2.0 | None | Use freely |
| Nomad | BSL 1.1 | ⚠️ Low-Medium | Safe if domain-specific |
| Kubernetes | Apache 2.0 | None | Safe alternative to Nomad |

Recommendations by Phase

Phase 1 & 2: Temporal + Docker Compose

Licenses: MIT (Temporal) + Apache 2.0 (Docker)
Risk: ZERO - Fully safe for any use case

Action: Proceed without legal review required


Phase 3: Adding Nomad (18+ months)

License: BSL 1.1
Risk: ⚠️ LOW-MEDIUM - Depends on positioning

Action Required BEFORE Migration:

  1. Clarify Product Positioning

    • Will FuzzForge market as generic platform?
    • Or remain domain-specific (security/fuzzing)?
  2. Legal Review (Recommended)

    • Consult IP lawyer familiar with BSL
    • Show marketing materials, website copy
    • Get written opinion on BSL compliance
    • Cost: $2,000-5,000 (one-time)
  3. Decision Point:

    IF positioning = domain-specific (security/fuzzing)
    THEN proceed with Nomad (low risk)
    
    ELSE IF positioning = generic platform
    THEN consider Kubernetes instead (zero risk)
    

Alternative: Use Kubernetes Instead of Nomad

If concerned about Nomad BSL risk:

Pros:

  • Zero licensing risk (Apache 2.0)
  • Can offer orchestration-as-a-service freely
  • Larger ecosystem and community
  • Managed offerings on all clouds

Cons:

  • Higher operational complexity than Nomad
  • Overkill for batch workload use case
  • Steeper learning curve

When to Choose K8s Over Nomad:

  • Planning to market as generic platform
  • Uncomfortable with BSL restrictions
  • Need absolute licensing certainty
  • Have K8s expertise already

Licensing Risk Summary

| Scenario | Temporal | Docker | Nomad | Kubernetes |
|---|---|---|---|---|
| Security platform (current) | Safe | Safe | Safe | Safe |
| Generic workflow platform | Safe | Safe | ⚠️ Risky | Safe |
| Orchestration-as-a-service | Safe | Safe | Violation | Safe |

Key Takeaways

  1. Temporal is completely safe - MIT license has zero restrictions for any use case

  2. Nomad's BSL depends on positioning:

    • Safe for domain-specific platforms (security, fuzzing)
    • ⚠️ Risky for generic workflow platforms
    • Violation for orchestration-as-a-service
  3. The team's statement matters: "modules could be anything" suggests generic-platform potential → increases Nomad risk

  4. Mitigation strategies:

    • Keep marketing focused on domain expertise
    • Get legal review before Phase 3 (Nomad)
    • Alternative: Use Kubernetes (Apache 2.0) instead
  5. Decision timing: No urgency - Nomad decision is 18+ months away (Phase 3)

  6. Recommended approach:

    Now → Phase 1-2:    Temporal + Docker Compose (zero risk)
    18 months → Phase 3: Re-evaluate positioning
                         → Domain-specific? Use Nomad
                         → Generic platform? Use Kubernetes
    

Recommendation

Primary Recommendation: PROCEED WITH TEMPORAL MIGRATION

Confidence Level: High (8/10)

Rationale

  1. Technical Benefits Outweigh Costs

    • 83% infrastructure reduction
    • 44% resource savings
    • Clear scaling path
    • Better reliability
  2. Manageable Risks

    • Low-medium risk profile
    • Good mitigation strategies
    • Rollback plan exists
    • Module system preserved
  3. Right Timing

    • Before production scale makes migration harder
    • Team capacity available
    • Module architecture stable
    • Clear 8-week timeline
  4. Future-Proof

    • Easy Nomad migration when needed
    • Multi-host ready (storage abstraction)
    • Industry-proven technology
    • Growing ecosystem

Phased Approach

Immediate (Now):

  • Approve Temporal migration
  • Allocate 2 engineers for 8 weeks
  • Set Week 1 start date

Near-Term (Months 1-6):

  • Complete Temporal migration
  • Validate in production
  • Optimize performance

Mid-Term (Months 6-18):

  • Monitor scaling needs
  • Implement S3 storage if needed
  • Expand to multi-host if needed

Long-Term (Months 18+):

  • Evaluate Nomad necessity
  • Migrate to Nomad if triggers met
  • Continue scaling horizontally

Decision Criteria

Proceed with Migration if:

  • Team agrees on benefits (CHECK)
  • 8-week timeline acceptable (CHECK)
  • Resources available (CHECK)
  • Risk profile acceptable (CHECK)

Defer Migration if:

  • Critical features launching soon (DEPENDS)
  • Team capacity constrained (DEPENDS)
  • Major Prefect improvements announced (UNLIKELY)

Alternative: Start Smaller

If full migration seems risky:

  1. Proof of Concept (2 weeks)

    • Migrate one simple workflow
    • Validate Temporal locally
    • Assess complexity
    • Decision point: Continue or abort
  2. Parallel Run (4 weeks)

    • Run Temporal alongside Prefect
    • Duplicate one workflow
    • Compare results
    • Build confidence
  3. Full Migration (6 weeks)

    • If POC successful, proceed
    • Migrate remaining workflows
    • Decommission Prefect

Total: 12 weeks (vs 8 weeks direct)


Appendix: Quick Reference

One-Page Summary

WHAT: Migrate from Prefect to Temporal
WHY: Simpler (6 services → 1), more scalable, better reliability
WHEN: Now (8 weeks)
WHO: 2 engineers
COST: $430/month (vs $700 current) = 39% savings
RISK: Medium-Low (manageable)
OUTCOME: Production-ready infrastructure with clear scaling path

Key Metrics

| Metric | Current | Future | Change |
|---|---|---|---|
| Services | 6 | 1 | -83% |
| Memory | 8GB | 4.5GB | -44% |
| Cost | $700/mo | $430/mo | -39% |
| Capacity | 10K/day | 10K/day | Same (Phase 1) |
| Dev Time | 6h/week | 3.5h/week | -42% |

Decision Checklist

  • Review this document with team
  • Discuss concerns and questions
  • Vote: Proceed / Defer / Reject
  • If proceed: Assign engineers
  • If proceed: Set start date
  • If defer: Set review date (3 months)
  • If reject: Document reasons

Document Version: 1.0
Last Updated: 2025-09-30
Next Review: After decision or in 3 months