* feat: Complete migration from Prefect to Temporal

  BREAKING CHANGE: Replaces Prefect workflow orchestration with Temporal

  ## Major Changes
  - Replace Prefect with Temporal for workflow orchestration
  - Implement vertical worker architecture (rust, android)
  - Replace Docker registry with MinIO for unified storage
  - Refactor activities to be co-located with workflows
  - Update all API endpoints for Temporal compatibility

  ## Infrastructure
  - New: docker-compose.temporal.yaml (Temporal + MinIO + workers)
  - New: workers/ directory with rust and android vertical workers
  - New: backend/src/temporal/ (manager, discovery)
  - New: backend/src/storage/ (S3-cached storage with MinIO)
  - New: backend/toolbox/common/ (shared storage activities)
  - Deleted: docker-compose.yaml (old Prefect setup)
  - Deleted: backend/src/core/prefect_manager.py
  - Deleted: backend/src/services/prefect_stats_monitor.py
  - Deleted: Docker registry and insecure-registries requirement

  ## Workflows
  - Migrated: security_assessment workflow to Temporal
  - New: rust_test workflow (example/test workflow)
  - Deleted: secret_detection_scan (Prefect-based, to be reimplemented)
  - Activities now co-located with workflows for independent testing

  ## API Changes
  - Updated: backend/src/api/workflows.py (Temporal submission)
  - Updated: backend/src/api/runs.py (Temporal status/results)
  - Updated: backend/src/main.py (727 lines, TemporalManager integration)
  - Updated: All 16 MCP tools to use TemporalManager

  ## Testing
  - ✅ All services healthy (Temporal, PostgreSQL, MinIO, workers, backend)
  - ✅ All API endpoints functional
  - ✅ End-to-end workflow test passed (72 findings from vulnerable_app)
  - ✅ MinIO storage integration working (target upload/download, results)
  - ✅ Worker activity discovery working (6 activities registered)
  - ✅ Tarball extraction working
  - ✅ SARIF report generation working

  ## Documentation
  - ARCHITECTURE.md: Complete Temporal architecture documentation
  - QUICKSTART_TEMPORAL.md: Getting started guide
  - MIGRATION_DECISION.md: Why we chose Temporal over Prefect
  - IMPLEMENTATION_STATUS.md: Migration progress tracking
  - workers/README.md: Worker development guide

  ## Dependencies
  - Added: temporalio>=1.6.0
  - Added: boto3>=1.34.0 (MinIO S3 client)
  - Removed: prefect>=3.4.18

* feat: Add Python fuzzing vertical with Atheris integration

  This commit implements a complete Python fuzzing workflow using Atheris.

  ## Python Worker (workers/python/)
  - Dockerfile with Python 3.11, Atheris, and build tools
  - Generic worker.py for dynamic workflow discovery
  - requirements.txt with temporalio, boto3, atheris dependencies
  - Added to docker-compose.temporal.yaml with dedicated cache volume

  ## AtherisFuzzer Module (backend/toolbox/modules/fuzzer/)
  - Reusable module extending BaseModule
  - Auto-discovers fuzz targets (fuzz_*.py, *_fuzz.py, fuzz_target.py)
  - Recursive search to find targets in nested directories
  - Dynamically loads TestOneInput() function
  - Configurable max_iterations and timeout
  - Real-time stats callback support for live monitoring
  - Returns findings as ModuleFinding objects

  ## Atheris Fuzzing Workflow (backend/toolbox/workflows/atheris_fuzzing/)
  - Temporal workflow for orchestrating fuzzing
  - Downloads user code from MinIO
  - Executes AtherisFuzzer module
  - Uploads results to MinIO
  - Cleans up cache after execution
  - metadata.yaml with vertical: python for routing

  ## Test Project (test_projects/python_fuzz_waterfall/)
  - Demonstrates stateful waterfall vulnerability
  - main.py with check_secret() that leaks progress
  - fuzz_target.py with Atheris TestOneInput() harness
  - Complete README with usage instructions

  ## Backend Fixes
  - Fixed parameter merging in REST API endpoints (workflows.py)
  - Changed workflow parameter passing from positional args to kwargs (manager.py)
  - Default parameters now properly merged with user parameters

  ## Testing
  - ✅ Worker discovered AtherisFuzzingWorkflow
  - ✅ Workflow executed end-to-end successfully
  - ✅ Fuzz target auto-discovered in nested directories
  - ✅ Atheris ran 100,000 iterations
  - ✅ Results uploaded and cache cleaned

* chore: Complete Temporal migration with updated CLI/SDK/docs

  This commit includes all remaining Temporal migration changes.

  ## CLI Updates (cli/)
  - Updated workflow execution commands for Temporal
  - Enhanced error handling and exceptions
  - Updated dependencies in uv.lock

  ## SDK Updates (sdk/)
  - Client methods updated for Temporal workflows
  - Updated models for new workflow execution
  - Updated dependencies in uv.lock

  ## Documentation Updates (docs/)
  - Architecture documentation for Temporal
  - Workflow concept documentation
  - Resource management documentation (new)
  - Debugging guide (new)
  - Updated tutorials and how-to guides
  - Troubleshooting updates

  ## README Updates
  - Main README with Temporal instructions
  - Backend README
  - CLI README
  - SDK README

  ## Other
  - Updated IMPLEMENTATION_STATUS.md
  - Removed old vulnerable_app.tar.gz

  These changes complete the Temporal migration and ensure the CLI/SDK work correctly with the new backend.

* fix: Use positional args instead of kwargs for Temporal workflows

  The Temporal Python SDK's start_workflow() method doesn't accept a 'kwargs'
  parameter. Workflows must receive parameters as positional arguments via the
  'args' parameter.

  Changed to:
      args=workflow_args  # positional arguments

  This fixes the error:
      TypeError: Client.start_workflow() got an unexpected keyword argument 'kwargs'

  Workflows now correctly receive parameters in order:
  - security_assessment: [target_id, scanner_config, analyzer_config, reporter_config]
  - atheris_fuzzing: [target_id, target_file, max_iterations, timeout_seconds]
  - rust_test: [target_id, test_message]

* fix: Filter metadata-only parameters from workflow arguments

  SecurityAssessmentWorkflow was receiving 7 arguments instead of 2-5. The issue
  was that target_path and volume_mode from default_parameters were being passed
  to the workflow, when they should only be used by the system for configuration.

  Now filters out metadata-only parameters (target_path, volume_mode) before
  passing arguments to workflow execution.

* refactor: Remove Prefect leftovers and volume mounting legacy

  Complete cleanup of Prefect migration artifacts:

  Backend:
  - Delete registry.py and workflow_discovery.py (Prefect-specific files)
  - Remove Docker validation from setup.py (no longer needed)
  - Remove ResourceLimits and VolumeMount models
  - Remove target_path and volume_mode from WorkflowSubmission
  - Remove supported_volume_modes from API and discovery
  - Clean up metadata.yaml files (remove volume/path fields)
  - Simplify parameter filtering in manager.py

  SDK:
  - Remove volume_mode parameter from client methods
  - Remove ResourceLimits and VolumeMount models
  - Remove Prefect error patterns from docker_logs.py
  - Clean up WorkflowSubmission and WorkflowMetadata models

  CLI:
  - Remove Volume Modes display from workflow info

  All removed features are Prefect-specific or Docker volume mounting artifacts.
  Temporal workflows use MinIO storage exclusively.
* feat: Add comprehensive test suite and benchmark infrastructure

  - Add 68 unit tests for fuzzer, scanner, and analyzer modules
  - Implement pytest-based test infrastructure with fixtures
  - Add 6 performance benchmarks with category-specific thresholds
  - Configure GitHub Actions for automated testing and benchmarking
  - Add test and benchmark documentation

  Test coverage:
  - AtherisFuzzer: 8 tests
  - CargoFuzzer: 14 tests
  - FileScanner: 22 tests
  - SecurityAnalyzer: 24 tests

  All tests passing (68/68)
  All benchmarks passing (6/6)

* fix: Resolve all ruff linting violations across codebase

  Fixed 27 ruff violations in 12 files:
  - Removed unused imports (Depends, Dict, Any, Optional, etc.)
  - Fixed undefined workflow_info variable in workflows.py
  - Removed dead code with undefined variables in atheris_fuzzer.py
  - Changed f-string to regular string where no placeholders used

  All files now pass ruff checks for CI/CD compliance.

* fix: Configure CI for unit tests only

  - Renamed docker-compose.temporal.yaml → docker-compose.yml for CI compatibility
  - Commented out integration-tests job (no integration tests yet)
  - Updated test-summary to only depend on lint and unit-tests

  CI will now run successfully with 68 unit tests. Integration tests can be added later.

* feat: Add CI/CD integration with ephemeral deployment model

  Implements comprehensive CI/CD support for FuzzForge with on-demand worker management:

  **Worker Management (v0.7.0)**
  - Add WorkerManager for automatic worker lifecycle control
  - Auto-start workers from stopped state when workflows execute
  - Auto-stop workers after workflow completion
  - Health checks and startup timeout handling (90s default)

  **CI/CD Features**
  - `--fail-on` flag: Fail builds based on SARIF severity levels (error/warning/note/info)
  - `--export-sarif` flag: Export findings in SARIF 2.1.0 format
  - `--auto-start`/`--auto-stop` flags: Control worker lifecycle
  - Exit code propagation: Returns 1 on blocking findings, 0 on success

  **Exit Code Fix**
  - Add `except typer.Exit: raise` handlers at 3 critical locations
  - Move worker cleanup to finally block for guaranteed execution
  - Exit codes now propagate correctly even when build fails

  **CI Scripts & Examples**
  - ci-start.sh: Start FuzzForge services with health checks
  - ci-stop.sh: Clean shutdown with volume preservation option
  - GitHub Actions workflow example (security-scan.yml)
  - GitLab CI pipeline example (.gitlab-ci.example.yml)
  - docker-compose.ci.yml: CI-optimized compose file with profiles

  **OSS-Fuzz Integration**
  - New ossfuzz_campaign workflow for running OSS-Fuzz projects
  - OSS-Fuzz worker with Docker-in-Docker support
  - Configurable campaign duration and project selection

  **Documentation**
  - Comprehensive CI/CD integration guide (docs/how-to/cicd-integration.md)
  - Updated architecture docs with worker lifecycle details
  - Updated workspace isolation documentation
  - CLI README with worker management examples

  **SDK Enhancements**
  - Add get_workflow_worker_info() endpoint
  - Worker vertical metadata in workflow responses

  **Testing**
  - All workflows tested: security_assessment, atheris_fuzzing, secret_detection, cargo_fuzzing
  - All monitoring commands tested: stats, crashes, status, finding
  - Full CI pipeline simulation verified
  - Exit codes verified for success/failure scenarios

  Ephemeral CI/CD model: ~3-4GB RAM, ~60-90s startup, runs entirely in CI containers.

* fix: Resolve ruff linting violations in CI/CD code

  - Remove unused variables (run_id, defaults, result)
  - Remove unused imports
  - Fix f-string without placeholders

  All CI/CD integration files now pass ruff checks.
FuzzForge AI: Migration Decision Document
Date: 2025-10-01 (Updated)
Status: Architecture Revised - Ready for Implementation
Decision Makers: FuzzingLabs Team
Recommendation: Migrate to Temporal with Vertical Workers + MinIO
🔄 CRITICAL UPDATE (2025-10-01)
Initial analysis was incomplete. The original architecture document missed a critical requirement:
"Workflows are dynamic and have to be created without modifying the codebase"
What Changed
The original plan proposed "no registry needed" with long-lived workers, but failed to address how dynamic workflows with custom dependencies would work. This created a fundamental contradiction.
Revised Architecture
New approach: Vertical Workers + MinIO
| Aspect | Original Plan | Revised Plan |
|---|---|---|
| Workers | Generic long-lived | Vertical-specific (Android, Rust, Web, iOS, etc.) |
| Toolchains | Install per workflow | Pre-built per vertical |
| Workflows | Unclear | Mounted as volume (no rebuild) |
| Storage | LocalVolumeStorage (dev) / S3 (prod) | MinIO everywhere (unified) |
| Target Access | Host filesystem mounts | Upload to MinIO (secure) |
| Registry | Eliminated | Eliminated (workflows in volume, not images) |
| Services | 1 (Temporal only) | 6 (Temporal + MinIO + 3+ vertical workers) |
| Memory | "~4.5GB" | ~2.3GB (realistic calculation) |
Key Insights
- Dynamic workflows ARE compatible with long-lived workers via volume mounting
- Verticals solve the toolchain problem (pre-built, no per-workflow installs)
- MinIO is lightweight (256MB with CI_CD=true) and provides unified storage
- No registry overhead (workflow code mounted, not built into images)
- Better marketing (sell "security verticals", not "orchestration platform")
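To make the first insight above concrete, here is a minimal sketch of how a long-lived vertical worker could discover workflow code from a mounted volume at startup. The paths, task-queue name, and the "*Workflow" class-naming convention are illustrative assumptions rather than the actual FuzzForge implementation; only the temporalio calls are real SDK APIs.

```python
# Hypothetical sketch: discover workflow classes from a mounted volume and
# serve them from a long-lived vertical worker.
import asyncio
import importlib.util
import inspect
import sys
from pathlib import Path

from temporalio.client import Client
from temporalio.worker import Worker

WORKFLOWS_DIR = Path("/workflows")   # mounted volume (assumption)
TASK_QUEUE = "rust-vertical"         # one task queue per vertical (assumption)


def discover_workflows(root: Path) -> list[type]:
    """Import every workflow.py under the mounted volume and collect classes
    whose names end in 'Workflow' (assumed convention)."""
    found: list[type] = []
    for path in root.rglob("workflow.py"):
        spec = importlib.util.spec_from_file_location(path.parent.name, path)
        module = importlib.util.module_from_spec(spec)
        sys.modules[spec.name] = module
        spec.loader.exec_module(module)
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if obj.__module__ == module.__name__ and obj.__name__.endswith("Workflow"):
                found.append(obj)
    return found


async def main() -> None:
    client = await Client.connect("temporal:7233")  # compose service name (assumption)
    worker = Worker(
        client,
        task_queue=TASK_QUEUE,
        workflows=discover_workflows(WORKFLOWS_DIR),  # code from the volume, not the image
    )
    await worker.run()  # long-lived: keeps processing workflows until shut down


if __name__ == "__main__":
    asyncio.run(main())
```

Because the workflow code lives in the mounted volume, adding a new workflow only requires dropping a directory into the volume and restarting the worker, with no image rebuild or registry push.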
What This Means
- ✅ Migration still recommended
- ✅ Timeline extended to 10 weeks (from 8)
- ✅ More services but better architecture
- ✅ Addresses all original pain points
- ✅ Supports dynamic workflows correctly
See ARCHITECTURE.md v2.0 for full details.
Table of Contents
- Executive Summary
- Current State Analysis
- Proposed Solution: Temporal Migration
- For & Against: Temporal vs Prefect
- For & Against: Long-Lived vs Ephemeral Workers
- Future Consideration: Nomad vs Kubernetes vs Docker Compose
- Benefits Summary
- Risks & Mitigations
- Cost Analysis
- Timeline & Effort
- Licensing Considerations
- Recommendation
Executive Summary
The Proposal
Migrate from Prefect to Temporal for workflow orchestration, simplifying infrastructure from 6 services to 1 while maintaining module architecture and preparing for future scale.
Why Consider This?
Current Prefect setup has grown complex with:
- 6 services to manage (Prefect, Postgres, Redis, Registry, Docker-proxy, Worker)
- Unclear scaling path for high-volume production
- Registry overhead for module isolation
- Complex volume mounting configuration
Key Decision Points
| Decision | Recommendation | Timeline |
|---|---|---|
| Replace Prefect? | ✅ Yes - with Temporal | Now (Weeks 1-8) |
| Worker Strategy? | ✅ Long-lived containers | Now (Weeks 3-4) |
| Storage Strategy? | ✅ Abstract layer (Local→S3) | Now (Week 3) |
| Add Nomad? | ⏳ Later - when 10+ hosts | 18-24 months |
| Add Kubernetes? | ❌ No - unnecessary complexity | N/A |
Bottom Line
Recommended: Proceed with Temporal migration.
- Effort: 8 weeks, Medium complexity
- Risk: Low (rollback possible, modules unchanged)
- Benefit: 83% infrastructure reduction, clear scaling path, better reliability
Current State Analysis
Prefect Architecture (Current)
Infrastructure:
├─ Prefect Server (orchestration)
├─ Postgres (metadata storage)
├─ Redis (task queue)
├─ Docker Registry (image sharing)
├─ Docker Proxy (container isolation)
└─ Prefect Worker (execution)
Total: 6 services
Strengths of Current Setup
| Aspect | Rating | Notes |
|---|---|---|
| Familiarity | ✅ High | Team knows Prefect well |
| Functionality | ✅ Good | Workflows execute successfully |
| Module System | ✅ Excellent | BaseModule interface is solid |
| Documentation | ✅ Good | Internal docs exist |
Pain Points
| Issue | Impact | Frequency | Severity |
|---|---|---|---|
| Infrastructure Complexity | Managing 6 services | Continuous | High |
| Registry Overhead | Push/pull for every deployment | Every change | Medium |
| Unclear Scaling | How to go multi-host? | Future planning | High |
| Resource Usage | ~8GB under load | Continuous | Medium |
| Volume Mounting | Complex job_variables config | Every workflow | Medium |
Why Change Now?
- Planning for Scale: Need clear path from 1 host → multi-host → cluster
- Infrastructure Debt: 6 services growing harder to maintain
- Better Options Available: Temporal provides simpler, more scalable solution
- Module System Stable: Can migrate orchestration without touching modules
- Right Time: Before production scale makes migration harder
Proposed Solution: Temporal Migration
Target Architecture
Infrastructure:
├─ Temporal Server (orchestration + storage)
└─ Worker Pools (3 types, auto-discover modules)
Total: 1 service (+ workers)
Migration Phases
Phase 1: Single Host (Weeks 1-8)
- Replace Prefect with Temporal
- Long-lived worker pools
- LocalVolumeStorage (volume mounts)
- Capacity: 15-50 concurrent workflows
Phase 2: Multi-Host (Months 6-18)
- Same architecture, multiple hosts
- Switch to S3CachedStorage
- Capacity: 3× Phase 1
Phase 3: Nomad Cluster (Months 18+, if needed)
- Add Nomad for advanced orchestration
- Auto-scaling, multi-tenancy
- Capacity: Unlimited horizontal scaling
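The phase transitions above hinge on the storage abstraction: workflows call one interface, and the backend swaps the implementation when moving from a single host to shared object storage. Below is a hedged sketch reusing the LocalVolumeStorage / S3CachedStorage names from this document; the method names, default paths, bucket, and MinIO endpoint are assumptions.

```python
# Sketch of the storage abstraction the migration phases rely on.
import shutil
from abc import ABC, abstractmethod
from pathlib import Path

import boto3


class TargetStorage(ABC):
    @abstractmethod
    def put(self, local_path: Path, key: str) -> None: ...

    @abstractmethod
    def get(self, key: str, local_path: Path) -> None: ...


class LocalVolumeStorage(TargetStorage):
    """Phase 1: single host, targets live on a shared volume."""

    def __init__(self, root: Path = Path("/shared/targets")):
        self.root = root

    def put(self, local_path: Path, key: str) -> None:
        dest = self.root / key
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_path, dest)

    def get(self, key: str, local_path: Path) -> None:
        shutil.copy2(self.root / key, local_path)


class S3CachedStorage(TargetStorage):
    """Phase 2+: MinIO/S3 object store shared by all hosts."""

    def __init__(self, bucket: str = "targets", endpoint: str = "http://minio:9000"):
        self.bucket = bucket
        # Credentials come from the environment in this sketch.
        self.s3 = boto3.client("s3", endpoint_url=endpoint)

    def put(self, local_path: Path, key: str) -> None:
        self.s3.upload_file(str(local_path), self.bucket, key)

    def get(self, key: str, local_path: Path) -> None:
        self.s3.download_file(self.bucket, key, str(local_path))
```

Workflow code only ever sees `TargetStorage`, so switching from Phase 1 to Phase 2 is a configuration change rather than a workflow rewrite.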
For & Against: Temporal vs Prefect
Option A: Keep Prefect (Status Quo)
✅ For (Arguments to Keep Prefect)
- No Migration Effort
  - Zero weeks of migration work
  - No learning curve
  - No risk of migration issues
- Team Familiarity
  - Team knows Prefect well
  - Existing operational runbooks
  - Established debugging patterns
- Working System
  - Current workflows function correctly
  - No immediate technical blocker
  - "If it ain't broke, don't fix it"
- Deferred Complexity
  - Can delay architecture decisions
  - Focus on feature development
  - Postpone infrastructure changes
❌ Against (Arguments Against Keeping Prefect)
- Infrastructure Complexity
  - 6 services to manage and monitor
  - Complex dependencies (Postgres, Redis, Registry)
  - High operational overhead
- Scaling Uncertainty
  - Unclear how to scale beyond single host
  - Registry becomes bottleneck at scale
  - No clear multi-host story
- Resource Inefficiency
  - ~2GB idle, ~8GB under load
  - Registry storage overhead
  - Redundant service layers
- Technical Debt Accumulation
  - Complexity will only increase
  - Harder to migrate later (more workflows)
  - Missing modern features (durable execution)
- Prefect Ecosystem Concerns
  - Prefect 3.x changes from 2.x
  - Community split (Cloud vs self-hosted)
  - Uncertain long-term roadmap
Option B: Migrate to Temporal (Recommended)
✅ For (Arguments to Migrate)
- Dramatic Simplification
  - 6 services → 1 service (83% reduction)
  - No registry needed (local images)
  - Simpler volume mounting
- Better Reliability
  - Durable execution (workflows survive crashes)
  - Built-in state persistence
  - Proven at massive scale (Netflix, Uber, Snap)
- Clear Scaling Path
  - Single host → Multi-host → Nomad cluster
  - Architecture designed for scale
  - Storage abstraction enables seamless transition
- Superior Workflow Engine
  - True durable execution vs task queue
  - Better state management
  - Handles long-running workflows (fuzzing campaigns)
  - Activity timeouts and retries built-in
- Operational Benefits
  - Better Web UI for debugging
  - Comprehensive workflow history
  - Query workflow state at any time
  - Simpler deployment (single service)
- Future-Proof Architecture
  - Easy Nomad migration path (18+ months)
  - Multi-tenancy ready (namespaces)
  - Auto-scaling capable
  - Industry momentum (growing adoption)
- Module Preservation
  - Zero changes to BaseModule interface
  - Module discovery unchanged
  - Workflows adapt easily (@flow → @workflow; see the sketch after this list)
- Resource Efficiency
  - ~1GB idle, ~4.5GB under load
  - 44% reduction in resource usage
  - No registry storage overhead
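As referenced in the Module Preservation point above, the following is a rough sketch of what the @flow → @workflow conversion looks like in the Temporal Python SDK, with per-activity timeouts and retries declared at the call site. The workflow and activity here are simplified stand-ins, not the real security_assessment code.

```python
# Illustrative Temporal shape that a Prefect @flow would migrate to.
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def run_scanner(target_id: str) -> dict:
    # In FuzzForge this would invoke a BaseModule implementation; stubbed here.
    return {"target_id": target_id, "findings": []}


@workflow.defn
class SecurityAssessmentWorkflow:
    @workflow.run
    async def run(self, target_id: str) -> dict:
        # Timeouts and retries are declared per activity call instead of
        # being hand-rolled inside the flow body.
        return await workflow.execute_activity(
            run_scanner,
            target_id,
            start_to_close_timeout=timedelta(minutes=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
```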
❌ Against (Arguments Against Migration)
- Migration Effort
  - 8 weeks of focused work
  - Team capacity diverted from features
  - Testing and validation required
- Learning Curve
  - New concepts (workflows vs activities)
  - Different debugging approach
  - Team training needed
- Migration Risk
  - Potential for workflow disruption
  - Bugs in migration code
  - Temporary performance issues
- Unknown Unknowns
  - May discover edge cases
  - Performance characteristics differ
  - Integration challenges possible
- Temporal Limitations
  - Less mature than Prefect in some areas
  - Smaller community (growing)
  - Fewer pre-built integrations
Scoring Matrix
| Criteria | Weight | Prefect | Temporal | Winner |
|---|---|---|---|---|
| Infrastructure Complexity | 25% | 3/10 | 9/10 | Temporal |
| Scalability | 20% | 4/10 | 9/10 | Temporal |
| Reliability | 20% | 7/10 | 10/10 | Temporal |
| Migration Effort | 15% | 10/10 | 4/10 | Prefect |
| Team Familiarity | 10% | 9/10 | 3/10 | Prefect |
| Resource Efficiency | 10% | 5/10 | 8/10 | Temporal |
| Total | 100% | 5.85/10 | 7.75/10 | Temporal |
Conclusion: Temporal wins on technical merit despite migration costs.
For & Against: Long-Lived vs Ephemeral Workers
Context
Workers can spawn ephemeral containers per workflow (like Prefect) or run as long-lived containers processing multiple workflows.
Option A: Ephemeral Containers
✅ For
- Complete Isolation
  - Each workflow in fresh container
  - No state leakage between workflows
  - Maximum security
- Automatic Cleanup
  - Containers destroyed after workflow
  - No resource leaks
  - Clean slate every time
- Matches Current Behavior
  - Similar to Prefect approach
  - Easier mental model
  - Less architecture change
- Simple Development
  - Test with docker run
  - No complex lifecycle management
  - Easy to debug
❌ Against
- Performance Overhead
  - 5 second startup per container
  - At 450 workflows/hour: 625 minutes wasted
  - Unacceptable at production scale
- Resource Churn
  - Constant container creation/destruction
  - Docker daemon overhead
  - Network/volume setup repeated
- Scaling Limitations
  - Can't handle high-volume workloads
  - Startup overhead compounds
  - Poor resource utilization
Option B: Long-Lived Workers (Recommended)
✅ For
- Zero Startup Overhead
  - Containers already running
  - Immediate workflow execution
  - Critical for high-volume production
- Resource Efficiency
  - Fixed 4.5GB RAM handles 15 concurrent workflows
  - vs ~76GB for ephemeral approach
  - 10-20× better resource utilization
- Predictable Performance
  - Consistent response times
  - No container startup jitter
  - Better SLA capability
- Horizontal Scaling
  - Add more workers linearly
  - Each worker handles N concurrent
  - Clear capacity planning
- Production-Ready
  - Proven pattern (Uber, Airbnb)
  - Handles thousands of workflows/day
  - Industry standard for scale
❌ Against
- Volume Mounting Complexity
  - Must mount parent directories
  - Or implement S3 storage backend
  - More sophisticated configuration
- Shared Container State
  - Workers reused across workflows
  - Potential for subtle bugs
  - Requires careful module design
- Lifecycle Management
  - Must handle worker restarts
  - Graceful shutdown needed
  - More complex monitoring
- Memory Management
  - Workers accumulate memory over time
  - Need periodic restarts
  - Requires memory limits (see the sketch after this list)
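A minimal sketch of the lifecycle controls mentioned above, assuming the temporalio worker options for concurrency caps and graceful shutdown; the task-queue name, limits, and placeholder activity are illustrative, and hard memory limits plus daily restarts would still come from the container runtime.

```python
# Hedged sketch: a long-lived worker with a concurrency cap and a graceful
# shutdown window for in-flight activities.
import asyncio
from datetime import timedelta

from temporalio import activity
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def noop() -> None:
    """Placeholder; a real worker registers its discovered workflows/activities."""


async def main() -> None:
    client = await Client.connect("temporal:7233")
    worker = Worker(
        client,
        task_queue="rust-vertical",                       # illustrative queue name
        activities=[noop],                                # real code registered here
        max_concurrent_activities=15,                     # matches the ~15 concurrent figure above
        graceful_shutdown_timeout=timedelta(seconds=30),  # drain in-flight work on stop
    )
    # Runs until worker.shutdown() is called (e.g. from a signal handler).
    await worker.run()


asyncio.run(main())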
Decision Matrix
| Scenario | Ephemeral | Long-Lived | Winner |
|---|---|---|---|
| Development | ✅ Simpler | ⚠️ Complex | Ephemeral |
| Low Volume (<10/hour) | ✅ Acceptable | ✅ Overkill | Ephemeral |
| Medium Volume (10-100/hour) | ⚠️ Wasteful | ✅ Efficient | Long-Lived |
| High Volume (>100/hour) | ❌ Unusable | ✅ Required | Long-Lived |
| Production Scale | ❌ No | ✅ Yes | Long-Lived |
Recommendation: Long-lived workers for production deployment.
Compromise: Can start with ephemeral for Phase 1 (proof of concept), migrate to long-lived for Phase 2 (production).
Future Consideration: Nomad vs Kubernetes vs Docker Compose
When to Consider Orchestration Beyond Docker Compose?
Trigger Points:
- ✅ Managing 10+ hosts manually
- ✅ Need multi-tenancy (customer isolation)
- ✅ Require auto-scaling based on metrics
- ✅ Want sophisticated scheduling (bin-packing, constraints)
Timeline Estimate: 18-24 months from now
Option A: Docker Compose (Recommended for Phase 1-2)
✅ For
- Simplicity
  - Single YAML file
  - No cluster setup
  - Easy to understand and debug
- Zero Learning Curve
  - Team already knows Docker
  - Familiar commands
  - Abundant documentation
- Sufficient for 1-5 Hosts
  - Deploy same compose file to each host
  - Manual but manageable
  - Works for current scale
- Development Friendly
  - Same config dev and prod
  - Fast iteration cycle
  - Easy local testing
- No Lock-In
  - Easy to migrate to Nomad/K8s later
  - Workers portable by design
  - Clean exit strategy
❌ Against
- Manual Coordination
  - No automatic scheduling
  - Manual load balancing
  - No health-based rescheduling
- Limited Scaling
  - Practical limit ~5-10 hosts
  - No auto-scaling
  - Manual capacity planning
- No Multi-Tenancy
  - Can't isolate customers
  - No resource quotas
  - Shared infrastructure
- Basic Monitoring
  - No cluster-wide metrics
  - Per-host monitoring only
  - Limited observability
Verdict: Perfect for Phase 1 (single host) and Phase 2 (3-5 hosts). Transition to Nomad/K8s at Phase 3.
Option B: Nomad (Recommended for Phase 3)
✅ For
- Operational Simplicity
  - Single binary (vs K8s complexity)
  - Easy to install and maintain
  - Lower operational overhead
- Perfect Fit for Use Case
  - Batch workload focus
  - Resource management built-in
  - Namespace support for multi-tenancy
- Multi-Workload Support
  - Containers (Docker)
  - VMs (QEMU)
  - Bare processes
  - Java JARs
  - All in one scheduler
- Scheduling Intelligence
  - Bin-packing for efficiency
  - Constraint-based placement
  - Affinity/anti-affinity rules
  - Resource quotas per namespace
- Easy Migration from Docker Compose
  - Similar concepts
  - compose-to-nomad converter tool
  - Workers unchanged
  - 1-2 week migration
- HashiCorp Ecosystem
  - Integrates with Consul (service discovery)
  - Integrates with Vault (secrets)
  - Proven at scale (Cloudflare, CircleCI)
- Auto-Scaling
  - Built-in scaling policies
  - Prometheus integration
  - Queue-depth based scaling
  - Horizontal scaling automatic
❌ Against
- Learning Curve
  - HCL syntax to learn
  - New concepts (allocations, deployments)
  - Consul integration complexity
- Smaller Ecosystem
  - Fewer tools than Kubernetes
  - Smaller community
  - Fewer third-party integrations
- Network Isolation
  - Less sophisticated than K8s
  - Requires Consul Connect for service mesh
  - Weaker network policies
- Maturity
  - Less mature than Kubernetes
  - Fewer production battle stories
  - Evolving feature set
Verdict: Excellent choice once you outgrow Docker Compose. Simpler than K8s and a good fit for FuzzForge's scale.
Option C: Kubernetes
✅ For
- Industry Standard
  - Largest ecosystem
  - Most third-party integrations
  - Abundant expertise available
- Feature Richness
  - Sophisticated networking (Network Policies)
  - Advanced scheduling
  - Rich operator ecosystem
  - Helm charts for everything
- Multi-Tenancy
  - Strong namespace isolation
  - Fine-grained RBAC
  - Network policies
  - Pod Security Policies
- Massive Scale
  - Proven to 5,000+ nodes
  - Google-scale reliability
  - Battle-tested
- Cloud Integration
  - Native on all clouds (EKS, GKE, AKS)
  - Managed offerings reduce complexity
  - Auto-scaling (HPA, Cluster Autoscaler)
❌ Against
- Operational Complexity
  - High learning curve
  - Complex to set up and maintain
  - Requires dedicated ops team
- Resource Overhead
  - Control plane resource usage
  - etcd cluster management
  - More moving parts
- Overkill for Use Case
  - FuzzForge is a batch workload, not microservices
  - Don't need K8s networking complexity
  - Simpler alternatives sufficient
- Container-Only
  - Can't run VMs easily
  - Can't run bare processes
  - Nomad more flexible
- Cost
  - Higher operational cost
  - More infrastructure required
  - Steeper learning investment
Verdict: Overkill for FuzzForge. Choose only if planning 1,000+ hosts or need extensive ecosystem.
Comparison Matrix
| Feature | Docker Compose | Nomad | Kubernetes |
|---|---|---|---|
| Operational Complexity | ★☆☆☆☆ (Lowest) | ★★☆☆☆ (Low) | ★★★★☆ (High) |
| Learning Curve | ★☆☆☆☆ (Easy) | ★★★☆☆ (Medium) | ★★★★★ (Steep) |
| Setup Time | Minutes | 1 day | 1-2 weeks |
| Best For | 1-5 hosts | 10-500 hosts | 500+ hosts |
| Auto-Scaling | ❌ No | ✅ Yes | ✅ Yes |
| Multi-Tenancy | ❌ No | ✅ Yes (Namespaces) | ✅ Yes (Advanced) |
| Workload Types | Containers | Containers + VMs + Processes | Containers (mainly) |
| Service Mesh | ❌ No | ⚠️ Via Consul Connect | ✅ Istio/Linkerd |
| Ecosystem Size | Medium | Small | Huge |
| Resource Efficiency | High | High | Medium |
| FuzzForge Fit | ✅ Phase 1-2 | ✅ Phase 3+ | ⚠️ Unnecessary |
Recommendation Timeline
Months 0-6: Docker Compose (Single Host)
└─ Simplest, fastest to implement
Months 6-18: Docker Compose (Multi-Host)
└─ Scale to 3-5 hosts manually
Months 18+: Nomad (if needed)
└─ Add when 10+ hosts or auto-scaling required
Never: Kubernetes
└─ Unless scale exceeds 500+ hosts
Benefits Summary
Infrastructure Benefits
| Metric | Current (Prefect) | Future (Temporal) | Improvement |
|---|---|---|---|
| Services to Manage | 6 | 1 | 83% reduction |
| Idle Memory Usage | ~2GB | ~1GB | 50% reduction |
| Load Memory Usage | ~8GB | ~4.5GB | 44% reduction |
| Docker Registry | Required | Not needed | Eliminated |
| Configuration Files | 6 service configs | 1 config | 83% simpler |
| Deployment Complexity | High | Low | Significant |
Operational Benefits
- Simpler Monitoring
  - 1 service vs 6
  - Single Web UI (Temporal)
  - Fewer alerts to configure
- Easier Debugging
  - Complete workflow history in Temporal
  - Query workflow state at any time (see the sketch after this list)
  - Better error visibility
- Faster Deployments
  - No registry push/pull
  - Restart 1 service vs 6
  - Quicker iteration cycles
- Better Reliability
  - Durable execution (workflows survive crashes)
  - Automatic retries built-in
  - State persistence guaranteed
- Clear Scaling Path
  - Phase 1: Single host (now)
  - Phase 2: Multi-host (6-18 months)
  - Phase 3: Nomad cluster (18+ months)
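For the "query workflow state at any time" point above, here is a small sketch using the Temporal Python client; the workflow ID format is an assumption.

```python
# Inspect a running or completed workflow by ID from any backend process.
import asyncio

from temporalio.client import Client


async def inspect(run_id: str) -> None:
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle(f"security-assessment-{run_id}")  # ID format assumed

    desc = await handle.describe()
    print("status:", desc.status)   # e.g. RUNNING, COMPLETED, FAILED

    result = await handle.result()  # waits until the workflow finishes
    print("result:", result)


asyncio.run(inspect("run-123"))
```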
Developer Experience Benefits
- Local Development
  - Simpler docker-compose
  - Faster startup (fewer services)
  - Easier to reason about
- Module Development
  - No changes to BaseModule
  - Same discovery mechanism
  - Same testing approach
- Workflow Development
  - Better debugging tools (Temporal Web UI)
  - Workflow history visualization
  - Easier to test retry logic (see the sketch after this list)
- Onboarding
  - 1 service to understand vs 6
  - Clearer architecture
  - Less to learn
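For the retry-logic point above, a hedged sketch using the SDK's time-skipping test environment, which lets retry and backoff behaviour run in seconds. It reuses the workflow and activity from the earlier conversion sketch and assumes pytest-asyncio is installed.

```python
# Sketch: exercise a workflow (and its retry policies) against the
# time-skipping test environment instead of a real Temporal cluster.
import pytest
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

# SecurityAssessmentWorkflow and run_scanner come from the earlier sketch.


@pytest.mark.asyncio
async def test_security_assessment_workflow():
    async with await WorkflowEnvironment.start_time_skipping() as env:
        async with Worker(
            env.client,
            task_queue="test-queue",
            workflows=[SecurityAssessmentWorkflow],
            activities=[run_scanner],
        ):
            result = await env.client.execute_workflow(
                SecurityAssessmentWorkflow.run,
                "target-1",
                id="test-run",
                task_queue="test-queue",
            )
            assert "findings" in result
```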
Risks & Mitigations
Risk 1: Migration Introduces Bugs
Likelihood: Medium. Impact: High. Risk Score: 6/10.
Mitigation:
- Phased migration (one workflow at a time)
- Parallel run (Prefect + Temporal) during transition
- Comprehensive testing before cutover
- Rollback plan documented
Risk 2: Performance Degradation
Likelihood: Low. Impact: Medium. Risk Score: 3/10.
Mitigation:
- Load testing before production
- Monitor key metrics during migration
- Temporal proven at higher scale than current
- Easy to tune worker concurrency
Risk 3: Team Learning Curve
Likelihood: High. Impact: Low. Risk Score: 4/10.
Mitigation:
- Training sessions on Temporal concepts
- Pair programming during migration
- Comprehensive documentation
- Temporal has excellent docs
Risk 4: Unknown Edge Cases
Likelihood: Medium. Impact: Medium. Risk Score: 5/10.
Mitigation:
- Thorough testing with real workflows
- Gradual rollout (dev → staging → production)
- Keep Prefect running initially
- Community support available
Risk 5: Module System Incompatibility
Likelihood: Very Low. Impact: High. Risk Score: 2/10.
Mitigation:
- Module interface preserved (BaseModule unchanged)
- Only orchestration changes
- Modules are decoupled from Prefect
- Test suite validates module behavior
Risk 6: Long-Lived Worker Stability
Likelihood: Low. Impact: Medium. Risk Score: 3/10.
Mitigation:
- Proper resource limits (memory, CPU)
- Periodic worker restarts (daily)
- Monitoring for memory leaks
- Health checks and auto-restart
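One possible way to implement the restart and memory-leak mitigations above: a watchdog coroutine that shuts the worker down cleanly once memory or uptime limits are exceeded, leaving the container restart policy to bring up a fresh process. The thresholds and the psutil dependency are assumptions, not part of the current codebase.

```python
# Illustrative watchdog for a long-lived worker process.
import asyncio
import os
import time

import psutil

MAX_RSS_BYTES = 2 * 1024**3      # 2 GiB soft limit (assumption)
MAX_UPTIME_SECONDS = 24 * 3600   # daily restart, as suggested above


async def watchdog(worker) -> None:
    """Shut the temporalio Worker down once memory or uptime limits are hit."""
    start = time.monotonic()
    proc = psutil.Process(os.getpid())
    while True:
        await asyncio.sleep(60)
        too_big = proc.memory_info().rss > MAX_RSS_BYTES
        too_old = time.monotonic() - start > MAX_UPTIME_SECONDS
        if too_big or too_old:
            await worker.shutdown()  # drains in-flight activities, then run() returns
            return
```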
Overall Risk Assessment
Total Risk Score: 23/60 (38%) - Medium-Low Risk
Conclusion: Risks are manageable with proper planning and mitigation strategies.
Cost Analysis
Current Costs (Prefect)
Infrastructure:
Single Host (8GB RAM, 4 CPU):
- Cloud VM: $80-120/month
- Or bare metal amortized: ~$50/month
Services Running:
- Prefect Server: ~500MB
- Postgres: ~200MB
- Redis: ~100MB
- Registry: ~500MB
- Docker Proxy: ~50MB
- Worker: ~500MB
- Workflows: ~6GB (peak)
Total: ~8GB
Development Time:
- Maintenance: ~2 hours/week
- Debugging: ~3 hours/week
- Deployments: ~1 hour/week
Total: 6 hours/week = $600/month (at $25/hour)
Monthly Total: ~$700/month
Future Costs (Temporal)
Phase 1 - Single Host:
Single Host (6GB RAM, 4 CPU):
- Cloud VM: $60-80/month
- Or bare metal amortized: ~$40/month
Services Running:
- Temporal: ~1GB
- Workers: ~3.5GB
- Workflows: ~1GB (peak)
Total: ~5.5GB
Development Time:
- Maintenance: ~1 hour/week
- Debugging: ~2 hours/week
- Deployments: ~0.5 hour/week
Total: 3.5 hours/week = $350/month
Monthly Total: ~$430/month
Phase 2 - Multi-Host (3 hosts):
3 Hosts + S3 Storage:
- Cloud VMs: $180-240/month
- S3 storage (1TB): ~$23/month
- S3 transfer (100GB): ~$9/month
Development Time:
- Maintenance: ~2 hours/week
- Monitoring: ~2 hours/week
Total: 4 hours/week = $400/month
Monthly Total: ~$670/month (3× capacity)
Phase 3 - Nomad Cluster (10+ hosts):
Nomad Cluster:
- 3 Nomad servers: $120/month
- 10 worker hosts: $800/month
- S3 storage (5TB): ~$115/month
- Load balancer: ~$20/month
Development Time:
- Nomad maintenance: ~3 hours/week
- Monitoring: ~3 hours/week
Total: 6 hours/week = $600/month
Monthly Total: ~$1,655/month (10× capacity)
Cost Comparison
| Phase | Hosts | Capacity | Monthly Cost | Cost per Workflow |
|---|---|---|---|---|
| Current (Prefect) | 1 | 10K/day | $700 | $0.0023 |
| Phase 1 (Temporal) | 1 | 10K/day | $430 | $0.0014 |
| Phase 2 (Temporal) | 3 | 30K/day | $670 | $0.0007 |
| Phase 3 (Nomad) | 10 | 100K/day | $1,655 | $0.0005 |
Savings:
- Phase 1 vs Current: $270/month (39% reduction)
- Better cost efficiency as scale increases
Timeline & Effort
Phase 1: Temporal Migration (8 Weeks)
Week 1-2: Foundation
- Deploy Temporal server
- Remove Prefect infrastructure
- Implement storage abstraction layer
- Effort: 60-80 hours
Week 3-4: Workers
- Create long-lived worker pools
- Implement module auto-discovery
- Configure Docker Compose
- Effort: 60-80 hours
Week 5-6: Workflows
- Migrate workflows to Temporal
- Convert @flow → @workflow.defn
- Test all workflows
- Effort: 60-80 hours
Week 7: Integration
- Update backend API
- End-to-end testing
- Load testing
- Effort: 40-60 hours
Week 8: Documentation & Cleanup
- Update documentation
- Remove old code
- Training sessions
- Effort: 30-40 hours
Total Effort: 250-340 hours (~2 engineers for 2 months)
Phase 2: Multi-Host (When Needed)
Effort: 40-60 hours
- Set up S3 storage
- Deploy to multiple hosts
- Configure load balancing
- Test and validate
Phase 3: Nomad (If Needed)
Effort: 80-120 hours
- Install Nomad cluster
- Convert jobs to Nomad
- Set up auto-scaling
- Production deployment
Licensing Considerations
Overview
Critical Context: FuzzForge is a generic platform where modules and workflows "could be anything" - not limited to fuzzing or security analysis. This significantly impacts the licensing assessment, particularly for Nomad's Business Source License.
Temporal Licensing: ✅ SAFE
License: MIT License
Status: Fully open source, zero restrictions
Commercial Use:
- ✅ Use in production
- ✅ Sell services built on Temporal
- ✅ Modify source code
- ✅ Redistribute
- ✅ Sublicense
- ✅ Private use
Conclusion: Temporal has no licensing concerns for any use case. You can build any type of platform (fuzzing, security, generic workflows, orchestration-as-a-service) without legal risk.
Reference: https://github.com/temporalio/temporal/blob/master/LICENSE
Nomad Licensing: ⚠️ REQUIRES CAREFUL EVALUATION
License: Business Source License 1.1 (BSL 1.1)
Status: Source-available but with restrictions
BSL 1.1 Key Terms
Change Date: 4 years after each version release
Change License: Mozilla Public License 2.0 (MPL 2.0)
After 4 years: Each version becomes fully open source under MPL 2.0
The Critical Restriction
Additional Use Grant:
You may make use of the Licensed Work, provided that you do not use
the Licensed Work for a Competitive Offering.
A "Competitive Offering" is a commercial product or service that is:
1. Substantially similar to the capabilities of the Licensed Work
2. Offered to third parties on a paid or free basis
What This Means for FuzzForge
The licensing risk depends on how FuzzForge is marketed and positioned:
✅ LIKELY SAFE: Specific Use Case Platform
If FuzzForge is marketed as a specialized platform for specific domains:
Examples:
- ✅ "FuzzForge - Security Analysis Platform"
- ✅ "FuzzForge - Automated Fuzzing Service"
- ✅ "FuzzForge - Code Analysis Tooling"
- ✅ "FuzzForge - Vulnerability Assessment Platform"
Why Safe:
- Nomad is used internally for infrastructure
- Customer is buying fuzzing/security services, not orchestration
- Platform's value is the domain expertise, not the scheduler
- Not competing with HashiCorp's offerings
⚠️ GRAY AREA: Generic Workflow Platform
If FuzzForge pivots to emphasize generic workflow capabilities:
Examples:
- ⚠️ "FuzzForge - Workflow Orchestration Platform"
- ⚠️ "FuzzForge - Run any containerized workload"
- ⚠️ "FuzzForge - Generic task scheduler"
- ⚠️ Marketing that emphasizes "powered by Nomad"
Why Risky:
- Could be seen as competing with Nomad Enterprise
- Offering similar capabilities to HashiCorp's products
- Customer might use it as Nomad replacement
❌ CLEARLY VIOLATES: Orchestration-as-a-Service
If FuzzForge becomes primarily an orchestration product:
Examples:
- ❌ "FuzzForge Orchestrator - Schedule any workload"
- ❌ "Nomad-as-a-Service powered by FuzzForge"
- ❌ "Generic container orchestration platform"
- ❌ Reselling Nomad capabilities with thin wrapper
Why Violation:
- Directly competing with HashiCorp Nomad offerings
- "Substantially similar" to Nomad's capabilities
- Commercial offering of orchestration
Real-World Precedents
HashiCorp has NOT (as of 2025) aggressively enforced BSL against companies using their tools internally. The restriction targets:
- Cloud providers offering "managed Nomad" services
- Companies building Nomad competitors
- Vendors reselling HashiCorp functionality
NOT targeting:
- Companies using Nomad for internal infrastructure
- SaaS platforms that happen to use Nomad
- Domain-specific platforms (like FuzzForge's security focus)
Decision Tree: Should I Use Nomad?
┌─────────────────────────────────────┐
│ Is orchestration your core product? │
└─────────────────────────────────────┘
│
┌────────┴────────┐
│ │
YES NO
│ │
┌────┴────┐ ┌────┴────┐
│ DON'T │ │ What's │
│ USE │ │ your │
│ NOMAD │ │ value │
│ │ │ prop? │
└─────────┘ └─────┬────┘
│
┌───────────┴───────────┐
│ │
Domain Expertise Orchestration Features
(Fuzzing, Security) (Scheduling, Auto-scale)
│ │
┌────┴────┐ ┌────┴────┐
│ SAFE TO │ │ RISKY - │
│ USE │ │ CONSULT │
│ NOMAD │ │ LAWYER │
└─────────┘ └─────────┘
FuzzForge Current Position
Current Positioning: Domain-specific security/analysis platform
Nomad Usage: Internal infrastructure (not customer-facing)
Risk Level: LOW (likely safe)
However, the team has stated that "modules and workflows could be anything", which suggests potential future expansion beyond the security domain.
If FuzzForge pivots to generic platform:
- Risk increases from LOW → MEDIUM
- Need legal review before Phase 3 (Nomad migration)
- Consider Kubernetes as alternative
Kubernetes Licensing: ✅ SAFE
License: Apache License 2.0
Status: Fully open source, zero restrictions
Commercial Use:
- ✅ Use in production
- ✅ Sell services built on Kubernetes
- ✅ Modify source code
- ✅ Offer managed Kubernetes (AWS EKS, GCP GKE do this)
- ✅ Build competitive offerings
Conclusion: Kubernetes has no licensing concerns whatsoever, even for orchestration-as-a-service offerings.
Docker Licensing: ✅ SAFE
License: Apache License 2.0
Status: Fully open source
Note: Docker Desktop has separate commercial licensing requirements for organizations >250 employees or >$10M revenue, but Docker Engine (which FuzzForge uses) remains free for all uses.
Licensing Recommendation Matrix
| Component | License | FuzzForge Risk | Recommendation |
|---|---|---|---|
| Temporal | MIT | ✅ None | Use freely |
| Docker Engine | Apache 2.0 | ✅ None | Use freely |
| Nomad | BSL 1.1 | ⚠️ Low-Medium | Safe if domain-specific |
| Kubernetes | Apache 2.0 | ✅ None | Safe alternative to Nomad |
Recommendations by Phase
Phase 1 & 2: Temporal + Docker Compose
Licenses: MIT (Temporal) + Apache 2.0 (Docker)
Risk: ✅ ZERO - Fully safe for any use case
Action: Proceed without legal review required
Phase 3: Adding Nomad (18+ months)
License: BSL 1.1
Risk: ⚠️ LOW-MEDIUM - Depends on positioning
Action Required BEFORE Migration:
- Clarify Product Positioning
  - Will FuzzForge be marketed as a generic platform?
  - Or remain domain-specific (security/fuzzing)?
- Legal Review (Recommended)
  - Consult an IP lawyer familiar with BSL
  - Show marketing materials and website copy
  - Get a written opinion on BSL compliance
  - Cost: $2,000-5,000 (one-time)
- Decision Point:
  - IF positioning = domain-specific (security/fuzzing), THEN proceed with Nomad (low risk)
  - ELSE IF positioning = generic platform, THEN consider Kubernetes instead (zero risk)
Alternative: Use Kubernetes Instead of Nomad
If concerned about Nomad BSL risk:
Pros:
- ✅ Zero licensing risk (Apache 2.0)
- ✅ Can offer orchestration-as-a-service freely
- ✅ Larger ecosystem and community
- ✅ Managed offerings on all clouds
Cons:
- ❌ Higher operational complexity than Nomad
- ❌ Overkill for batch workload use case
- ❌ Steeper learning curve
When to Choose K8s Over Nomad:
- Planning to market as generic platform
- Uncomfortable with BSL restrictions
- Need absolute licensing certainty
- Have K8s expertise already
Licensing Risk Summary
| Scenario | Temporal | Docker | Nomad | Kubernetes |
|---|---|---|---|---|
| Security platform (current) | ✅ Safe | ✅ Safe | ✅ Safe | ✅ Safe |
| Generic workflow platform | ✅ Safe | ✅ Safe | ⚠️ Risky | ✅ Safe |
| Orchestration-as-a-service | ✅ Safe | ✅ Safe | ❌ Violation | ✅ Safe |
Key Takeaways
- Temporal is completely safe: the MIT license has zero restrictions for any use case
- Nomad's BSL depends on positioning:
  - ✅ Safe for domain-specific platforms (security, fuzzing)
  - ⚠️ Risky for generic workflow platforms
  - ❌ Violation for orchestration-as-a-service
- The team's own framing matters: "modules could be anything" suggests generic-platform potential, which increases Nomad risk
- Mitigation strategies:
  - Keep marketing focused on domain expertise
  - Get legal review before Phase 3 (Nomad)
  - Alternative: Use Kubernetes (Apache 2.0) instead
- Decision timing: No urgency, the Nomad decision is 18+ months away (Phase 3)
- Recommended approach:
  - Now → Phase 1-2: Temporal + Docker Compose (zero risk)
  - 18 months → Phase 3: Re-evaluate positioning
    - Domain-specific? Use Nomad
    - Generic platform? Use Kubernetes
Recommendation
Primary Recommendation: PROCEED WITH TEMPORAL MIGRATION
Confidence Level: High (8/10)
Rationale
- Technical Benefits Outweigh Costs
  - 83% infrastructure reduction
  - 44% resource savings
  - Clear scaling path
  - Better reliability
- Manageable Risks
  - Low-medium risk profile
  - Good mitigation strategies
  - Rollback plan exists
  - Module system preserved
- Right Timing
  - Before production scale makes migration harder
  - Team capacity available
  - Module architecture stable
  - Clear 8-week timeline
- Future-Proof
  - Easy Nomad migration when needed
  - Multi-host ready (storage abstraction)
  - Industry-proven technology
  - Growing ecosystem
Phased Approach
Immediate (Now):
- ✅ Approve Temporal migration
- ✅ Allocate 2 engineers for 8 weeks
- ✅ Set Week 1 start date
Near-Term (Months 1-6):
- ✅ Complete Temporal migration
- ✅ Validate in production
- ✅ Optimize performance
Mid-Term (Months 6-18):
- ⏳ Monitor scaling needs
- ⏳ Implement S3 storage if needed
- ⏳ Expand to multi-host if needed
Long-Term (Months 18+):
- ⏳ Evaluate Nomad necessity
- ⏳ Migrate to Nomad if triggers met
- ⏳ Continue scaling horizontally
Decision Criteria
Proceed with Migration if:
- ✅ Team agrees on benefits (CHECK)
- ✅ 8-week timeline acceptable (CHECK)
- ✅ Resources available (CHECK)
- ✅ Risk profile acceptable (CHECK)
Defer Migration if:
- ❌ Critical features launching soon (DEPENDS)
- ❌ Team capacity constrained (DEPENDS)
- ❌ Major Prefect improvements announced (UNLIKELY)
Alternative: Start Smaller
If full migration seems risky:
- Proof of Concept (2 weeks)
  - Migrate one simple workflow
  - Validate Temporal locally
  - Assess complexity
  - Decision point: Continue or abort
- Parallel Run (4 weeks)
  - Run Temporal alongside Prefect
  - Duplicate one workflow
  - Compare results
  - Build confidence
- Full Migration (6 weeks)
  - If POC successful, proceed
  - Migrate remaining workflows
  - Decommission Prefect
Total: 12 weeks (vs 8 weeks direct)
Appendix: Quick Reference
One-Page Summary
WHAT: Migrate from Prefect to Temporal
WHY: Simpler (6 services → 1), more scalable, better reliability
WHEN: Now (8 weeks)
WHO: 2 engineers
COST: $430/month (vs $700 current) = 39% savings
RISK: Medium-Low (manageable)
OUTCOME: Production-ready infrastructure with clear scaling path
Key Metrics
| Metric | Current | Future | Change |
|---|---|---|---|
| Services | 6 | 1 | -83% |
| Memory | 8GB | 4.5GB | -44% |
| Cost | $700/mo | $430/mo | -39% |
| Capacity | 10K/day | 10K/day | Same (Phase 1) |
| Dev Time | 6h/week | 3.5h/week | -42% |
Decision Checklist
- Review this document with team
- Discuss concerns and questions
- Vote: Proceed / Defer / Reject
- If proceed: Assign engineers
- If proceed: Set start date
- If defer: Set review date (3 months)
- If reject: Document reasons
Document Version: 1.0
Last Updated: 2025-09-30
Next Review: After decision or in 3 months