# FuzzForge AI: Migration Decision Document **Date:** 2025-10-01 (Updated) **Status:** Architecture Revised - Ready for Implementation **Decision Makers:** FuzzingLabs Team **Recommendation:** Migrate to Temporal with Vertical Workers + MinIO --- ## πŸ”„ CRITICAL UPDATE (2025-10-01) **Initial analysis was incomplete.** The original architecture document missed a critical requirement: > **"Workflows are dynamic and have to be created without modifying the codebase"** ### What Changed The original plan proposed "no registry needed" with long-lived workers, but failed to address how dynamic workflows with custom dependencies would work. This created a fundamental contradiction. ### Revised Architecture **New approach: Vertical Workers + MinIO** | Aspect | Original Plan | Revised Plan | |--------|--------------|--------------| | **Workers** | Generic long-lived | **Vertical-specific** (Android, Rust, Web, iOS, etc.) | | **Toolchains** | Install per workflow | **Pre-built per vertical** | | **Workflows** | Unclear | **Mounted as volume** (no rebuild) | | **Storage** | LocalVolumeStorage (dev) / S3 (prod) | **MinIO everywhere** (unified) | | **Target Access** | Host filesystem mounts | **Upload to MinIO** (secure) | | **Registry** | Eliminated | **Eliminated** (workflows in volume, not images) | | **Services** | 1 (Temporal only) | 6 (Temporal + MinIO + 3+ vertical workers) | | **Memory** | "~4.5GB" | **~2.3GB** (realistic calculation) | ### Key Insights 1. **Dynamic workflows ARE compatible** with long-lived workers via volume mounting 2. **Verticals solve** the toolchain problem (pre-built, no per-workflow installs) 3. **MinIO is lightweight** (256MB with CI_CD=true) and provides unified storage 4. **No registry overhead** (workflow code mounted, not built into images) 5. **Better marketing** (sell "security verticals", not "orchestration platform") ### What This Means - βœ… Migration still recommended - βœ… Timeline extended to 10 weeks (from 8) - βœ… More services but better architecture - βœ… Addresses all original pain points - βœ… Supports dynamic workflows correctly **See ARCHITECTURE.md v2.0 for full details.** --- ## Table of Contents 1. [Executive Summary](#executive-summary) 2. [Current State Analysis](#current-state-analysis) 3. [Proposed Solution: Temporal Migration](#proposed-solution-temporal-migration) 4. [For & Against: Temporal vs Prefect](#for--against-temporal-vs-prefect) 5. [For & Against: Long-Lived vs Ephemeral Workers](#for--against-long-lived-vs-ephemeral-workers) 6. [Future Consideration: Nomad vs Kubernetes vs Docker Compose](#future-consideration-nomad-vs-kubernetes-vs-docker-compose) 7. [Benefits Summary](#benefits-summary) 8. [Risks & Mitigations](#risks--mitigations) 9. [Cost Analysis](#cost-analysis) 10. [Timeline & Effort](#timeline--effort) 11. [Licensing Considerations](#licensing-considerations) 12. [Recommendation](#recommendation) --- ## Executive Summary ### The Proposal **Migrate from Prefect to Temporal** for workflow orchestration, simplifying infrastructure from 6 services to 1 while maintaining module architecture and preparing for future scale. ### Why Consider This? Current Prefect setup has grown complex with: - 6 services to manage (Prefect, Postgres, Redis, Registry, Docker-proxy, Worker) - Unclear scaling path for high-volume production - Registry overhead for module isolation - Complex volume mounting configuration ### Key Decision Points | Decision | Recommendation | Timeline | |----------|---------------|----------| | **Replace Prefect?** | βœ… Yes - with Temporal | Now (Weeks 1-8) | | **Worker Strategy?** | βœ… Long-lived containers | Now (Weeks 3-4) | | **Storage Strategy?** | βœ… Abstract layer (Localβ†’S3) | Now (Week 3) | | **Add Nomad?** | ⏳ Later - when 10+ hosts | 18-24 months | | **Add Kubernetes?** | ❌ No - unnecessary complexity | N/A | ### Bottom Line **Recommended:** Proceed with Temporal migration. - **Effort:** 8 weeks, Medium complexity - **Risk:** Low (rollback possible, modules unchanged) - **Benefit:** 83% infrastructure reduction, clear scaling path, better reliability --- ## Current State Analysis ### Prefect Architecture (Current) ``` Infrastructure: β”œβ”€ Prefect Server (orchestration) β”œβ”€ Postgres (metadata storage) β”œβ”€ Redis (task queue) β”œβ”€ Docker Registry (image sharing) β”œβ”€ Docker Proxy (container isolation) └─ Prefect Worker (execution) Total: 6 services ``` ### Strengths of Current Setup | Aspect | Rating | Notes | |--------|--------|-------| | **Familiarity** | βœ… High | Team knows Prefect well | | **Functionality** | βœ… Good | Workflows execute successfully | | **Module System** | βœ… Excellent | BaseModule interface is solid | | **Documentation** | βœ… Good | Internal docs exist | ### Pain Points | Issue | Impact | Frequency | Severity | |-------|--------|-----------|----------| | **Infrastructure Complexity** | Managing 6 services | Continuous | High | | **Registry Overhead** | Push/pull for every deployment | Every change | Medium | | **Unclear Scaling** | How to go multi-host? | Future planning | High | | **Resource Usage** | ~8GB under load | Continuous | Medium | | **Volume Mounting** | Complex job_variables config | Every workflow | Medium | ### Why Change Now? 1. **Planning for Scale:** Need clear path from 1 host β†’ multi-host β†’ cluster 2. **Infrastructure Debt:** 6 services growing harder to maintain 3. **Better Options Available:** Temporal provides simpler, more scalable solution 4. **Module System Stable:** Can migrate orchestration without touching modules 5. **Right Time:** Before production scale makes migration harder --- ## Proposed Solution: Temporal Migration ### Target Architecture ``` Infrastructure: β”œβ”€ Temporal Server (orchestration + storage) └─ Worker Pools (3 types, auto-discover modules) Total: 1 service (+ workers) ``` ### Migration Phases **Phase 1: Single Host (Weeks 1-8)** - Replace Prefect with Temporal - Long-lived worker pools - LocalVolumeStorage (volume mounts) - Capacity: 15-50 concurrent workflows **Phase 2: Multi-Host (Months 6-18)** - Same architecture, multiple hosts - Switch to S3CachedStorage - Capacity: 3Γ— Phase 1 **Phase 3: Nomad Cluster (Months 18+, if needed)** - Add Nomad for advanced orchestration - Auto-scaling, multi-tenancy - Capacity: Unlimited horizontal scaling --- ## For & Against: Temporal vs Prefect ### Option A: Keep Prefect (Status Quo) #### βœ… For (Arguments to Keep Prefect) 1. **No Migration Effort** - Zero weeks of migration work - No learning curve - No risk of migration issues 2. **Team Familiarity** - Team knows Prefect well - Existing operational runbooks - Established debugging patterns 3. **Working System** - Current workflows function correctly - No immediate technical blocker - "If it ain't broke, don't fix it" 4. **Deferred Complexity** - Can delay architecture decisions - Focus on feature development - Postpone infrastructure changes #### ❌ Against (Arguments Against Keeping Prefect) 1. **Infrastructure Complexity** - 6 services to manage and monitor - Complex dependencies (Postgres, Redis, Registry) - High operational overhead 2. **Scaling Uncertainty** - Unclear how to scale beyond single host - Registry becomes bottleneck at scale - No clear multi-host story 3. **Resource Inefficiency** - ~2GB idle, ~8GB under load - Registry storage overhead - Redundant service layers 4. **Technical Debt Accumulation** - Complexity will only increase - Harder to migrate later (more workflows) - Missing modern features (durable execution) 5. **Prefect Ecosystem Concerns** - Prefect 3.x changes from 2.x - Community split (Cloud vs self-hosted) - Uncertain long-term roadmap ### Option B: Migrate to Temporal (Recommended) #### βœ… For (Arguments to Migrate) 1. **Dramatic Simplification** - 6 services β†’ 1 service (83% reduction) - No registry needed (local images) - Simpler volume mounting 2. **Better Reliability** - Durable execution (workflows survive crashes) - Built-in state persistence - Proven at massive scale (Netflix, Uber, Snap) 3. **Clear Scaling Path** - Single host β†’ Multi-host β†’ Nomad cluster - Architecture designed for scale - Storage abstraction enables seamless transition 4. **Superior Workflow Engine** - True durable execution vs task queue - Better state management - Handles long-running workflows (fuzzing campaigns) - Activity timeouts and retries built-in 5. **Operational Benefits** - Better Web UI for debugging - Comprehensive workflow history - Query workflow state at any time - Simpler deployment (single service) 6. **Future-Proof Architecture** - Easy Nomad migration path (18+ months) - Multi-tenancy ready (namespaces) - Auto-scaling capable - Industry momentum (growing adoption) 7. **Module Preservation** - Zero changes to BaseModule interface - Module discovery unchanged - Workflows adapt easily (@flow β†’ @workflow) 8. **Resource Efficiency** - ~1GB idle, ~4.5GB under load - 44% reduction in resource usage - No registry storage overhead #### ❌ Against (Arguments Against Migration) 1. **Migration Effort** - 8 weeks of focused work - Team capacity diverted from features - Testing and validation required 2. **Learning Curve** - New concepts (workflows vs activities) - Different debugging approach - Team training needed 3. **Migration Risk** - Potential for workflow disruption - Bugs in migration code - Temporary performance issues 4. **Unknown Unknowns** - May discover edge cases - Performance characteristics differ - Integration challenges possible 5. **Temporal Limitations** - Less mature than Prefect in some areas - Smaller community (growing) - Fewer pre-built integrations ### Scoring Matrix | Criteria | Weight | Prefect | Temporal | Winner | |----------|--------|---------|----------|--------| | **Infrastructure Complexity** | 25% | 3/10 | 9/10 | Temporal | | **Scalability** | 20% | 4/10 | 9/10 | Temporal | | **Reliability** | 20% | 7/10 | 10/10 | Temporal | | **Migration Effort** | 15% | 10/10 | 4/10 | Prefect | | **Team Familiarity** | 10% | 9/10 | 3/10 | Prefect | | **Resource Efficiency** | 10% | 5/10 | 8/10 | Temporal | | **Total** | 100% | **5.5/10** | **7.65/10** | **Temporal** | **Conclusion:** Temporal wins on technical merit despite migration costs. --- ## For & Against: Long-Lived vs Ephemeral Workers ### Context Workers can spawn ephemeral containers per workflow (like Prefect) or run as long-lived containers processing multiple workflows. ### Option A: Ephemeral Containers #### βœ… For 1. **Complete Isolation** - Each workflow in fresh container - No state leakage between workflows - Maximum security 2. **Automatic Cleanup** - Containers destroyed after workflow - No resource leaks - Clean slate every time 3. **Matches Current Behavior** - Similar to Prefect approach - Easier mental model - Less architecture change 4. **Simple Development** - Test with `docker run` - No complex lifecycle management - Easy to debug #### ❌ Against 1. **Performance Overhead** - 5 second startup per container - At 450 workflows/hour: 625 minutes wasted - Unacceptable at production scale 2. **Resource Churn** - Constant container creation/destruction - Docker daemon overhead - Network/volume setup repeated 3. **Scaling Limitations** - Can't handle high-volume workloads - Startup overhead compounds - Poor resource utilization ### Option B: Long-Lived Workers (Recommended) #### βœ… For 1. **Zero Startup Overhead** - Containers already running - Immediate workflow execution - Critical for high-volume production 2. **Resource Efficiency** - Fixed 4.5GB RAM handles 15 concurrent workflows - vs ~76GB for ephemeral approach - 10-20Γ— better resource utilization 3. **Predictable Performance** - Consistent response times - No container startup jitter - Better SLA capability 4. **Horizontal Scaling** - Add more workers linearly - Each worker handles N concurrent - Clear capacity planning 5. **Production-Ready** - Proven pattern (Uber, Airbnb) - Handles thousands of workflows/day - Industry standard for scale #### ❌ Against 1. **Volume Mounting Complexity** - Must mount parent directories - Or implement S3 storage backend - More sophisticated configuration 2. **Shared Container State** - Workers reused across workflows - Potential for subtle bugs - Requires careful module design 3. **Lifecycle Management** - Must handle worker restarts - Graceful shutdown needed - More complex monitoring 4. **Memory Management** - Workers accumulate memory over time - Need periodic restarts - Requires memory limits ### Decision Matrix | Scenario | Ephemeral | Long-Lived | Winner | |----------|-----------|------------|--------| | **Development** | βœ… Simpler | ⚠️ Complex | Ephemeral | | **Low Volume (<10/hour)** | βœ… Acceptable | βœ… Overkill | Ephemeral | | **Medium Volume (10-100/hour)** | ⚠️ Wasteful | βœ… Efficient | Long-Lived | | **High Volume (>100/hour)** | ❌ Unusable | βœ… Required | Long-Lived | | **Production Scale** | ❌ No | βœ… Yes | Long-Lived | **Recommendation:** Long-lived workers for production deployment. **Compromise:** Can start with ephemeral for Phase 1 (proof of concept), migrate to long-lived for Phase 2 (production). --- ## Future Consideration: Nomad vs Kubernetes vs Docker Compose ### When to Consider Orchestration Beyond Docker Compose? **Trigger Points:** - βœ… Managing 10+ hosts manually - βœ… Need multi-tenancy (customer isolation) - βœ… Require auto-scaling based on metrics - βœ… Want sophisticated scheduling (bin-packing, constraints) **Timeline Estimate:** 18-24 months from now ### Option A: Docker Compose (Recommended for Phase 1-2) #### βœ… For 1. **Simplicity** - Single YAML file - No cluster setup - Easy to understand and debug 2. **Zero Learning Curve** - Team already knows Docker - Familiar commands - Abundant documentation 3. **Sufficient for 1-5 Hosts** - Deploy same compose file to each host - Manual but manageable - Works for current scale 4. **Development Friendly** - Same config dev and prod - Fast iteration cycle - Easy local testing 5. **No Lock-In** - Easy to migrate to Nomad/K8s later - Workers portable by design - Clean exit strategy #### ❌ Against 1. **Manual Coordination** - No automatic scheduling - Manual load balancing - No health-based rescheduling 2. **Limited Scaling** - Practical limit ~5-10 hosts - No auto-scaling - Manual capacity planning 3. **No Multi-Tenancy** - Can't isolate customers - No resource quotas - Shared infrastructure 4. **Basic Monitoring** - No cluster-wide metrics - Per-host monitoring only - Limited observability **Verdict:** Perfect for Phase 1 (single host) and Phase 2 (3-5 hosts). Transition to Nomad/K8s at Phase 3. ### Option B: Nomad (Recommended for Phase 3) #### βœ… For 1. **Operational Simplicity** - Single binary (vs K8s complexity) - Easy to install and maintain - Lower operational overhead 2. **Perfect Fit for Use Case** - Batch workload focus - Resource management built-in - Namespace support for multi-tenancy 3. **Multi-Workload Support** - Containers (Docker) - VMs (QEMU) - Bare processes - Java JARs - All in one scheduler 4. **Scheduling Intelligence** - Bin-packing for efficiency - Constraint-based placement - Affinity/anti-affinity rules - Resource quotas per namespace 5. **Easy Migration from Docker Compose** - Similar concepts - `compose-to-nomad` converter tool - Workers unchanged - 1-2 week migration 6. **HashiCorp Ecosystem** - Integrates with Consul (service discovery) - Integrates with Vault (secrets) - Proven at scale (Cloudflare, CircleCI) 7. **Auto-Scaling** - Built-in scaling policies - Prometheus integration - Queue-depth based scaling - Horizontal scaling automatic #### ❌ Against 1. **Learning Curve** - HCL syntax to learn - New concepts (allocations, deployments) - Consul integration complexity 2. **Smaller Ecosystem** - Fewer tools than Kubernetes - Smaller community - Less third-party integrations 3. **Network Isolation** - Less sophisticated than K8s - Requires Consul Connect for service mesh - Weaker network policies 4. **Maturity** - Less mature than Kubernetes - Fewer production battle stories - Evolving feature set **Verdict:** Excellent choice when outgrow Docker Compose. Simpler than K8s, perfect for FuzzForge scale. ### Option C: Kubernetes #### βœ… For 1. **Industry Standard** - Largest ecosystem - Most third-party integrations - Abundant expertise available 2. **Feature Richness** - Sophisticated networking (Network Policies) - Advanced scheduling - Rich operator ecosystem - Helm charts for everything 3. **Multi-Tenancy** - Strong namespace isolation - RBAC fine-grained - Network policies - Pod Security Policies 4. **Massive Scale** - Proven to 5,000+ nodes - Google-scale reliability - Battle-tested 5. **Cloud Integration** - Native on all clouds (EKS, GKE, AKS) - Managed offerings reduce complexity - Auto-scaling (HPA, Cluster Autoscaler) #### ❌ Against 1. **Operational Complexity** - High learning curve - Complex to set up and maintain - Requires dedicated ops team 2. **Resource Overhead** - Control plane resource usage - etcd cluster management - More moving parts 3. **Overkill for Use Case** - FuzzForge is batch workload, not microservices - Don't need K8s networking complexity - Simpler alternatives sufficient 4. **Container-Only** - Can't run VMs easily - Can't run bare processes - Nomad more flexible 5. **Cost** - Higher operational cost - More infrastructure required - Steeper learning investment **Verdict:** Overkill for FuzzForge. Choose only if planning 1,000+ hosts or need extensive ecosystem. ### Comparison Matrix | Feature | Docker Compose | Nomad | Kubernetes | |---------|---------------|-------|------------| | **Operational Complexity** | β˜…β˜†β˜†β˜†β˜† (Lowest) | β˜…β˜…β˜†β˜†β˜† (Low) | β˜…β˜…β˜…β˜…β˜† (High) | | **Learning Curve** | β˜…β˜†β˜†β˜†β˜† (Easy) | β˜…β˜…β˜…β˜†β˜† (Medium) | β˜…β˜…β˜…β˜…β˜… (Steep) | | **Setup Time** | Minutes | 1 day | 1-2 weeks | | **Best For** | 1-5 hosts | 10-500 hosts | 500+ hosts | | **Auto-Scaling** | ❌ No | βœ… Yes | βœ… Yes | | **Multi-Tenancy** | ❌ No | βœ… Yes (Namespaces) | βœ… Yes (Advanced) | | **Workload Types** | Containers | Containers + VMs + Processes | Containers (mainly) | | **Service Mesh** | ❌ No | ⚠️ Via Consul Connect | βœ… Istio/Linkerd | | **Ecosystem Size** | Medium | Small | Huge | | **Resource Efficiency** | High | High | Medium | | **FuzzForge Fit** | βœ… Phase 1-2 | βœ… Phase 3+ | ⚠️ Unnecessary | ### Recommendation Timeline ``` Months 0-6: Docker Compose (Single Host) └─ Simplest, fastest to implement Months 6-18: Docker Compose (Multi-Host) └─ Scale to 3-5 hosts manually Months 18+: Nomad (if needed) └─ Add when 10+ hosts or auto-scaling required Never: Kubernetes └─ Unless scale exceeds 500+ hosts ``` --- ## Benefits Summary ### Infrastructure Benefits | Metric | Current (Prefect) | Future (Temporal) | Improvement | |--------|-------------------|-------------------|-------------| | **Services to Manage** | 6 | 1 | 83% reduction | | **Idle Memory Usage** | ~2GB | ~1GB | 50% reduction | | **Load Memory Usage** | ~8GB | ~4.5GB | 44% reduction | | **Docker Registry** | Required | Not needed | Eliminated | | **Configuration Files** | 6 service configs | 1 config | 83% simpler | | **Deployment Complexity** | High | Low | Significant | ### Operational Benefits 1. **Simpler Monitoring** - 1 service vs 6 - Single Web UI (Temporal) - Fewer alerts to configure 2. **Easier Debugging** - Complete workflow history in Temporal - Query workflow state at any time - Better error visibility 3. **Faster Deployments** - No registry push/pull - Restart 1 service vs 6 - Quicker iteration cycles 4. **Better Reliability** - Durable execution (workflows survive crashes) - Automatic retries built-in - State persistence guaranteed 5. **Clear Scaling Path** - Phase 1: Single host (now) - Phase 2: Multi-host (6-18 months) - Phase 3: Nomad cluster (18+ months) ### Developer Experience Benefits 1. **Local Development** - Simpler docker-compose - Faster startup (fewer services) - Easier to reason about 2. **Module Development** - No changes to BaseModule - Same discovery mechanism - Same testing approach 3. **Workflow Development** - Better debugging tools (Temporal Web UI) - Workflow history visualization - Easier to test retry logic 4. **Onboarding** - 1 service to understand vs 6 - Clearer architecture - Less to learn --- ## Risks & Mitigations ### Risk 1: Migration Introduces Bugs **Likelihood:** Medium **Impact:** High **Risk Score:** 6/10 **Mitigation:** - Phased migration (one workflow at a time) - Parallel run (Prefect + Temporal) during transition - Comprehensive testing before cutover - Rollback plan documented ### Risk 2: Performance Degradation **Likelihood:** Low **Impact:** Medium **Risk Score:** 3/10 **Mitigation:** - Load testing before production - Monitor key metrics during migration - Temporal proven at higher scale than current - Easy to tune worker concurrency ### Risk 3: Team Learning Curve **Likelihood:** High **Impact:** Low **Risk Score:** 4/10 **Mitigation:** - Training sessions on Temporal concepts - Pair programming during migration - Comprehensive documentation - Temporal has excellent docs ### Risk 4: Unknown Edge Cases **Likelihood:** Medium **Impact:** Medium **Risk Score:** 5/10 **Mitigation:** - Thorough testing with real workflows - Gradual rollout (dev β†’ staging β†’ production) - Keep Prefect running initially - Community support available ### Risk 5: Module System Incompatibility **Likelihood:** Very Low **Impact:** High **Risk Score:** 2/10 **Mitigation:** - Module interface preserved (BaseModule unchanged) - Only orchestration changes - Modules are decoupled from Prefect - Test suite validates module behavior ### Risk 6: Long-Lived Worker Stability **Likelihood:** Low **Impact:** Medium **Risk Score:** 3/10 **Mitigation:** - Proper resource limits (memory, CPU) - Periodic worker restarts (daily) - Monitoring for memory leaks - Health checks and auto-restart ### Overall Risk Assessment **Total Risk Score:** 23/60 (38%) - **Medium-Low Risk** **Conclusion:** Risks are manageable with proper planning and mitigation strategies. --- ## Cost Analysis ### Current Costs (Prefect) **Infrastructure:** ``` Single Host (8GB RAM, 4 CPU): - Cloud VM: $80-120/month - Or bare metal amortized: ~$50/month Services Running: - Prefect Server: ~500MB - Postgres: ~200MB - Redis: ~100MB - Registry: ~500MB - Docker Proxy: ~50MB - Worker: ~500MB - Workflows: ~6GB (peak) Total: ~8GB Development Time: - Maintenance: ~2 hours/week - Debugging: ~3 hours/week - Deployments: ~1 hour/week Total: 6 hours/week = $600/month (at $25/hour) ``` **Monthly Total:** ~$700/month ### Future Costs (Temporal) **Phase 1 - Single Host:** ``` Single Host (6GB RAM, 4 CPU): - Cloud VM: $60-80/month - Or bare metal amortized: ~$40/month Services Running: - Temporal: ~1GB - Workers: ~3.5GB - Workflows: ~1GB (peak) Total: ~5.5GB Development Time: - Maintenance: ~1 hour/week - Debugging: ~2 hours/week - Deployments: ~0.5 hour/week Total: 3.5 hours/week = $350/month ``` **Monthly Total:** ~$430/month **Phase 2 - Multi-Host (3 hosts):** ``` 3 Hosts + S3 Storage: - Cloud VMs: $180-240/month - S3 storage (1TB): ~$23/month - S3 transfer (100GB): ~$9/month Development Time: - Maintenance: ~2 hours/week - Monitoring: ~2 hours/week Total: 4 hours/week = $400/month ``` **Monthly Total:** ~$670/month (3Γ— capacity) **Phase 3 - Nomad Cluster (10+ hosts):** ``` Nomad Cluster: - 3 Nomad servers: $120/month - 10 worker hosts: $800/month - S3 storage (5TB): ~$115/month - Load balancer: ~$20/month Development Time: - Nomad maintenance: ~3 hours/week - Monitoring: ~3 hours/week Total: 6 hours/week = $600/month ``` **Monthly Total:** ~$1,655/month (10Γ— capacity) ### Cost Comparison | Phase | Hosts | Capacity | Monthly Cost | Cost per Workflow | |-------|-------|----------|--------------|-------------------| | **Current (Prefect)** | 1 | 10K/day | $700 | $0.0023 | | **Phase 1 (Temporal)** | 1 | 10K/day | $430 | $0.0014 | | **Phase 2 (Temporal)** | 3 | 30K/day | $670 | $0.0007 | | **Phase 3 (Nomad)** | 10 | 100K/day | $1,655 | $0.0005 | **Savings:** - Phase 1 vs Current: **$270/month (39% reduction)** - Better cost efficiency as scale increases --- ## Timeline & Effort ### Phase 1: Temporal Migration (8 Weeks) **Week 1-2: Foundation** - Deploy Temporal server - Remove Prefect infrastructure - Implement storage abstraction layer - Effort: 60-80 hours **Week 3-4: Workers** - Create long-lived worker pools - Implement module auto-discovery - Configure Docker Compose - Effort: 60-80 hours **Week 5-6: Workflows** - Migrate workflows to Temporal - Convert @flow β†’ @workflow.defn - Test all workflows - Effort: 60-80 hours **Week 7: Integration** - Update backend API - End-to-end testing - Load testing - Effort: 40-60 hours **Week 8: Documentation & Cleanup** - Update documentation - Remove old code - Training sessions - Effort: 30-40 hours **Total Effort:** 250-340 hours (~2 engineers for 2 months) ### Phase 2: Multi-Host (When Needed) **Effort:** 40-60 hours - Set up S3 storage - Deploy to multiple hosts - Configure load balancing - Test and validate ### Phase 3: Nomad (If Needed) **Effort:** 80-120 hours - Install Nomad cluster - Convert jobs to Nomad - Set up auto-scaling - Production deployment --- ## Licensing Considerations ### Overview **Critical Context:** FuzzForge is a **generic platform** where modules and workflows "could be anything" - not limited to fuzzing or security analysis. This significantly impacts the licensing assessment, particularly for Nomad's Business Source License. ### Temporal Licensing: βœ… SAFE **License:** MIT License **Status:** Fully open source, zero restrictions **Commercial Use:** - βœ… Use in production - βœ… Sell services built on Temporal - βœ… Modify source code - βœ… Redistribute - βœ… Sublicense - βœ… Private use **Conclusion:** Temporal has **no licensing concerns** for any use case. You can build any type of platform (fuzzing, security, generic workflows, orchestration-as-a-service) without legal risk. **Reference:** https://github.com/temporalio/temporal/blob/master/LICENSE --- ### Nomad Licensing: ⚠️ REQUIRES CAREFUL EVALUATION **License:** Business Source License 1.1 (BSL 1.1) **Status:** Source-available but with restrictions #### BSL 1.1 Key Terms **Change Date:** 4 years after each version release **Change License:** Mozilla Public License 2.0 (MPL 2.0) **After 4 years:** Each version becomes fully open source under MPL 2.0 #### The Critical Restriction ``` Additional Use Grant: You may make use of the Licensed Work, provided that you do not use the Licensed Work for a Competitive Offering. A "Competitive Offering" is a commercial product or service that is: 1. Substantially similar to the capabilities of the Licensed Work 2. Offered to third parties on a paid or free basis ``` #### What This Means for FuzzForge **The licensing risk depends on how FuzzForge is marketed and positioned:** ##### βœ… LIKELY SAFE: Specific Use Case Platform If FuzzForge is marketed as a **specialized platform** for specific domains: **Examples:** - βœ… "FuzzForge - Security Analysis Platform" - βœ… "FuzzForge - Automated Fuzzing Service" - βœ… "FuzzForge - Code Analysis Tooling" - βœ… "FuzzForge - Vulnerability Assessment Platform" **Why Safe:** - Nomad is used **internally** for infrastructure - Customer is buying **fuzzing/security services**, not orchestration - Platform's value is the **domain expertise**, not the scheduler - Not competing with HashiCorp's offerings ##### ⚠️ GRAY AREA: Generic Workflow Platform If FuzzForge pivots to emphasize **generic workflow capabilities**: **Examples:** - ⚠️ "FuzzForge - Workflow Orchestration Platform" - ⚠️ "FuzzForge - Run any containerized workload" - ⚠️ "FuzzForge - Generic task scheduler" - ⚠️ Marketing that emphasizes "powered by Nomad" **Why Risky:** - Could be seen as competing with Nomad Enterprise - Offering similar capabilities to HashiCorp's products - Customer might use it as Nomad replacement ##### ❌ CLEARLY VIOLATES: Orchestration-as-a-Service If FuzzForge becomes primarily an **orchestration product**: **Examples:** - ❌ "FuzzForge Orchestrator - Schedule any workload" - ❌ "Nomad-as-a-Service powered by FuzzForge" - ❌ "Generic container orchestration platform" - ❌ Reselling Nomad capabilities with thin wrapper **Why Violation:** - Directly competing with HashiCorp Nomad offerings - "Substantially similar" to Nomad's capabilities - Commercial offering of orchestration #### Real-World Precedents **HashiCorp has NOT** (as of 2025) aggressively enforced BSL against companies using their tools internally. The restriction targets: - Cloud providers offering "managed Nomad" services - Companies building Nomad competitors - Vendors reselling HashiCorp functionality **NOT targeting:** - Companies using Nomad for internal infrastructure - SaaS platforms that happen to use Nomad - Domain-specific platforms (like FuzzForge's security focus) #### Decision Tree: Should I Use Nomad? ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Is orchestration your core product? β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ YES NO β”‚ β”‚ β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”‚ DON'T β”‚ β”‚ What's β”‚ β”‚ USE β”‚ β”‚ your β”‚ β”‚ NOMAD β”‚ β”‚ value β”‚ β”‚ β”‚ β”‚ prop? β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ Domain Expertise Orchestration Features (Fuzzing, Security) (Scheduling, Auto-scale) β”‚ β”‚ β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”‚ SAFE TO β”‚ β”‚ RISKY - β”‚ β”‚ USE β”‚ β”‚ CONSULT β”‚ β”‚ NOMAD β”‚ β”‚ LAWYER β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` #### FuzzForge Current Position **Current Positioning:** Domain-specific security/analysis platform **Nomad Usage:** Internal infrastructure (not customer-facing) **Risk Level:** **LOW** (likely safe) **However**, user stated: _"modules and workflows could be anything"_ - this suggests potential future expansion beyond security domain. **If FuzzForge pivots to generic platform:** - Risk increases from LOW β†’ MEDIUM - Need legal review before Phase 3 (Nomad migration) - Consider Kubernetes as alternative --- ### Kubernetes Licensing: βœ… SAFE **License:** Apache License 2.0 **Status:** Fully open source, zero restrictions **Commercial Use:** - βœ… Use in production - βœ… Sell services built on Kubernetes - βœ… Modify source code - βœ… Offer managed Kubernetes (AWS EKS, GCP GKE do this) - βœ… Build competitive offerings **Conclusion:** Kubernetes has **no licensing concerns** whatsoever, even for orchestration-as-a-service offerings. --- ### Docker Licensing: βœ… SAFE **License:** Apache License 2.0 **Status:** Fully open source **Note:** Docker Desktop has separate commercial licensing requirements for organizations >250 employees or >$10M revenue, but Docker Engine (which FuzzForge uses) remains free for all uses. --- ### Licensing Recommendation Matrix | Component | License | FuzzForge Risk | Recommendation | |-----------|---------|----------------|----------------| | **Temporal** | MIT | βœ… None | Use freely | | **Docker Engine** | Apache 2.0 | βœ… None | Use freely | | **Nomad** | BSL 1.1 | ⚠️ Low-Medium | Safe if domain-specific | | **Kubernetes** | Apache 2.0 | βœ… None | Safe alternative to Nomad | --- ### Recommendations by Phase #### Phase 1 & 2: Temporal + Docker Compose **Licenses:** MIT (Temporal) + Apache 2.0 (Docker) **Risk:** βœ… **ZERO** - Fully safe for any use case **Action:** Proceed without legal review required --- #### Phase 3: Adding Nomad (18+ months) **License:** BSL 1.1 **Risk:** ⚠️ **LOW-MEDIUM** - Depends on positioning **Action Required BEFORE Migration:** 1. **Clarify Product Positioning** - Will FuzzForge market as generic platform? - Or remain domain-specific (security/fuzzing)? 2. **Legal Review** (Recommended) - Consult IP lawyer familiar with BSL - Show marketing materials, website copy - Get written opinion on BSL compliance - Cost: $2,000-5,000 (one-time) 3. **Decision Point:** ``` IF positioning = domain-specific (security/fuzzing) THEN proceed with Nomad (low risk) ELSE IF positioning = generic platform THEN consider Kubernetes instead (zero risk) ``` --- #### Alternative: Use Kubernetes Instead of Nomad **If concerned about Nomad BSL risk:** **Pros:** - βœ… Zero licensing risk (Apache 2.0) - βœ… Can offer orchestration-as-a-service freely - βœ… Larger ecosystem and community - βœ… Managed offerings on all clouds **Cons:** - ❌ Higher operational complexity than Nomad - ❌ Overkill for batch workload use case - ❌ Steeper learning curve **When to Choose K8s Over Nomad:** - Planning to market as generic platform - Uncomfortable with BSL restrictions - Need absolute licensing certainty - Have K8s expertise already --- ### Licensing Risk Summary | Scenario | Temporal | Docker | Nomad | Kubernetes | |----------|----------|--------|-------|------------| | **Security platform (current)** | βœ… Safe | βœ… Safe | βœ… Safe | βœ… Safe | | **Generic workflow platform** | βœ… Safe | βœ… Safe | ⚠️ Risky | βœ… Safe | | **Orchestration-as-a-service** | βœ… Safe | βœ… Safe | ❌ Violation | βœ… Safe | --- ### Key Takeaways 1. **Temporal is completely safe** - MIT license has zero restrictions for any use case 2. **Nomad's BSL depends on positioning**: - βœ… Safe for domain-specific platforms (security, fuzzing) - ⚠️ Risky for generic workflow platforms - ❌ Violation for orchestration-as-a-service 3. **User's statement matters**: _"modules could be anything"_ suggests generic platform potential β†’ increases Nomad risk 4. **Mitigation strategies**: - Keep marketing focused on domain expertise - Get legal review before Phase 3 (Nomad) - Alternative: Use Kubernetes (Apache 2.0) instead 5. **Decision timing**: No urgency - Nomad decision is 18+ months away (Phase 3) 6. **Recommended approach**: ``` Now β†’ Phase 1-2: Temporal + Docker Compose (zero risk) 18 months β†’ Phase 3: Re-evaluate positioning β†’ Domain-specific? Use Nomad β†’ Generic platform? Use Kubernetes ``` --- ## Recommendation ### Primary Recommendation: **PROCEED WITH TEMPORAL MIGRATION** **Confidence Level:** High (8/10) ### Rationale 1. **Technical Benefits Outweigh Costs** - 83% infrastructure reduction - 44% resource savings - Clear scaling path - Better reliability 2. **Manageable Risks** - Low-medium risk profile - Good mitigation strategies - Rollback plan exists - Module system preserved 3. **Right Timing** - Before production scale makes migration harder - Team capacity available - Module architecture stable - Clear 8-week timeline 4. **Future-Proof** - Easy Nomad migration when needed - Multi-host ready (storage abstraction) - Industry-proven technology - Growing ecosystem ### Phased Approach **Immediate (Now):** - βœ… Approve Temporal migration - βœ… Allocate 2 engineers for 8 weeks - βœ… Set Week 1 start date **Near-Term (Months 1-6):** - βœ… Complete Temporal migration - βœ… Validate in production - βœ… Optimize performance **Mid-Term (Months 6-18):** - ⏳ Monitor scaling needs - ⏳ Implement S3 storage if needed - ⏳ Expand to multi-host if needed **Long-Term (Months 18+):** - ⏳ Evaluate Nomad necessity - ⏳ Migrate to Nomad if triggers met - ⏳ Continue scaling horizontally ### Decision Criteria **Proceed with Migration if:** - βœ… Team agrees on benefits (CHECK) - βœ… 8-week timeline acceptable (CHECK) - βœ… Resources available (CHECK) - βœ… Risk profile acceptable (CHECK) **Defer Migration if:** - ❌ Critical features launching soon (DEPENDS) - ❌ Team capacity constrained (DEPENDS) - ❌ Major Prefect improvements announced (UNLIKELY) ### Alternative: Start Smaller **If full migration seems risky:** 1. **Proof of Concept (2 weeks)** - Migrate one simple workflow - Validate Temporal locally - Assess complexity - Decision point: Continue or abort 2. **Parallel Run (4 weeks)** - Run Temporal alongside Prefect - Duplicate one workflow - Compare results - Build confidence 3. **Full Migration (6 weeks)** - If POC successful, proceed - Migrate remaining workflows - Decommission Prefect **Total:** 12 weeks (vs 8 weeks direct) --- ## Appendix: Quick Reference ### One-Page Summary **WHAT:** Migrate from Prefect to Temporal **WHY:** Simpler (6 services β†’ 1), more scalable, better reliability **WHEN:** Now (8 weeks) **WHO:** 2 engineers **COST:** $430/month (vs $700 current) = 39% savings **RISK:** Medium-Low (manageable) **OUTCOME:** Production-ready infrastructure with clear scaling path ### Key Metrics | Metric | Current | Future | Change | |--------|---------|--------|--------| | Services | 6 | 1 | -83% | | Memory | 8GB | 4.5GB | -44% | | Cost | $700/mo | $430/mo | -39% | | Capacity | 10K/day | 10K/day | Same (Phase 1) | | Dev Time | 6h/week | 3.5h/week | -42% | ### Decision Checklist - [ ] Review this document with team - [ ] Discuss concerns and questions - [ ] Vote: Proceed / Defer / Reject - [ ] If proceed: Assign engineers - [ ] If proceed: Set start date - [ ] If defer: Set review date (3 months) - [ ] If reject: Document reasons --- **Document Version:** 1.0 **Last Updated:** 2025-09-30 **Next Review:** After decision or in 3 months