mirror of
https://github.com/FuzzingLabs/fuzzforge_ai.git
synced 2026-02-13 00:32:47 +00:00
* feat: Complete migration from Prefect to Temporal BREAKING CHANGE: Replaces Prefect workflow orchestration with Temporal ## Major Changes - Replace Prefect with Temporal for workflow orchestration - Implement vertical worker architecture (rust, android) - Replace Docker registry with MinIO for unified storage - Refactor activities to be co-located with workflows - Update all API endpoints for Temporal compatibility ## Infrastructure - New: docker-compose.temporal.yaml (Temporal + MinIO + workers) - New: workers/ directory with rust and android vertical workers - New: backend/src/temporal/ (manager, discovery) - New: backend/src/storage/ (S3-cached storage with MinIO) - New: backend/toolbox/common/ (shared storage activities) - Deleted: docker-compose.yaml (old Prefect setup) - Deleted: backend/src/core/prefect_manager.py - Deleted: backend/src/services/prefect_stats_monitor.py - Deleted: Docker registry and insecure-registries requirement ## Workflows - Migrated: security_assessment workflow to Temporal - New: rust_test workflow (example/test workflow) - Deleted: secret_detection_scan (Prefect-based, to be reimplemented) - Activities now co-located with workflows for independent testing ## API Changes - Updated: backend/src/api/workflows.py (Temporal submission) - Updated: backend/src/api/runs.py (Temporal status/results) - Updated: backend/src/main.py (727 lines, TemporalManager integration) - Updated: All 16 MCP tools to use TemporalManager ## Testing - ✅ All services healthy (Temporal, PostgreSQL, MinIO, workers, backend) - ✅ All API endpoints functional - ✅ End-to-end workflow test passed (72 findings from vulnerable_app) - ✅ MinIO storage integration working (target upload/download, results) - ✅ Worker activity discovery working (6 activities registered) - ✅ Tarball extraction working - ✅ SARIF report generation working ## Documentation - ARCHITECTURE.md: Complete Temporal architecture documentation - QUICKSTART_TEMPORAL.md: Getting started guide - MIGRATION_DECISION.md: Why we chose Temporal over Prefect - IMPLEMENTATION_STATUS.md: Migration progress tracking - workers/README.md: Worker development guide ## Dependencies - Added: temporalio>=1.6.0 - Added: boto3>=1.34.0 (MinIO S3 client) - Removed: prefect>=3.4.18 * feat: Add Python fuzzing vertical with Atheris integration This commit implements a complete Python fuzzing workflow using Atheris: ## Python Worker (workers/python/) - Dockerfile with Python 3.11, Atheris, and build tools - Generic worker.py for dynamic workflow discovery - requirements.txt with temporalio, boto3, atheris dependencies - Added to docker-compose.temporal.yaml with dedicated cache volume ## AtherisFuzzer Module (backend/toolbox/modules/fuzzer/) - Reusable module extending BaseModule - Auto-discovers fuzz targets (fuzz_*.py, *_fuzz.py, fuzz_target.py) - Recursive search to find targets in nested directories - Dynamically loads TestOneInput() function - Configurable max_iterations and timeout - Real-time stats callback support for live monitoring - Returns findings as ModuleFinding objects ## Atheris Fuzzing Workflow (backend/toolbox/workflows/atheris_fuzzing/) - Temporal workflow for orchestrating fuzzing - Downloads user code from MinIO - Executes AtherisFuzzer module - Uploads results to MinIO - Cleans up cache after execution - metadata.yaml with vertical: python for routing ## Test Project (test_projects/python_fuzz_waterfall/) - Demonstrates stateful waterfall vulnerability - main.py with check_secret() that leaks progress - fuzz_target.py with Atheris TestOneInput() harness - Complete README with usage instructions ## Backend Fixes - Fixed parameter merging in REST API endpoints (workflows.py) - Changed workflow parameter passing from positional args to kwargs (manager.py) - Default parameters now properly merged with user parameters ## Testing ✅ Worker discovered AtherisFuzzingWorkflow ✅ Workflow executed end-to-end successfully ✅ Fuzz target auto-discovered in nested directories ✅ Atheris ran 100,000 iterations ✅ Results uploaded and cache cleaned * chore: Complete Temporal migration with updated CLI/SDK/docs This commit includes all remaining Temporal migration changes: ## CLI Updates (cli/) - Updated workflow execution commands for Temporal - Enhanced error handling and exceptions - Updated dependencies in uv.lock ## SDK Updates (sdk/) - Client methods updated for Temporal workflows - Updated models for new workflow execution - Updated dependencies in uv.lock ## Documentation Updates (docs/) - Architecture documentation for Temporal - Workflow concept documentation - Resource management documentation (new) - Debugging guide (new) - Updated tutorials and how-to guides - Troubleshooting updates ## README Updates - Main README with Temporal instructions - Backend README - CLI README - SDK README ## Other - Updated IMPLEMENTATION_STATUS.md - Removed old vulnerable_app.tar.gz These changes complete the Temporal migration and ensure the CLI/SDK work correctly with the new backend. * fix: Use positional args instead of kwargs for Temporal workflows The Temporal Python SDK's start_workflow() method doesn't accept a 'kwargs' parameter. Workflows must receive parameters as positional arguments via the 'args' parameter. Changed from: args=workflow_args # Positional arguments This fixes the error: TypeError: Client.start_workflow() got an unexpected keyword argument 'kwargs' Workflows now correctly receive parameters in order: - security_assessment: [target_id, scanner_config, analyzer_config, reporter_config] - atheris_fuzzing: [target_id, target_file, max_iterations, timeout_seconds] - rust_test: [target_id, test_message] * fix: Filter metadata-only parameters from workflow arguments SecurityAssessmentWorkflow was receiving 7 arguments instead of 2-5. The issue was that target_path and volume_mode from default_parameters were being passed to the workflow, when they should only be used by the system for configuration. Now filters out metadata-only parameters (target_path, volume_mode) before passing arguments to workflow execution. * refactor: Remove Prefect leftovers and volume mounting legacy Complete cleanup of Prefect migration artifacts: Backend: - Delete registry.py and workflow_discovery.py (Prefect-specific files) - Remove Docker validation from setup.py (no longer needed) - Remove ResourceLimits and VolumeMount models - Remove target_path and volume_mode from WorkflowSubmission - Remove supported_volume_modes from API and discovery - Clean up metadata.yaml files (remove volume/path fields) - Simplify parameter filtering in manager.py SDK: - Remove volume_mode parameter from client methods - Remove ResourceLimits and VolumeMount models - Remove Prefect error patterns from docker_logs.py - Clean up WorkflowSubmission and WorkflowMetadata models CLI: - Remove Volume Modes display from workflow info All removed features are Prefect-specific or Docker volume mounting artifacts. Temporal workflows use MinIO storage exclusively. * feat: Add comprehensive test suite and benchmark infrastructure - Add 68 unit tests for fuzzer, scanner, and analyzer modules - Implement pytest-based test infrastructure with fixtures - Add 6 performance benchmarks with category-specific thresholds - Configure GitHub Actions for automated testing and benchmarking - Add test and benchmark documentation Test coverage: - AtherisFuzzer: 8 tests - CargoFuzzer: 14 tests - FileScanner: 22 tests - SecurityAnalyzer: 24 tests All tests passing (68/68) All benchmarks passing (6/6) * fix: Resolve all ruff linting violations across codebase Fixed 27 ruff violations in 12 files: - Removed unused imports (Depends, Dict, Any, Optional, etc.) - Fixed undefined workflow_info variable in workflows.py - Removed dead code with undefined variables in atheris_fuzzer.py - Changed f-string to regular string where no placeholders used All files now pass ruff checks for CI/CD compliance. * fix: Configure CI for unit tests only - Renamed docker-compose.temporal.yaml → docker-compose.yml for CI compatibility - Commented out integration-tests job (no integration tests yet) - Updated test-summary to only depend on lint and unit-tests CI will now run successfully with 68 unit tests. Integration tests can be added later. * feat: Add CI/CD integration with ephemeral deployment model Implements comprehensive CI/CD support for FuzzForge with on-demand worker management: **Worker Management (v0.7.0)** - Add WorkerManager for automatic worker lifecycle control - Auto-start workers from stopped state when workflows execute - Auto-stop workers after workflow completion - Health checks and startup timeout handling (90s default) **CI/CD Features** - `--fail-on` flag: Fail builds based on SARIF severity levels (error/warning/note/info) - `--export-sarif` flag: Export findings in SARIF 2.1.0 format - `--auto-start`/`--auto-stop` flags: Control worker lifecycle - Exit code propagation: Returns 1 on blocking findings, 0 on success **Exit Code Fix** - Add `except typer.Exit: raise` handlers at 3 critical locations - Move worker cleanup to finally block for guaranteed execution - Exit codes now propagate correctly even when build fails **CI Scripts & Examples** - ci-start.sh: Start FuzzForge services with health checks - ci-stop.sh: Clean shutdown with volume preservation option - GitHub Actions workflow example (security-scan.yml) - GitLab CI pipeline example (.gitlab-ci.example.yml) - docker-compose.ci.yml: CI-optimized compose file with profiles **OSS-Fuzz Integration** - New ossfuzz_campaign workflow for running OSS-Fuzz projects - OSS-Fuzz worker with Docker-in-Docker support - Configurable campaign duration and project selection **Documentation** - Comprehensive CI/CD integration guide (docs/how-to/cicd-integration.md) - Updated architecture docs with worker lifecycle details - Updated workspace isolation documentation - CLI README with worker management examples **SDK Enhancements** - Add get_workflow_worker_info() endpoint - Worker vertical metadata in workflow responses **Testing** - All workflows tested: security_assessment, atheris_fuzzing, secret_detection, cargo_fuzzing - All monitoring commands tested: stats, crashes, status, finding - Full CI pipeline simulation verified - Exit codes verified for success/failure scenarios Ephemeral CI/CD model: ~3-4GB RAM, ~60-90s startup, runs entirely in CI containers. * fix: Resolve ruff linting violations in CI/CD code - Remove unused variables (run_id, defaults, result) - Remove unused imports - Fix f-string without placeholders All CI/CD integration files now pass ruff checks.
1389 lines
39 KiB
Markdown
1389 lines
39 KiB
Markdown
# FuzzForge AI: Migration Decision Document
|
||
|
||
**Date:** 2025-10-01 (Updated)
|
||
**Status:** Architecture Revised - Ready for Implementation
|
||
**Decision Makers:** FuzzingLabs Team
|
||
**Recommendation:** Migrate to Temporal with Vertical Workers + MinIO
|
||
|
||
---
|
||
|
||
## 🔄 CRITICAL UPDATE (2025-10-01)
|
||
|
||
**Initial analysis was incomplete.** The original architecture document missed a critical requirement:
|
||
|
||
> **"Workflows are dynamic and have to be created without modifying the codebase"**
|
||
|
||
### What Changed
|
||
|
||
The original plan proposed "no registry needed" with long-lived workers, but failed to address how dynamic workflows with custom dependencies would work. This created a fundamental contradiction.
|
||
|
||
### Revised Architecture
|
||
|
||
**New approach: Vertical Workers + MinIO**
|
||
|
||
| Aspect | Original Plan | Revised Plan |
|
||
|--------|--------------|--------------|
|
||
| **Workers** | Generic long-lived | **Vertical-specific** (Android, Rust, Web, iOS, etc.) |
|
||
| **Toolchains** | Install per workflow | **Pre-built per vertical** |
|
||
| **Workflows** | Unclear | **Mounted as volume** (no rebuild) |
|
||
| **Storage** | LocalVolumeStorage (dev) / S3 (prod) | **MinIO everywhere** (unified) |
|
||
| **Target Access** | Host filesystem mounts | **Upload to MinIO** (secure) |
|
||
| **Registry** | Eliminated | **Eliminated** (workflows in volume, not images) |
|
||
| **Services** | 1 (Temporal only) | 6 (Temporal + MinIO + 3+ vertical workers) |
|
||
| **Memory** | "~4.5GB" | **~2.3GB** (realistic calculation) |
|
||
|
||
### Key Insights
|
||
|
||
1. **Dynamic workflows ARE compatible** with long-lived workers via volume mounting
|
||
2. **Verticals solve** the toolchain problem (pre-built, no per-workflow installs)
|
||
3. **MinIO is lightweight** (256MB with CI_CD=true) and provides unified storage
|
||
4. **No registry overhead** (workflow code mounted, not built into images)
|
||
5. **Better marketing** (sell "security verticals", not "orchestration platform")
|
||
|
||
### What This Means
|
||
|
||
- ✅ Migration still recommended
|
||
- ✅ Timeline extended to 10 weeks (from 8)
|
||
- ✅ More services but better architecture
|
||
- ✅ Addresses all original pain points
|
||
- ✅ Supports dynamic workflows correctly
|
||
|
||
**See ARCHITECTURE.md v2.0 for full details.**
|
||
|
||
---
|
||
|
||
## Table of Contents
|
||
|
||
1. [Executive Summary](#executive-summary)
|
||
2. [Current State Analysis](#current-state-analysis)
|
||
3. [Proposed Solution: Temporal Migration](#proposed-solution-temporal-migration)
|
||
4. [For & Against: Temporal vs Prefect](#for--against-temporal-vs-prefect)
|
||
5. [For & Against: Long-Lived vs Ephemeral Workers](#for--against-long-lived-vs-ephemeral-workers)
|
||
6. [Future Consideration: Nomad vs Kubernetes vs Docker Compose](#future-consideration-nomad-vs-kubernetes-vs-docker-compose)
|
||
7. [Benefits Summary](#benefits-summary)
|
||
8. [Risks & Mitigations](#risks--mitigations)
|
||
9. [Cost Analysis](#cost-analysis)
|
||
10. [Timeline & Effort](#timeline--effort)
|
||
11. [Licensing Considerations](#licensing-considerations)
|
||
12. [Recommendation](#recommendation)
|
||
|
||
---
|
||
|
||
## Executive Summary
|
||
|
||
### The Proposal
|
||
|
||
**Migrate from Prefect to Temporal** for workflow orchestration, simplifying infrastructure from 6 services to 1 while maintaining module architecture and preparing for future scale.
|
||
|
||
### Why Consider This?
|
||
|
||
Current Prefect setup has grown complex with:
|
||
- 6 services to manage (Prefect, Postgres, Redis, Registry, Docker-proxy, Worker)
|
||
- Unclear scaling path for high-volume production
|
||
- Registry overhead for module isolation
|
||
- Complex volume mounting configuration
|
||
|
||
### Key Decision Points
|
||
|
||
| Decision | Recommendation | Timeline |
|
||
|----------|---------------|----------|
|
||
| **Replace Prefect?** | ✅ Yes - with Temporal | Now (Weeks 1-8) |
|
||
| **Worker Strategy?** | ✅ Long-lived containers | Now (Weeks 3-4) |
|
||
| **Storage Strategy?** | ✅ Abstract layer (Local→S3) | Now (Week 3) |
|
||
| **Add Nomad?** | ⏳ Later - when 10+ hosts | 18-24 months |
|
||
| **Add Kubernetes?** | ❌ No - unnecessary complexity | N/A |
|
||
|
||
### Bottom Line
|
||
|
||
**Recommended:** Proceed with Temporal migration.
|
||
- **Effort:** 8 weeks, Medium complexity
|
||
- **Risk:** Low (rollback possible, modules unchanged)
|
||
- **Benefit:** 83% infrastructure reduction, clear scaling path, better reliability
|
||
|
||
---
|
||
|
||
## Current State Analysis
|
||
|
||
### Prefect Architecture (Current)
|
||
|
||
```
|
||
Infrastructure:
|
||
├─ Prefect Server (orchestration)
|
||
├─ Postgres (metadata storage)
|
||
├─ Redis (task queue)
|
||
├─ Docker Registry (image sharing)
|
||
├─ Docker Proxy (container isolation)
|
||
└─ Prefect Worker (execution)
|
||
|
||
Total: 6 services
|
||
```
|
||
|
||
### Strengths of Current Setup
|
||
|
||
| Aspect | Rating | Notes |
|
||
|--------|--------|-------|
|
||
| **Familiarity** | ✅ High | Team knows Prefect well |
|
||
| **Functionality** | ✅ Good | Workflows execute successfully |
|
||
| **Module System** | ✅ Excellent | BaseModule interface is solid |
|
||
| **Documentation** | ✅ Good | Internal docs exist |
|
||
|
||
### Pain Points
|
||
|
||
| Issue | Impact | Frequency | Severity |
|
||
|-------|--------|-----------|----------|
|
||
| **Infrastructure Complexity** | Managing 6 services | Continuous | High |
|
||
| **Registry Overhead** | Push/pull for every deployment | Every change | Medium |
|
||
| **Unclear Scaling** | How to go multi-host? | Future planning | High |
|
||
| **Resource Usage** | ~8GB under load | Continuous | Medium |
|
||
| **Volume Mounting** | Complex job_variables config | Every workflow | Medium |
|
||
|
||
### Why Change Now?
|
||
|
||
1. **Planning for Scale:** Need clear path from 1 host → multi-host → cluster
|
||
2. **Infrastructure Debt:** 6 services growing harder to maintain
|
||
3. **Better Options Available:** Temporal provides simpler, more scalable solution
|
||
4. **Module System Stable:** Can migrate orchestration without touching modules
|
||
5. **Right Time:** Before production scale makes migration harder
|
||
|
||
---
|
||
|
||
## Proposed Solution: Temporal Migration
|
||
|
||
### Target Architecture
|
||
|
||
```
|
||
Infrastructure:
|
||
├─ Temporal Server (orchestration + storage)
|
||
└─ Worker Pools (3 types, auto-discover modules)
|
||
|
||
Total: 1 service (+ workers)
|
||
```
|
||
|
||
### Migration Phases
|
||
|
||
**Phase 1: Single Host (Weeks 1-8)**
|
||
- Replace Prefect with Temporal
|
||
- Long-lived worker pools
|
||
- LocalVolumeStorage (volume mounts)
|
||
- Capacity: 15-50 concurrent workflows
|
||
|
||
**Phase 2: Multi-Host (Months 6-18)**
|
||
- Same architecture, multiple hosts
|
||
- Switch to S3CachedStorage
|
||
- Capacity: 3× Phase 1
|
||
|
||
**Phase 3: Nomad Cluster (Months 18+, if needed)**
|
||
- Add Nomad for advanced orchestration
|
||
- Auto-scaling, multi-tenancy
|
||
- Capacity: Unlimited horizontal scaling
|
||
|
||
---
|
||
|
||
## For & Against: Temporal vs Prefect
|
||
|
||
### Option A: Keep Prefect (Status Quo)
|
||
|
||
#### ✅ For (Arguments to Keep Prefect)
|
||
|
||
1. **No Migration Effort**
|
||
- Zero weeks of migration work
|
||
- No learning curve
|
||
- No risk of migration issues
|
||
|
||
2. **Team Familiarity**
|
||
- Team knows Prefect well
|
||
- Existing operational runbooks
|
||
- Established debugging patterns
|
||
|
||
3. **Working System**
|
||
- Current workflows function correctly
|
||
- No immediate technical blocker
|
||
- "If it ain't broke, don't fix it"
|
||
|
||
4. **Deferred Complexity**
|
||
- Can delay architecture decisions
|
||
- Focus on feature development
|
||
- Postpone infrastructure changes
|
||
|
||
#### ❌ Against (Arguments Against Keeping Prefect)
|
||
|
||
1. **Infrastructure Complexity**
|
||
- 6 services to manage and monitor
|
||
- Complex dependencies (Postgres, Redis, Registry)
|
||
- High operational overhead
|
||
|
||
2. **Scaling Uncertainty**
|
||
- Unclear how to scale beyond single host
|
||
- Registry becomes bottleneck at scale
|
||
- No clear multi-host story
|
||
|
||
3. **Resource Inefficiency**
|
||
- ~2GB idle, ~8GB under load
|
||
- Registry storage overhead
|
||
- Redundant service layers
|
||
|
||
4. **Technical Debt Accumulation**
|
||
- Complexity will only increase
|
||
- Harder to migrate later (more workflows)
|
||
- Missing modern features (durable execution)
|
||
|
||
5. **Prefect Ecosystem Concerns**
|
||
- Prefect 3.x changes from 2.x
|
||
- Community split (Cloud vs self-hosted)
|
||
- Uncertain long-term roadmap
|
||
|
||
### Option B: Migrate to Temporal (Recommended)
|
||
|
||
#### ✅ For (Arguments to Migrate)
|
||
|
||
1. **Dramatic Simplification**
|
||
- 6 services → 1 service (83% reduction)
|
||
- No registry needed (local images)
|
||
- Simpler volume mounting
|
||
|
||
2. **Better Reliability**
|
||
- Durable execution (workflows survive crashes)
|
||
- Built-in state persistence
|
||
- Proven at massive scale (Netflix, Uber, Snap)
|
||
|
||
3. **Clear Scaling Path**
|
||
- Single host → Multi-host → Nomad cluster
|
||
- Architecture designed for scale
|
||
- Storage abstraction enables seamless transition
|
||
|
||
4. **Superior Workflow Engine**
|
||
- True durable execution vs task queue
|
||
- Better state management
|
||
- Handles long-running workflows (fuzzing campaigns)
|
||
- Activity timeouts and retries built-in
|
||
|
||
5. **Operational Benefits**
|
||
- Better Web UI for debugging
|
||
- Comprehensive workflow history
|
||
- Query workflow state at any time
|
||
- Simpler deployment (single service)
|
||
|
||
6. **Future-Proof Architecture**
|
||
- Easy Nomad migration path (18+ months)
|
||
- Multi-tenancy ready (namespaces)
|
||
- Auto-scaling capable
|
||
- Industry momentum (growing adoption)
|
||
|
||
7. **Module Preservation**
|
||
- Zero changes to BaseModule interface
|
||
- Module discovery unchanged
|
||
- Workflows adapt easily (@flow → @workflow)
|
||
|
||
8. **Resource Efficiency**
|
||
- ~1GB idle, ~4.5GB under load
|
||
- 44% reduction in resource usage
|
||
- No registry storage overhead
|
||
|
||
#### ❌ Against (Arguments Against Migration)
|
||
|
||
1. **Migration Effort**
|
||
- 8 weeks of focused work
|
||
- Team capacity diverted from features
|
||
- Testing and validation required
|
||
|
||
2. **Learning Curve**
|
||
- New concepts (workflows vs activities)
|
||
- Different debugging approach
|
||
- Team training needed
|
||
|
||
3. **Migration Risk**
|
||
- Potential for workflow disruption
|
||
- Bugs in migration code
|
||
- Temporary performance issues
|
||
|
||
4. **Unknown Unknowns**
|
||
- May discover edge cases
|
||
- Performance characteristics differ
|
||
- Integration challenges possible
|
||
|
||
5. **Temporal Limitations**
|
||
- Less mature than Prefect in some areas
|
||
- Smaller community (growing)
|
||
- Fewer pre-built integrations
|
||
|
||
### Scoring Matrix
|
||
|
||
| Criteria | Weight | Prefect | Temporal | Winner |
|
||
|----------|--------|---------|----------|--------|
|
||
| **Infrastructure Complexity** | 25% | 3/10 | 9/10 | Temporal |
|
||
| **Scalability** | 20% | 4/10 | 9/10 | Temporal |
|
||
| **Reliability** | 20% | 7/10 | 10/10 | Temporal |
|
||
| **Migration Effort** | 15% | 10/10 | 4/10 | Prefect |
|
||
| **Team Familiarity** | 10% | 9/10 | 3/10 | Prefect |
|
||
| **Resource Efficiency** | 10% | 5/10 | 8/10 | Temporal |
|
||
| **Total** | 100% | **5.5/10** | **7.65/10** | **Temporal** |
|
||
|
||
**Conclusion:** Temporal wins on technical merit despite migration costs.
|
||
|
||
---
|
||
|
||
## For & Against: Long-Lived vs Ephemeral Workers
|
||
|
||
### Context
|
||
|
||
Workers can spawn ephemeral containers per workflow (like Prefect) or run as long-lived containers processing multiple workflows.
|
||
|
||
### Option A: Ephemeral Containers
|
||
|
||
#### ✅ For
|
||
|
||
1. **Complete Isolation**
|
||
- Each workflow in fresh container
|
||
- No state leakage between workflows
|
||
- Maximum security
|
||
|
||
2. **Automatic Cleanup**
|
||
- Containers destroyed after workflow
|
||
- No resource leaks
|
||
- Clean slate every time
|
||
|
||
3. **Matches Current Behavior**
|
||
- Similar to Prefect approach
|
||
- Easier mental model
|
||
- Less architecture change
|
||
|
||
4. **Simple Development**
|
||
- Test with `docker run`
|
||
- No complex lifecycle management
|
||
- Easy to debug
|
||
|
||
#### ❌ Against
|
||
|
||
1. **Performance Overhead**
|
||
- 5 second startup per container
|
||
- At 450 workflows/hour: 625 minutes wasted
|
||
- Unacceptable at production scale
|
||
|
||
2. **Resource Churn**
|
||
- Constant container creation/destruction
|
||
- Docker daemon overhead
|
||
- Network/volume setup repeated
|
||
|
||
3. **Scaling Limitations**
|
||
- Can't handle high-volume workloads
|
||
- Startup overhead compounds
|
||
- Poor resource utilization
|
||
|
||
### Option B: Long-Lived Workers (Recommended)
|
||
|
||
#### ✅ For
|
||
|
||
1. **Zero Startup Overhead**
|
||
- Containers already running
|
||
- Immediate workflow execution
|
||
- Critical for high-volume production
|
||
|
||
2. **Resource Efficiency**
|
||
- Fixed 4.5GB RAM handles 15 concurrent workflows
|
||
- vs ~76GB for ephemeral approach
|
||
- 10-20× better resource utilization
|
||
|
||
3. **Predictable Performance**
|
||
- Consistent response times
|
||
- No container startup jitter
|
||
- Better SLA capability
|
||
|
||
4. **Horizontal Scaling**
|
||
- Add more workers linearly
|
||
- Each worker handles N concurrent
|
||
- Clear capacity planning
|
||
|
||
5. **Production-Ready**
|
||
- Proven pattern (Uber, Airbnb)
|
||
- Handles thousands of workflows/day
|
||
- Industry standard for scale
|
||
|
||
#### ❌ Against
|
||
|
||
1. **Volume Mounting Complexity**
|
||
- Must mount parent directories
|
||
- Or implement S3 storage backend
|
||
- More sophisticated configuration
|
||
|
||
2. **Shared Container State**
|
||
- Workers reused across workflows
|
||
- Potential for subtle bugs
|
||
- Requires careful module design
|
||
|
||
3. **Lifecycle Management**
|
||
- Must handle worker restarts
|
||
- Graceful shutdown needed
|
||
- More complex monitoring
|
||
|
||
4. **Memory Management**
|
||
- Workers accumulate memory over time
|
||
- Need periodic restarts
|
||
- Requires memory limits
|
||
|
||
### Decision Matrix
|
||
|
||
| Scenario | Ephemeral | Long-Lived | Winner |
|
||
|----------|-----------|------------|--------|
|
||
| **Development** | ✅ Simpler | ⚠️ Complex | Ephemeral |
|
||
| **Low Volume (<10/hour)** | ✅ Acceptable | ✅ Overkill | Ephemeral |
|
||
| **Medium Volume (10-100/hour)** | ⚠️ Wasteful | ✅ Efficient | Long-Lived |
|
||
| **High Volume (>100/hour)** | ❌ Unusable | ✅ Required | Long-Lived |
|
||
| **Production Scale** | ❌ No | ✅ Yes | Long-Lived |
|
||
|
||
**Recommendation:** Long-lived workers for production deployment.
|
||
|
||
**Compromise:** Can start with ephemeral for Phase 1 (proof of concept), migrate to long-lived for Phase 2 (production).
|
||
|
||
---
|
||
|
||
## Future Consideration: Nomad vs Kubernetes vs Docker Compose
|
||
|
||
### When to Consider Orchestration Beyond Docker Compose?
|
||
|
||
**Trigger Points:**
|
||
- ✅ Managing 10+ hosts manually
|
||
- ✅ Need multi-tenancy (customer isolation)
|
||
- ✅ Require auto-scaling based on metrics
|
||
- ✅ Want sophisticated scheduling (bin-packing, constraints)
|
||
|
||
**Timeline Estimate:** 18-24 months from now
|
||
|
||
### Option A: Docker Compose (Recommended for Phase 1-2)
|
||
|
||
#### ✅ For
|
||
|
||
1. **Simplicity**
|
||
- Single YAML file
|
||
- No cluster setup
|
||
- Easy to understand and debug
|
||
|
||
2. **Zero Learning Curve**
|
||
- Team already knows Docker
|
||
- Familiar commands
|
||
- Abundant documentation
|
||
|
||
3. **Sufficient for 1-5 Hosts**
|
||
- Deploy same compose file to each host
|
||
- Manual but manageable
|
||
- Works for current scale
|
||
|
||
4. **Development Friendly**
|
||
- Same config dev and prod
|
||
- Fast iteration cycle
|
||
- Easy local testing
|
||
|
||
5. **No Lock-In**
|
||
- Easy to migrate to Nomad/K8s later
|
||
- Workers portable by design
|
||
- Clean exit strategy
|
||
|
||
#### ❌ Against
|
||
|
||
1. **Manual Coordination**
|
||
- No automatic scheduling
|
||
- Manual load balancing
|
||
- No health-based rescheduling
|
||
|
||
2. **Limited Scaling**
|
||
- Practical limit ~5-10 hosts
|
||
- No auto-scaling
|
||
- Manual capacity planning
|
||
|
||
3. **No Multi-Tenancy**
|
||
- Can't isolate customers
|
||
- No resource quotas
|
||
- Shared infrastructure
|
||
|
||
4. **Basic Monitoring**
|
||
- No cluster-wide metrics
|
||
- Per-host monitoring only
|
||
- Limited observability
|
||
|
||
**Verdict:** Perfect for Phase 1 (single host) and Phase 2 (3-5 hosts). Transition to Nomad/K8s at Phase 3.
|
||
|
||
### Option B: Nomad (Recommended for Phase 3)
|
||
|
||
#### ✅ For
|
||
|
||
1. **Operational Simplicity**
|
||
- Single binary (vs K8s complexity)
|
||
- Easy to install and maintain
|
||
- Lower operational overhead
|
||
|
||
2. **Perfect Fit for Use Case**
|
||
- Batch workload focus
|
||
- Resource management built-in
|
||
- Namespace support for multi-tenancy
|
||
|
||
3. **Multi-Workload Support**
|
||
- Containers (Docker)
|
||
- VMs (QEMU)
|
||
- Bare processes
|
||
- Java JARs
|
||
- All in one scheduler
|
||
|
||
4. **Scheduling Intelligence**
|
||
- Bin-packing for efficiency
|
||
- Constraint-based placement
|
||
- Affinity/anti-affinity rules
|
||
- Resource quotas per namespace
|
||
|
||
5. **Easy Migration from Docker Compose**
|
||
- Similar concepts
|
||
- `compose-to-nomad` converter tool
|
||
- Workers unchanged
|
||
- 1-2 week migration
|
||
|
||
6. **HashiCorp Ecosystem**
|
||
- Integrates with Consul (service discovery)
|
||
- Integrates with Vault (secrets)
|
||
- Proven at scale (Cloudflare, CircleCI)
|
||
|
||
7. **Auto-Scaling**
|
||
- Built-in scaling policies
|
||
- Prometheus integration
|
||
- Queue-depth based scaling
|
||
- Horizontal scaling automatic
|
||
|
||
#### ❌ Against
|
||
|
||
1. **Learning Curve**
|
||
- HCL syntax to learn
|
||
- New concepts (allocations, deployments)
|
||
- Consul integration complexity
|
||
|
||
2. **Smaller Ecosystem**
|
||
- Fewer tools than Kubernetes
|
||
- Smaller community
|
||
- Less third-party integrations
|
||
|
||
3. **Network Isolation**
|
||
- Less sophisticated than K8s
|
||
- Requires Consul Connect for service mesh
|
||
- Weaker network policies
|
||
|
||
4. **Maturity**
|
||
- Less mature than Kubernetes
|
||
- Fewer production battle stories
|
||
- Evolving feature set
|
||
|
||
**Verdict:** Excellent choice when outgrow Docker Compose. Simpler than K8s, perfect for FuzzForge scale.
|
||
|
||
### Option C: Kubernetes
|
||
|
||
#### ✅ For
|
||
|
||
1. **Industry Standard**
|
||
- Largest ecosystem
|
||
- Most third-party integrations
|
||
- Abundant expertise available
|
||
|
||
2. **Feature Richness**
|
||
- Sophisticated networking (Network Policies)
|
||
- Advanced scheduling
|
||
- Rich operator ecosystem
|
||
- Helm charts for everything
|
||
|
||
3. **Multi-Tenancy**
|
||
- Strong namespace isolation
|
||
- RBAC fine-grained
|
||
- Network policies
|
||
- Pod Security Policies
|
||
|
||
4. **Massive Scale**
|
||
- Proven to 5,000+ nodes
|
||
- Google-scale reliability
|
||
- Battle-tested
|
||
|
||
5. **Cloud Integration**
|
||
- Native on all clouds (EKS, GKE, AKS)
|
||
- Managed offerings reduce complexity
|
||
- Auto-scaling (HPA, Cluster Autoscaler)
|
||
|
||
#### ❌ Against
|
||
|
||
1. **Operational Complexity**
|
||
- High learning curve
|
||
- Complex to set up and maintain
|
||
- Requires dedicated ops team
|
||
|
||
2. **Resource Overhead**
|
||
- Control plane resource usage
|
||
- etcd cluster management
|
||
- More moving parts
|
||
|
||
3. **Overkill for Use Case**
|
||
- FuzzForge is batch workload, not microservices
|
||
- Don't need K8s networking complexity
|
||
- Simpler alternatives sufficient
|
||
|
||
4. **Container-Only**
|
||
- Can't run VMs easily
|
||
- Can't run bare processes
|
||
- Nomad more flexible
|
||
|
||
5. **Cost**
|
||
- Higher operational cost
|
||
- More infrastructure required
|
||
- Steeper learning investment
|
||
|
||
**Verdict:** Overkill for FuzzForge. Choose only if planning 1,000+ hosts or need extensive ecosystem.
|
||
|
||
### Comparison Matrix
|
||
|
||
| Feature | Docker Compose | Nomad | Kubernetes |
|
||
|---------|---------------|-------|------------|
|
||
| **Operational Complexity** | ★☆☆☆☆ (Lowest) | ★★☆☆☆ (Low) | ★★★★☆ (High) |
|
||
| **Learning Curve** | ★☆☆☆☆ (Easy) | ★★★☆☆ (Medium) | ★★★★★ (Steep) |
|
||
| **Setup Time** | Minutes | 1 day | 1-2 weeks |
|
||
| **Best For** | 1-5 hosts | 10-500 hosts | 500+ hosts |
|
||
| **Auto-Scaling** | ❌ No | ✅ Yes | ✅ Yes |
|
||
| **Multi-Tenancy** | ❌ No | ✅ Yes (Namespaces) | ✅ Yes (Advanced) |
|
||
| **Workload Types** | Containers | Containers + VMs + Processes | Containers (mainly) |
|
||
| **Service Mesh** | ❌ No | ⚠️ Via Consul Connect | ✅ Istio/Linkerd |
|
||
| **Ecosystem Size** | Medium | Small | Huge |
|
||
| **Resource Efficiency** | High | High | Medium |
|
||
| **FuzzForge Fit** | ✅ Phase 1-2 | ✅ Phase 3+ | ⚠️ Unnecessary |
|
||
|
||
### Recommendation Timeline
|
||
|
||
```
|
||
Months 0-6: Docker Compose (Single Host)
|
||
└─ Simplest, fastest to implement
|
||
|
||
Months 6-18: Docker Compose (Multi-Host)
|
||
└─ Scale to 3-5 hosts manually
|
||
|
||
Months 18+: Nomad (if needed)
|
||
└─ Add when 10+ hosts or auto-scaling required
|
||
|
||
Never: Kubernetes
|
||
└─ Unless scale exceeds 500+ hosts
|
||
```
|
||
|
||
---
|
||
|
||
## Benefits Summary
|
||
|
||
### Infrastructure Benefits
|
||
|
||
| Metric | Current (Prefect) | Future (Temporal) | Improvement |
|
||
|--------|-------------------|-------------------|-------------|
|
||
| **Services to Manage** | 6 | 1 | 83% reduction |
|
||
| **Idle Memory Usage** | ~2GB | ~1GB | 50% reduction |
|
||
| **Load Memory Usage** | ~8GB | ~4.5GB | 44% reduction |
|
||
| **Docker Registry** | Required | Not needed | Eliminated |
|
||
| **Configuration Files** | 6 service configs | 1 config | 83% simpler |
|
||
| **Deployment Complexity** | High | Low | Significant |
|
||
|
||
### Operational Benefits
|
||
|
||
1. **Simpler Monitoring**
|
||
- 1 service vs 6
|
||
- Single Web UI (Temporal)
|
||
- Fewer alerts to configure
|
||
|
||
2. **Easier Debugging**
|
||
- Complete workflow history in Temporal
|
||
- Query workflow state at any time
|
||
- Better error visibility
|
||
|
||
3. **Faster Deployments**
|
||
- No registry push/pull
|
||
- Restart 1 service vs 6
|
||
- Quicker iteration cycles
|
||
|
||
4. **Better Reliability**
|
||
- Durable execution (workflows survive crashes)
|
||
- Automatic retries built-in
|
||
- State persistence guaranteed
|
||
|
||
5. **Clear Scaling Path**
|
||
- Phase 1: Single host (now)
|
||
- Phase 2: Multi-host (6-18 months)
|
||
- Phase 3: Nomad cluster (18+ months)
|
||
|
||
### Developer Experience Benefits
|
||
|
||
1. **Local Development**
|
||
- Simpler docker-compose
|
||
- Faster startup (fewer services)
|
||
- Easier to reason about
|
||
|
||
2. **Module Development**
|
||
- No changes to BaseModule
|
||
- Same discovery mechanism
|
||
- Same testing approach
|
||
|
||
3. **Workflow Development**
|
||
- Better debugging tools (Temporal Web UI)
|
||
- Workflow history visualization
|
||
- Easier to test retry logic
|
||
|
||
4. **Onboarding**
|
||
- 1 service to understand vs 6
|
||
- Clearer architecture
|
||
- Less to learn
|
||
|
||
---
|
||
|
||
## Risks & Mitigations
|
||
|
||
### Risk 1: Migration Introduces Bugs
|
||
|
||
**Likelihood:** Medium
|
||
**Impact:** High
|
||
**Risk Score:** 6/10
|
||
|
||
**Mitigation:**
|
||
- Phased migration (one workflow at a time)
|
||
- Parallel run (Prefect + Temporal) during transition
|
||
- Comprehensive testing before cutover
|
||
- Rollback plan documented
|
||
|
||
### Risk 2: Performance Degradation
|
||
|
||
**Likelihood:** Low
|
||
**Impact:** Medium
|
||
**Risk Score:** 3/10
|
||
|
||
**Mitigation:**
|
||
- Load testing before production
|
||
- Monitor key metrics during migration
|
||
- Temporal proven at higher scale than current
|
||
- Easy to tune worker concurrency
|
||
|
||
### Risk 3: Team Learning Curve
|
||
|
||
**Likelihood:** High
|
||
**Impact:** Low
|
||
**Risk Score:** 4/10
|
||
|
||
**Mitigation:**
|
||
- Training sessions on Temporal concepts
|
||
- Pair programming during migration
|
||
- Comprehensive documentation
|
||
- Temporal has excellent docs
|
||
|
||
### Risk 4: Unknown Edge Cases
|
||
|
||
**Likelihood:** Medium
|
||
**Impact:** Medium
|
||
**Risk Score:** 5/10
|
||
|
||
**Mitigation:**
|
||
- Thorough testing with real workflows
|
||
- Gradual rollout (dev → staging → production)
|
||
- Keep Prefect running initially
|
||
- Community support available
|
||
|
||
### Risk 5: Module System Incompatibility
|
||
|
||
**Likelihood:** Very Low
|
||
**Impact:** High
|
||
**Risk Score:** 2/10
|
||
|
||
**Mitigation:**
|
||
- Module interface preserved (BaseModule unchanged)
|
||
- Only orchestration changes
|
||
- Modules are decoupled from Prefect
|
||
- Test suite validates module behavior
|
||
|
||
### Risk 6: Long-Lived Worker Stability
|
||
|
||
**Likelihood:** Low
|
||
**Impact:** Medium
|
||
**Risk Score:** 3/10
|
||
|
||
**Mitigation:**
|
||
- Proper resource limits (memory, CPU)
|
||
- Periodic worker restarts (daily)
|
||
- Monitoring for memory leaks
|
||
- Health checks and auto-restart
|
||
|
||
### Overall Risk Assessment
|
||
|
||
**Total Risk Score:** 23/60 (38%) - **Medium-Low Risk**
|
||
|
||
**Conclusion:** Risks are manageable with proper planning and mitigation strategies.
|
||
|
||
---
|
||
|
||
## Cost Analysis
|
||
|
||
### Current Costs (Prefect)
|
||
|
||
**Infrastructure:**
|
||
```
|
||
Single Host (8GB RAM, 4 CPU):
|
||
- Cloud VM: $80-120/month
|
||
- Or bare metal amortized: ~$50/month
|
||
|
||
Services Running:
|
||
- Prefect Server: ~500MB
|
||
- Postgres: ~200MB
|
||
- Redis: ~100MB
|
||
- Registry: ~500MB
|
||
- Docker Proxy: ~50MB
|
||
- Worker: ~500MB
|
||
- Workflows: ~6GB (peak)
|
||
Total: ~8GB
|
||
|
||
Development Time:
|
||
- Maintenance: ~2 hours/week
|
||
- Debugging: ~3 hours/week
|
||
- Deployments: ~1 hour/week
|
||
Total: 6 hours/week = $600/month (at $25/hour)
|
||
```
|
||
|
||
**Monthly Total:** ~$700/month
|
||
|
||
### Future Costs (Temporal)
|
||
|
||
**Phase 1 - Single Host:**
|
||
```
|
||
Single Host (6GB RAM, 4 CPU):
|
||
- Cloud VM: $60-80/month
|
||
- Or bare metal amortized: ~$40/month
|
||
|
||
Services Running:
|
||
- Temporal: ~1GB
|
||
- Workers: ~3.5GB
|
||
- Workflows: ~1GB (peak)
|
||
Total: ~5.5GB
|
||
|
||
Development Time:
|
||
- Maintenance: ~1 hour/week
|
||
- Debugging: ~2 hours/week
|
||
- Deployments: ~0.5 hour/week
|
||
Total: 3.5 hours/week = $350/month
|
||
```
|
||
|
||
**Monthly Total:** ~$430/month
|
||
|
||
**Phase 2 - Multi-Host (3 hosts):**
|
||
```
|
||
3 Hosts + S3 Storage:
|
||
- Cloud VMs: $180-240/month
|
||
- S3 storage (1TB): ~$23/month
|
||
- S3 transfer (100GB): ~$9/month
|
||
|
||
Development Time:
|
||
- Maintenance: ~2 hours/week
|
||
- Monitoring: ~2 hours/week
|
||
Total: 4 hours/week = $400/month
|
||
```
|
||
|
||
**Monthly Total:** ~$670/month (3× capacity)
|
||
|
||
**Phase 3 - Nomad Cluster (10+ hosts):**
|
||
```
|
||
Nomad Cluster:
|
||
- 3 Nomad servers: $120/month
|
||
- 10 worker hosts: $800/month
|
||
- S3 storage (5TB): ~$115/month
|
||
- Load balancer: ~$20/month
|
||
|
||
Development Time:
|
||
- Nomad maintenance: ~3 hours/week
|
||
- Monitoring: ~3 hours/week
|
||
Total: 6 hours/week = $600/month
|
||
```
|
||
|
||
**Monthly Total:** ~$1,655/month (10× capacity)
|
||
|
||
### Cost Comparison
|
||
|
||
| Phase | Hosts | Capacity | Monthly Cost | Cost per Workflow |
|
||
|-------|-------|----------|--------------|-------------------|
|
||
| **Current (Prefect)** | 1 | 10K/day | $700 | $0.0023 |
|
||
| **Phase 1 (Temporal)** | 1 | 10K/day | $430 | $0.0014 |
|
||
| **Phase 2 (Temporal)** | 3 | 30K/day | $670 | $0.0007 |
|
||
| **Phase 3 (Nomad)** | 10 | 100K/day | $1,655 | $0.0005 |
|
||
|
||
**Savings:**
|
||
- Phase 1 vs Current: **$270/month (39% reduction)**
|
||
- Better cost efficiency as scale increases
|
||
|
||
---
|
||
|
||
## Timeline & Effort
|
||
|
||
### Phase 1: Temporal Migration (8 Weeks)
|
||
|
||
**Week 1-2: Foundation**
|
||
- Deploy Temporal server
|
||
- Remove Prefect infrastructure
|
||
- Implement storage abstraction layer
|
||
- Effort: 60-80 hours
|
||
|
||
**Week 3-4: Workers**
|
||
- Create long-lived worker pools
|
||
- Implement module auto-discovery
|
||
- Configure Docker Compose
|
||
- Effort: 60-80 hours
|
||
|
||
**Week 5-6: Workflows**
|
||
- Migrate workflows to Temporal
|
||
- Convert @flow → @workflow.defn
|
||
- Test all workflows
|
||
- Effort: 60-80 hours
|
||
|
||
**Week 7: Integration**
|
||
- Update backend API
|
||
- End-to-end testing
|
||
- Load testing
|
||
- Effort: 40-60 hours
|
||
|
||
**Week 8: Documentation & Cleanup**
|
||
- Update documentation
|
||
- Remove old code
|
||
- Training sessions
|
||
- Effort: 30-40 hours
|
||
|
||
**Total Effort:** 250-340 hours (~2 engineers for 2 months)
|
||
|
||
### Phase 2: Multi-Host (When Needed)
|
||
|
||
**Effort:** 40-60 hours
|
||
- Set up S3 storage
|
||
- Deploy to multiple hosts
|
||
- Configure load balancing
|
||
- Test and validate
|
||
|
||
### Phase 3: Nomad (If Needed)
|
||
|
||
**Effort:** 80-120 hours
|
||
- Install Nomad cluster
|
||
- Convert jobs to Nomad
|
||
- Set up auto-scaling
|
||
- Production deployment
|
||
|
||
---
|
||
|
||
## Licensing Considerations
|
||
|
||
### Overview
|
||
|
||
**Critical Context:** FuzzForge is a **generic platform** where modules and workflows "could be anything" - not limited to fuzzing or security analysis. This significantly impacts the licensing assessment, particularly for Nomad's Business Source License.
|
||
|
||
### Temporal Licensing: ✅ SAFE
|
||
|
||
**License:** MIT License
|
||
|
||
**Status:** Fully open source, zero restrictions
|
||
|
||
**Commercial Use:**
|
||
- ✅ Use in production
|
||
- ✅ Sell services built on Temporal
|
||
- ✅ Modify source code
|
||
- ✅ Redistribute
|
||
- ✅ Sublicense
|
||
- ✅ Private use
|
||
|
||
**Conclusion:** Temporal has **no licensing concerns** for any use case. You can build any type of platform (fuzzing, security, generic workflows, orchestration-as-a-service) without legal risk.
|
||
|
||
**Reference:** https://github.com/temporalio/temporal/blob/master/LICENSE
|
||
|
||
---
|
||
|
||
### Nomad Licensing: ⚠️ REQUIRES CAREFUL EVALUATION
|
||
|
||
**License:** Business Source License 1.1 (BSL 1.1)
|
||
|
||
**Status:** Source-available but with restrictions
|
||
|
||
#### BSL 1.1 Key Terms
|
||
|
||
**Change Date:** 4 years after each version release
|
||
**Change License:** Mozilla Public License 2.0 (MPL 2.0)
|
||
|
||
**After 4 years:** Each version becomes fully open source under MPL 2.0
|
||
|
||
#### The Critical Restriction
|
||
|
||
```
|
||
Additional Use Grant:
|
||
You may make use of the Licensed Work, provided that you do not use
|
||
the Licensed Work for a Competitive Offering.
|
||
|
||
A "Competitive Offering" is a commercial product or service that is:
|
||
1. Substantially similar to the capabilities of the Licensed Work
|
||
2. Offered to third parties on a paid or free basis
|
||
```
|
||
|
||
#### What This Means for FuzzForge
|
||
|
||
**The licensing risk depends on how FuzzForge is marketed and positioned:**
|
||
|
||
##### ✅ LIKELY SAFE: Specific Use Case Platform
|
||
|
||
If FuzzForge is marketed as a **specialized platform** for specific domains:
|
||
|
||
**Examples:**
|
||
- ✅ "FuzzForge - Security Analysis Platform"
|
||
- ✅ "FuzzForge - Automated Fuzzing Service"
|
||
- ✅ "FuzzForge - Code Analysis Tooling"
|
||
- ✅ "FuzzForge - Vulnerability Assessment Platform"
|
||
|
||
**Why Safe:**
|
||
- Nomad is used **internally** for infrastructure
|
||
- Customer is buying **fuzzing/security services**, not orchestration
|
||
- Platform's value is the **domain expertise**, not the scheduler
|
||
- Not competing with HashiCorp's offerings
|
||
|
||
##### ⚠️ GRAY AREA: Generic Workflow Platform
|
||
|
||
If FuzzForge pivots to emphasize **generic workflow capabilities**:
|
||
|
||
**Examples:**
|
||
- ⚠️ "FuzzForge - Workflow Orchestration Platform"
|
||
- ⚠️ "FuzzForge - Run any containerized workload"
|
||
- ⚠️ "FuzzForge - Generic task scheduler"
|
||
- ⚠️ Marketing that emphasizes "powered by Nomad"
|
||
|
||
**Why Risky:**
|
||
- Could be seen as competing with Nomad Enterprise
|
||
- Offering similar capabilities to HashiCorp's products
|
||
- Customer might use it as Nomad replacement
|
||
|
||
##### ❌ CLEARLY VIOLATES: Orchestration-as-a-Service
|
||
|
||
If FuzzForge becomes primarily an **orchestration product**:
|
||
|
||
**Examples:**
|
||
- ❌ "FuzzForge Orchestrator - Schedule any workload"
|
||
- ❌ "Nomad-as-a-Service powered by FuzzForge"
|
||
- ❌ "Generic container orchestration platform"
|
||
- ❌ Reselling Nomad capabilities with thin wrapper
|
||
|
||
**Why Violation:**
|
||
- Directly competing with HashiCorp Nomad offerings
|
||
- "Substantially similar" to Nomad's capabilities
|
||
- Commercial offering of orchestration
|
||
|
||
#### Real-World Precedents
|
||
|
||
**HashiCorp has NOT** (as of 2025) aggressively enforced BSL against companies using their tools internally. The restriction targets:
|
||
- Cloud providers offering "managed Nomad" services
|
||
- Companies building Nomad competitors
|
||
- Vendors reselling HashiCorp functionality
|
||
|
||
**NOT targeting:**
|
||
- Companies using Nomad for internal infrastructure
|
||
- SaaS platforms that happen to use Nomad
|
||
- Domain-specific platforms (like FuzzForge's security focus)
|
||
|
||
#### Decision Tree: Should I Use Nomad?
|
||
|
||
```
|
||
┌─────────────────────────────────────┐
|
||
│ Is orchestration your core product? │
|
||
└─────────────────────────────────────┘
|
||
│
|
||
┌────────┴────────┐
|
||
│ │
|
||
YES NO
|
||
│ │
|
||
┌────┴────┐ ┌────┴────┐
|
||
│ DON'T │ │ What's │
|
||
│ USE │ │ your │
|
||
│ NOMAD │ │ value │
|
||
│ │ │ prop? │
|
||
└─────────┘ └─────┬────┘
|
||
│
|
||
┌───────────┴───────────┐
|
||
│ │
|
||
Domain Expertise Orchestration Features
|
||
(Fuzzing, Security) (Scheduling, Auto-scale)
|
||
│ │
|
||
┌────┴────┐ ┌────┴────┐
|
||
│ SAFE TO │ │ RISKY - │
|
||
│ USE │ │ CONSULT │
|
||
│ NOMAD │ │ LAWYER │
|
||
└─────────┘ └─────────┘
|
||
```
|
||
|
||
#### FuzzForge Current Position
|
||
|
||
**Current Positioning:** Domain-specific security/analysis platform
|
||
**Nomad Usage:** Internal infrastructure (not customer-facing)
|
||
**Risk Level:** **LOW** (likely safe)
|
||
|
||
**However**, user stated: _"modules and workflows could be anything"_ - this suggests potential future expansion beyond security domain.
|
||
|
||
**If FuzzForge pivots to generic platform:**
|
||
- Risk increases from LOW → MEDIUM
|
||
- Need legal review before Phase 3 (Nomad migration)
|
||
- Consider Kubernetes as alternative
|
||
|
||
---
|
||
|
||
### Kubernetes Licensing: ✅ SAFE
|
||
|
||
**License:** Apache License 2.0
|
||
|
||
**Status:** Fully open source, zero restrictions
|
||
|
||
**Commercial Use:**
|
||
- ✅ Use in production
|
||
- ✅ Sell services built on Kubernetes
|
||
- ✅ Modify source code
|
||
- ✅ Offer managed Kubernetes (AWS EKS, GCP GKE do this)
|
||
- ✅ Build competitive offerings
|
||
|
||
**Conclusion:** Kubernetes has **no licensing concerns** whatsoever, even for orchestration-as-a-service offerings.
|
||
|
||
---
|
||
|
||
### Docker Licensing: ✅ SAFE
|
||
|
||
**License:** Apache License 2.0
|
||
|
||
**Status:** Fully open source
|
||
|
||
**Note:** Docker Desktop has separate commercial licensing requirements for organizations >250 employees or >$10M revenue, but Docker Engine (which FuzzForge uses) remains free for all uses.
|
||
|
||
---
|
||
|
||
### Licensing Recommendation Matrix
|
||
|
||
| Component | License | FuzzForge Risk | Recommendation |
|
||
|-----------|---------|----------------|----------------|
|
||
| **Temporal** | MIT | ✅ None | Use freely |
|
||
| **Docker Engine** | Apache 2.0 | ✅ None | Use freely |
|
||
| **Nomad** | BSL 1.1 | ⚠️ Low-Medium | Safe if domain-specific |
|
||
| **Kubernetes** | Apache 2.0 | ✅ None | Safe alternative to Nomad |
|
||
|
||
---
|
||
|
||
### Recommendations by Phase
|
||
|
||
#### Phase 1 & 2: Temporal + Docker Compose
|
||
|
||
**Licenses:** MIT (Temporal) + Apache 2.0 (Docker)
|
||
**Risk:** ✅ **ZERO** - Fully safe for any use case
|
||
|
||
**Action:** Proceed without legal review required
|
||
|
||
---
|
||
|
||
#### Phase 3: Adding Nomad (18+ months)
|
||
|
||
**License:** BSL 1.1
|
||
**Risk:** ⚠️ **LOW-MEDIUM** - Depends on positioning
|
||
|
||
**Action Required BEFORE Migration:**
|
||
|
||
1. **Clarify Product Positioning**
|
||
- Will FuzzForge market as generic platform?
|
||
- Or remain domain-specific (security/fuzzing)?
|
||
|
||
2. **Legal Review** (Recommended)
|
||
- Consult IP lawyer familiar with BSL
|
||
- Show marketing materials, website copy
|
||
- Get written opinion on BSL compliance
|
||
- Cost: $2,000-5,000 (one-time)
|
||
|
||
3. **Decision Point:**
|
||
```
|
||
IF positioning = domain-specific (security/fuzzing)
|
||
THEN proceed with Nomad (low risk)
|
||
|
||
ELSE IF positioning = generic platform
|
||
THEN consider Kubernetes instead (zero risk)
|
||
```
|
||
|
||
---
|
||
|
||
#### Alternative: Use Kubernetes Instead of Nomad
|
||
|
||
**If concerned about Nomad BSL risk:**
|
||
|
||
**Pros:**
|
||
- ✅ Zero licensing risk (Apache 2.0)
|
||
- ✅ Can offer orchestration-as-a-service freely
|
||
- ✅ Larger ecosystem and community
|
||
- ✅ Managed offerings on all clouds
|
||
|
||
**Cons:**
|
||
- ❌ Higher operational complexity than Nomad
|
||
- ❌ Overkill for batch workload use case
|
||
- ❌ Steeper learning curve
|
||
|
||
**When to Choose K8s Over Nomad:**
|
||
- Planning to market as generic platform
|
||
- Uncomfortable with BSL restrictions
|
||
- Need absolute licensing certainty
|
||
- Have K8s expertise already
|
||
|
||
---
|
||
|
||
### Licensing Risk Summary
|
||
|
||
| Scenario | Temporal | Docker | Nomad | Kubernetes |
|
||
|----------|----------|--------|-------|------------|
|
||
| **Security platform (current)** | ✅ Safe | ✅ Safe | ✅ Safe | ✅ Safe |
|
||
| **Generic workflow platform** | ✅ Safe | ✅ Safe | ⚠️ Risky | ✅ Safe |
|
||
| **Orchestration-as-a-service** | ✅ Safe | ✅ Safe | ❌ Violation | ✅ Safe |
|
||
|
||
---
|
||
|
||
### Key Takeaways
|
||
|
||
1. **Temporal is completely safe** - MIT license has zero restrictions for any use case
|
||
|
||
2. **Nomad's BSL depends on positioning**:
|
||
- ✅ Safe for domain-specific platforms (security, fuzzing)
|
||
- ⚠️ Risky for generic workflow platforms
|
||
- ❌ Violation for orchestration-as-a-service
|
||
|
||
3. **User's statement matters**: _"modules could be anything"_ suggests generic platform potential → increases Nomad risk
|
||
|
||
4. **Mitigation strategies**:
|
||
- Keep marketing focused on domain expertise
|
||
- Get legal review before Phase 3 (Nomad)
|
||
- Alternative: Use Kubernetes (Apache 2.0) instead
|
||
|
||
5. **Decision timing**: No urgency - Nomad decision is 18+ months away (Phase 3)
|
||
|
||
6. **Recommended approach**:
|
||
```
|
||
Now → Phase 1-2: Temporal + Docker Compose (zero risk)
|
||
18 months → Phase 3: Re-evaluate positioning
|
||
→ Domain-specific? Use Nomad
|
||
→ Generic platform? Use Kubernetes
|
||
```
|
||
|
||
---
|
||
|
||
## Recommendation
|
||
|
||
### Primary Recommendation: **PROCEED WITH TEMPORAL MIGRATION**
|
||
|
||
**Confidence Level:** High (8/10)
|
||
|
||
### Rationale
|
||
|
||
1. **Technical Benefits Outweigh Costs**
|
||
- 83% infrastructure reduction
|
||
- 44% resource savings
|
||
- Clear scaling path
|
||
- Better reliability
|
||
|
||
2. **Manageable Risks**
|
||
- Low-medium risk profile
|
||
- Good mitigation strategies
|
||
- Rollback plan exists
|
||
- Module system preserved
|
||
|
||
3. **Right Timing**
|
||
- Before production scale makes migration harder
|
||
- Team capacity available
|
||
- Module architecture stable
|
||
- Clear 8-week timeline
|
||
|
||
4. **Future-Proof**
|
||
- Easy Nomad migration when needed
|
||
- Multi-host ready (storage abstraction)
|
||
- Industry-proven technology
|
||
- Growing ecosystem
|
||
|
||
### Phased Approach
|
||
|
||
**Immediate (Now):**
|
||
- ✅ Approve Temporal migration
|
||
- ✅ Allocate 2 engineers for 8 weeks
|
||
- ✅ Set Week 1 start date
|
||
|
||
**Near-Term (Months 1-6):**
|
||
- ✅ Complete Temporal migration
|
||
- ✅ Validate in production
|
||
- ✅ Optimize performance
|
||
|
||
**Mid-Term (Months 6-18):**
|
||
- ⏳ Monitor scaling needs
|
||
- ⏳ Implement S3 storage if needed
|
||
- ⏳ Expand to multi-host if needed
|
||
|
||
**Long-Term (Months 18+):**
|
||
- ⏳ Evaluate Nomad necessity
|
||
- ⏳ Migrate to Nomad if triggers met
|
||
- ⏳ Continue scaling horizontally
|
||
|
||
### Decision Criteria
|
||
|
||
**Proceed with Migration if:**
|
||
- ✅ Team agrees on benefits (CHECK)
|
||
- ✅ 8-week timeline acceptable (CHECK)
|
||
- ✅ Resources available (CHECK)
|
||
- ✅ Risk profile acceptable (CHECK)
|
||
|
||
**Defer Migration if:**
|
||
- ❌ Critical features launching soon (DEPENDS)
|
||
- ❌ Team capacity constrained (DEPENDS)
|
||
- ❌ Major Prefect improvements announced (UNLIKELY)
|
||
|
||
### Alternative: Start Smaller
|
||
|
||
**If full migration seems risky:**
|
||
|
||
1. **Proof of Concept (2 weeks)**
|
||
- Migrate one simple workflow
|
||
- Validate Temporal locally
|
||
- Assess complexity
|
||
- Decision point: Continue or abort
|
||
|
||
2. **Parallel Run (4 weeks)**
|
||
- Run Temporal alongside Prefect
|
||
- Duplicate one workflow
|
||
- Compare results
|
||
- Build confidence
|
||
|
||
3. **Full Migration (6 weeks)**
|
||
- If POC successful, proceed
|
||
- Migrate remaining workflows
|
||
- Decommission Prefect
|
||
|
||
**Total:** 12 weeks (vs 8 weeks direct)
|
||
|
||
---
|
||
|
||
## Appendix: Quick Reference
|
||
|
||
### One-Page Summary
|
||
|
||
**WHAT:** Migrate from Prefect to Temporal
|
||
**WHY:** Simpler (6 services → 1), more scalable, better reliability
|
||
**WHEN:** Now (8 weeks)
|
||
**WHO:** 2 engineers
|
||
**COST:** $430/month (vs $700 current) = 39% savings
|
||
**RISK:** Medium-Low (manageable)
|
||
**OUTCOME:** Production-ready infrastructure with clear scaling path
|
||
|
||
### Key Metrics
|
||
|
||
| Metric | Current | Future | Change |
|
||
|--------|---------|--------|--------|
|
||
| Services | 6 | 1 | -83% |
|
||
| Memory | 8GB | 4.5GB | -44% |
|
||
| Cost | $700/mo | $430/mo | -39% |
|
||
| Capacity | 10K/day | 10K/day | Same (Phase 1) |
|
||
| Dev Time | 6h/week | 3.5h/week | -42% |
|
||
|
||
### Decision Checklist
|
||
|
||
- [ ] Review this document with team
|
||
- [ ] Discuss concerns and questions
|
||
- [ ] Vote: Proceed / Defer / Reject
|
||
- [ ] If proceed: Assign engineers
|
||
- [ ] If proceed: Set start date
|
||
- [ ] If defer: Set review date (3 months)
|
||
- [ ] If reject: Document reasons
|
||
|
||
---
|
||
|
||
**Document Version:** 1.0
|
||
**Last Updated:** 2025-09-30
|
||
**Next Review:** After decision or in 3 months
|