fuzzforge_ai/MIGRATION_DECISION.md
tduhamel42 60ca088ecf CI/CD Integration with Ephemeral Deployment Model (#14)
* feat: Complete migration from Prefect to Temporal

BREAKING CHANGE: Replaces Prefect workflow orchestration with Temporal

## Major Changes
- Replace Prefect with Temporal for workflow orchestration
- Implement vertical worker architecture (rust, android)
- Replace Docker registry with MinIO for unified storage
- Refactor activities to be co-located with workflows
- Update all API endpoints for Temporal compatibility

## Infrastructure
- New: docker-compose.temporal.yaml (Temporal + MinIO + workers)
- New: workers/ directory with rust and android vertical workers
- New: backend/src/temporal/ (manager, discovery)
- New: backend/src/storage/ (S3-cached storage with MinIO)
- New: backend/toolbox/common/ (shared storage activities)
- Deleted: docker-compose.yaml (old Prefect setup)
- Deleted: backend/src/core/prefect_manager.py
- Deleted: backend/src/services/prefect_stats_monitor.py
- Deleted: Docker registry and insecure-registries requirement

## Workflows
- Migrated: security_assessment workflow to Temporal
- New: rust_test workflow (example/test workflow)
- Deleted: secret_detection_scan (Prefect-based, to be reimplemented)
- Activities now co-located with workflows for independent testing

## API Changes
- Updated: backend/src/api/workflows.py (Temporal submission)
- Updated: backend/src/api/runs.py (Temporal status/results)
- Updated: backend/src/main.py (727 lines, TemporalManager integration)
- Updated: All 16 MCP tools to use TemporalManager

## Testing
- All services healthy (Temporal, PostgreSQL, MinIO, workers, backend)
- All API endpoints functional
- End-to-end workflow test passed (72 findings from vulnerable_app)
- MinIO storage integration working (target upload/download, results)
- Worker activity discovery working (6 activities registered)
- Tarball extraction working
- SARIF report generation working

## Documentation
- ARCHITECTURE.md: Complete Temporal architecture documentation
- QUICKSTART_TEMPORAL.md: Getting started guide
- MIGRATION_DECISION.md: Why we chose Temporal over Prefect
- IMPLEMENTATION_STATUS.md: Migration progress tracking
- workers/README.md: Worker development guide

## Dependencies
- Added: temporalio>=1.6.0
- Added: boto3>=1.34.0 (MinIO S3 client)
- Removed: prefect>=3.4.18

* feat: Add Python fuzzing vertical with Atheris integration

This commit implements a complete Python fuzzing workflow using Atheris:

## Python Worker (workers/python/)
- Dockerfile with Python 3.11, Atheris, and build tools
- Generic worker.py for dynamic workflow discovery
- requirements.txt with temporalio, boto3, atheris dependencies
- Added to docker-compose.temporal.yaml with dedicated cache volume

## AtherisFuzzer Module (backend/toolbox/modules/fuzzer/)
- Reusable module extending BaseModule
- Auto-discovers fuzz targets (fuzz_*.py, *_fuzz.py, fuzz_target.py)
- Recursive search to find targets in nested directories
- Dynamically loads TestOneInput() function
- Configurable max_iterations and timeout
- Real-time stats callback support for live monitoring
- Returns findings as ModuleFinding objects
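
The auto-discovery step described above can be sketched as follows. This is a minimal illustration rather than the module's actual code: the function name `discover_fuzz_targets` is hypothetical, and only the three filename patterns come from the list above.

```python
from fnmatch import fnmatch
from pathlib import Path

# Filename patterns taken from the module description above.
FUZZ_TARGET_PATTERNS = ("fuzz_*.py", "*_fuzz.py", "fuzz_target.py")


def discover_fuzz_targets(root: Path) -> list[Path]:
    """Recursively collect Python files whose names match a fuzz-target pattern.

    Hypothetical helper for illustration; the real module may differ.
    """
    matches = [
        path
        for path in root.rglob("*.py")  # rglob walks nested directories
        if path.is_file()
        and any(fnmatch(path.name, pattern) for pattern in FUZZ_TARGET_PATTERNS)
    ]
    return sorted(matches)
```

Sorting the result keeps the choice of target deterministic when several files match.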

## Atheris Fuzzing Workflow (backend/toolbox/workflows/atheris_fuzzing/)
- Temporal workflow for orchestrating fuzzing
- Downloads user code from MinIO
- Executes AtherisFuzzer module
- Uploads results to MinIO
- Cleans up cache after execution
- metadata.yaml with vertical: python for routing

## Test Project (test_projects/python_fuzz_waterfall/)
- Demonstrates stateful waterfall vulnerability
- main.py with check_secret() that leaks progress
- fuzz_target.py with Atheris TestOneInput() harness
- Complete README with usage instructions

## Backend Fixes
- Fixed parameter merging in REST API endpoints (workflows.py)
- Changed workflow parameter passing from positional args to kwargs (manager.py)
- Default parameters now properly merged with user parameters

## Testing
- Worker discovered AtherisFuzzingWorkflow
- Workflow executed end-to-end successfully
- Fuzz target auto-discovered in nested directories
- Atheris ran 100,000 iterations
- Results uploaded and cache cleaned

* chore: Complete Temporal migration with updated CLI/SDK/docs

This commit includes all remaining Temporal migration changes:

## CLI Updates (cli/)
- Updated workflow execution commands for Temporal
- Enhanced error handling and exceptions
- Updated dependencies in uv.lock

## SDK Updates (sdk/)
- Client methods updated for Temporal workflows
- Updated models for new workflow execution
- Updated dependencies in uv.lock

## Documentation Updates (docs/)
- Architecture documentation for Temporal
- Workflow concept documentation
- Resource management documentation (new)
- Debugging guide (new)
- Updated tutorials and how-to guides
- Troubleshooting updates

## README Updates
- Main README with Temporal instructions
- Backend README
- CLI README
- SDK README

## Other
- Updated IMPLEMENTATION_STATUS.md
- Removed old vulnerable_app.tar.gz

These changes complete the Temporal migration and ensure the
CLI/SDK work correctly with the new backend.

* fix: Use positional args instead of kwargs for Temporal workflows

The Temporal Python SDK's start_workflow() method doesn't accept
a 'kwargs' parameter. Workflows must receive parameters as positional
arguments via the 'args' parameter.

Changed from passing a 'kwargs' keyword argument to:
  args=workflow_args  # Positional arguments

This fixes the error:
  TypeError: Client.start_workflow() got an unexpected keyword argument 'kwargs'

Workflows now correctly receive parameters in order:
- security_assessment: [target_id, scanner_config, analyzer_config, reporter_config]
- atheris_fuzzing: [target_id, target_file, max_iterations, timeout_seconds]
- rust_test: [target_id, test_message]

* fix: Filter metadata-only parameters from workflow arguments

SecurityAssessmentWorkflow was receiving 7 arguments instead of 2-5.
The issue was that target_path and volume_mode from default_parameters
were being passed to the workflow, when they should only be used by
the system for configuration.

Now filters out metadata-only parameters (target_path, volume_mode)
before passing arguments to workflow execution.
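
A minimal sketch of the two fixes above (positional ordering and metadata filtering), assuming the parameter order is known per workflow. The helper name `build_workflow_args` and its signature are hypothetical; only the key names `target_path` and `volume_mode` and the argument orders come from the commit text.

```python
from typing import Any

# Keys used only by the platform and never forwarded to the workflow
# (named in the commit text above).
METADATA_ONLY_KEYS = {"target_path", "volume_mode"}


def build_workflow_args(
    target_id: str,
    param_order: list[str],
    defaults: dict[str, Any],
    user_params: dict[str, Any],
) -> list[Any]:
    """Merge defaults with user parameters, drop metadata-only keys, and
    return the values in the order the workflow declares them.

    Hypothetical helper for illustration only.
    """
    merged = {**defaults, **user_params}  # user values override defaults
    filtered = {k: v for k, v in merged.items() if k not in METADATA_ONLY_KEYS}
    # Missing parameters become None so positions stay aligned.
    return [target_id] + [filtered.get(name) for name in param_order]
```

The resulting list is what would be handed to `start_workflow()` through its `args` parameter, for example `["target-1", "fuzz_target.py", 100000, 60]` for atheris_fuzzing.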

* refactor: Remove Prefect leftovers and volume mounting legacy

Complete cleanup of Prefect migration artifacts:

Backend:
- Delete registry.py and workflow_discovery.py (Prefect-specific files)
- Remove Docker validation from setup.py (no longer needed)
- Remove ResourceLimits and VolumeMount models
- Remove target_path and volume_mode from WorkflowSubmission
- Remove supported_volume_modes from API and discovery
- Clean up metadata.yaml files (remove volume/path fields)
- Simplify parameter filtering in manager.py

SDK:
- Remove volume_mode parameter from client methods
- Remove ResourceLimits and VolumeMount models
- Remove Prefect error patterns from docker_logs.py
- Clean up WorkflowSubmission and WorkflowMetadata models

CLI:
- Remove Volume Modes display from workflow info

All removed features are Prefect-specific or Docker volume mounting
artifacts. Temporal workflows use MinIO storage exclusively.

* feat: Add comprehensive test suite and benchmark infrastructure

- Add 68 unit tests for fuzzer, scanner, and analyzer modules
- Implement pytest-based test infrastructure with fixtures
- Add 6 performance benchmarks with category-specific thresholds
- Configure GitHub Actions for automated testing and benchmarking
- Add test and benchmark documentation

Test coverage:
- AtherisFuzzer: 8 tests
- CargoFuzzer: 14 tests
- FileScanner: 22 tests
- SecurityAnalyzer: 24 tests

All tests passing (68/68)
All benchmarks passing (6/6)

* fix: Resolve all ruff linting violations across codebase

Fixed 27 ruff violations in 12 files:
- Removed unused imports (Depends, Dict, Any, Optional, etc.)
- Fixed undefined workflow_info variable in workflows.py
- Removed dead code with undefined variables in atheris_fuzzer.py
- Changed f-string to regular string where no placeholders used

All files now pass ruff checks for CI/CD compliance.

* fix: Configure CI for unit tests only

- Renamed docker-compose.temporal.yaml → docker-compose.yml for CI compatibility
- Commented out integration-tests job (no integration tests yet)
- Updated test-summary to only depend on lint and unit-tests

CI will now run successfully with 68 unit tests. Integration tests can be added later.

* feat: Add CI/CD integration with ephemeral deployment model

Implements comprehensive CI/CD support for FuzzForge with on-demand worker management:

**Worker Management (v0.7.0)**
- Add WorkerManager for automatic worker lifecycle control
- Auto-start workers from stopped state when workflows execute
- Auto-stop workers after workflow completion
- Health checks and startup timeout handling (90s default)

**CI/CD Features**
- `--fail-on` flag: Fail builds based on SARIF severity levels (error/warning/note/info)
- `--export-sarif` flag: Export findings in SARIF 2.1.0 format
- `--auto-start`/`--auto-stop` flags: Control worker lifecycle
- Exit code propagation: Returns 1 on blocking findings, 0 on success
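
The `--fail-on` behaviour can be illustrated with a short sketch. It assumes the flag resolves to a set of blocking levels and that findings are read from the `runs[].results[].level` fields of a SARIF document; the function name `exit_code_for` is hypothetical and not taken from the CLI.

```python
def exit_code_for(sarif: dict, fail_on: set[str]) -> int:
    """Return 1 if any SARIF result has a level listed in fail_on, else 0.

    Hypothetical helper; the real CLI may match levels differently.
    """
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Assumption: a result without an explicit level counts as "warning".
            level = result.get("level", "warning")
            if level in fail_on:
                return 1  # blocking finding: the build should fail
    return 0
```

A CI job would pass the returned value to the process exit so the pipeline step fails on blocking findings.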

**Exit Code Fix**
- Add `except typer.Exit: raise` handlers at 3 critical locations
- Move worker cleanup to finally block for guaranteed execution
- Exit codes now propagate correctly even when build fails
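
The pattern behind this fix can be sketched without third-party imports. The `Exit` class below is a stand-in for `typer.Exit`, and `run_with_cleanup` is a hypothetical name; the point is that an explicit exit must be re-raised before a broad `except Exception`, while cleanup lives in `finally`.

```python
class Exit(Exception):
    """Stand-in for typer.Exit so the sketch has no third-party dependency."""

    def __init__(self, code: int = 0) -> None:
        super().__init__(code)
        self.exit_code = code


def run_with_cleanup(execute, stop_worker) -> int:
    """Run a workflow callable, always stop the worker, and return the exit code."""
    try:
        try:
            execute()
        except Exit:
            raise  # let the requested exit code through unchanged
        except Exception as exc:
            # Any other failure is reported and mapped to exit code 1.
            print(f"workflow failed: {exc}")
            raise Exit(code=1)
        return 0
    except Exit as exit_request:
        return exit_request.exit_code
    finally:
        stop_worker()  # runs on success, failure, and explicit exit
```

Without the `except Exit: raise` clause, the broad handler would also catch the explicit exit and could replace its exit code.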

**CI Scripts & Examples**
- ci-start.sh: Start FuzzForge services with health checks
- ci-stop.sh: Clean shutdown with volume preservation option
- GitHub Actions workflow example (security-scan.yml)
- GitLab CI pipeline example (.gitlab-ci.example.yml)
- docker-compose.ci.yml: CI-optimized compose file with profiles

**OSS-Fuzz Integration**
- New ossfuzz_campaign workflow for running OSS-Fuzz projects
- OSS-Fuzz worker with Docker-in-Docker support
- Configurable campaign duration and project selection

**Documentation**
- Comprehensive CI/CD integration guide (docs/how-to/cicd-integration.md)
- Updated architecture docs with worker lifecycle details
- Updated workspace isolation documentation
- CLI README with worker management examples

**SDK Enhancements**
- Add get_workflow_worker_info() endpoint
- Worker vertical metadata in workflow responses

**Testing**
- All workflows tested: security_assessment, atheris_fuzzing, secret_detection, cargo_fuzzing
- All monitoring commands tested: stats, crashes, status, finding
- Full CI pipeline simulation verified
- Exit codes verified for success/failure scenarios

Ephemeral CI/CD model: ~3-4GB RAM, ~60-90s startup, runs entirely in CI containers.

* fix: Resolve ruff linting violations in CI/CD code

- Remove unused variables (run_id, defaults, result)
- Remove unused imports
- Fix f-string without placeholders

All CI/CD integration files now pass ruff checks.
2025-10-14 10:13:45 +02:00

# FuzzForge AI: Migration Decision Document
**Date:** 2025-10-01 (Updated)
**Status:** Architecture Revised - Ready for Implementation
**Decision Makers:** FuzzingLabs Team
**Recommendation:** Migrate to Temporal with Vertical Workers + MinIO
---
## 🔄 CRITICAL UPDATE (2025-10-01)
**Initial analysis was incomplete.** The original architecture document missed a critical requirement:
> **"Workflows are dynamic and have to be created without modifying the codebase"**
### What Changed
The original plan proposed "no registry needed" with long-lived workers, but failed to address how dynamic workflows with custom dependencies would work. This created a fundamental contradiction.
### Revised Architecture
**New approach: Vertical Workers + MinIO**
| Aspect | Original Plan | Revised Plan |
|--------|--------------|--------------|
| **Workers** | Generic long-lived | **Vertical-specific** (Android, Rust, Web, iOS, etc.) |
| **Toolchains** | Install per workflow | **Pre-built per vertical** |
| **Workflows** | Unclear | **Mounted as volume** (no rebuild) |
| **Storage** | LocalVolumeStorage (dev) / S3 (prod) | **MinIO everywhere** (unified) |
| **Target Access** | Host filesystem mounts | **Upload to MinIO** (secure) |
| **Registry** | Eliminated | **Eliminated** (workflows in volume, not images) |
| **Services** | 1 (Temporal only) | 6 (Temporal + MinIO + 3+ vertical workers) |
| **Memory** | "~4.5GB" | **~2.3GB** (realistic calculation) |
### Key Insights
1. **Dynamic workflows ARE compatible** with long-lived workers via volume mounting
2. **Verticals solve** the toolchain problem (pre-built, no per-workflow installs)
3. **MinIO is lightweight** (256MB with CI_CD=true) and provides unified storage
4. **No registry overhead** (workflow code mounted, not built into images)
5. **Better marketing** (sell "security verticals", not "orchestration platform")
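
As a rough illustration of how a workflow could be routed to its vertical worker, the sketch below maps a workflow's declared vertical to a task queue name. The `WorkflowMetadata` fields and the `<vertical>-queue` naming are assumptions made for this example, not the project's confirmed conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowMetadata:
    name: str
    vertical: str  # e.g. "rust", "android", "python"


def task_queue_for(meta: WorkflowMetadata, available: set[str]) -> str:
    """Pick the task queue of the vertical worker that should run a workflow.

    The "<vertical>-queue" naming is an assumption made for this sketch.
    """
    if meta.vertical not in available:
        raise LookupError(f"no worker registered for vertical {meta.vertical!r}")
    return f"{meta.vertical}-queue"
```

Because routing depends only on metadata, adding a workflow to an existing vertical needs no change to the worker image.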
### What This Means
- ✅ Migration still recommended
- ✅ Timeline extended to 10 weeks (from 8)
- ✅ More services but better architecture
- ✅ Addresses all original pain points
- ✅ Supports dynamic workflows correctly
**See ARCHITECTURE.md v2.0 for full details.**
---
## Table of Contents
1. [Executive Summary](#executive-summary)
2. [Current State Analysis](#current-state-analysis)
3. [Proposed Solution: Temporal Migration](#proposed-solution-temporal-migration)
4. [For & Against: Temporal vs Prefect](#for--against-temporal-vs-prefect)
5. [For & Against: Long-Lived vs Ephemeral Workers](#for--against-long-lived-vs-ephemeral-workers)
6. [Future Consideration: Nomad vs Kubernetes vs Docker Compose](#future-consideration-nomad-vs-kubernetes-vs-docker-compose)
7. [Benefits Summary](#benefits-summary)
8. [Risks & Mitigations](#risks--mitigations)
9. [Cost Analysis](#cost-analysis)
10. [Timeline & Effort](#timeline--effort)
11. [Licensing Considerations](#licensing-considerations)
12. [Recommendation](#recommendation)
---
## Executive Summary
### The Proposal
**Migrate from Prefect to Temporal** for workflow orchestration, simplifying infrastructure from 6 services to 1 while maintaining module architecture and preparing for future scale.
### Why Consider This?
The current Prefect setup has grown complex, with:
- 6 services to manage (Prefect, Postgres, Redis, Registry, Docker-proxy, Worker)
- Unclear scaling path for high-volume production
- Registry overhead for module isolation
- Complex volume mounting configuration
### Key Decision Points
| Decision | Recommendation | Timeline |
|----------|---------------|----------|
| **Replace Prefect?** | ✅ Yes - with Temporal | Now (Weeks 1-8) |
| **Worker Strategy?** | ✅ Long-lived containers | Now (Weeks 3-4) |
| **Storage Strategy?** | ✅ Abstract layer (Local→S3) | Now (Week 3) |
| **Add Nomad?** | ⏳ Later - when 10+ hosts | 18-24 months |
| **Add Kubernetes?** | ❌ No - unnecessary complexity | N/A |
### Bottom Line
**Recommended:** Proceed with Temporal migration.
- **Effort:** 8 weeks, Medium complexity
- **Risk:** Low (rollback possible, modules unchanged)
- **Benefit:** 83% infrastructure reduction, clear scaling path, better reliability
---
## Current State Analysis
### Prefect Architecture (Current)
```
Infrastructure:
├─ Prefect Server (orchestration)
├─ Postgres (metadata storage)
├─ Redis (task queue)
├─ Docker Registry (image sharing)
├─ Docker Proxy (container isolation)
└─ Prefect Worker (execution)
Total: 6 services
```
### Strengths of Current Setup
| Aspect | Rating | Notes |
|--------|--------|-------|
| **Familiarity** | ✅ High | Team knows Prefect well |
| **Functionality** | ✅ Good | Workflows execute successfully |
| **Module System** | ✅ Excellent | BaseModule interface is solid |
| **Documentation** | ✅ Good | Internal docs exist |
### Pain Points
| Issue | Impact | Frequency | Severity |
|-------|--------|-----------|----------|
| **Infrastructure Complexity** | Managing 6 services | Continuous | High |
| **Registry Overhead** | Push/pull for every deployment | Every change | Medium |
| **Unclear Scaling** | How to go multi-host? | Future planning | High |
| **Resource Usage** | ~8GB under load | Continuous | Medium |
| **Volume Mounting** | Complex job_variables config | Every workflow | Medium |
### Why Change Now?
1. **Planning for Scale:** Need clear path from 1 host → multi-host → cluster
2. **Infrastructure Debt:** 6 services growing harder to maintain
3. **Better Options Available:** Temporal provides simpler, more scalable solution
4. **Module System Stable:** Can migrate orchestration without touching modules
5. **Right Time:** Before production scale makes migration harder
---
## Proposed Solution: Temporal Migration
### Target Architecture
```
Infrastructure:
├─ Temporal Server (orchestration + storage)
└─ Worker Pools (3 types, auto-discover modules)
Total: 1 service (+ workers)
```
### Migration Phases
**Phase 1: Single Host (Weeks 1-8)**
- Replace Prefect with Temporal
- Long-lived worker pools
- LocalVolumeStorage (volume mounts)
- Capacity: 15-50 concurrent workflows
**Phase 2: Multi-Host (Months 6-18)**
- Same architecture, multiple hosts
- Switch to S3CachedStorage
- Capacity: 3× Phase 1
**Phase 3: Nomad Cluster (Months 18+, if needed)**
- Add Nomad for advanced orchestration
- Auto-scaling, multi-tenancy
- Capacity: Unlimited horizontal scaling
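
The storage switch between phases relies on workflows depending on an interface rather than a concrete backend. The sketch below shows one possible shape for that interface with a local implementation; the class name `LocalVolumeStorage` comes from this document, while `StorageBackend` and the `put`/`get` signatures are assumptions. An `S3CachedStorage` class would implement the same two methods against an object store.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Interface workflows depend on, so the backend can change between phases."""

    @abstractmethod
    def put(self, key: str, source: Path) -> None: ...

    @abstractmethod
    def get(self, key: str, destination: Path) -> Path: ...


class LocalVolumeStorage(StorageBackend):
    """Phase 1 style backend: objects are plain files under a mounted directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, source: Path) -> None:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)

    def get(self, key: str, destination: Path) -> Path:
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(self.root / key, destination)
        return destination
```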
---
## For & Against: Temporal vs Prefect
### Option A: Keep Prefect (Status Quo)
#### ✅ For (Arguments to Keep Prefect)
1. **No Migration Effort**
- Zero weeks of migration work
- No learning curve
- No risk of migration issues
2. **Team Familiarity**
- Team knows Prefect well
- Existing operational runbooks
- Established debugging patterns
3. **Working System**
- Current workflows function correctly
- No immediate technical blocker
- "If it ain't broke, don't fix it"
4. **Deferred Complexity**
- Can delay architecture decisions
- Focus on feature development
- Postpone infrastructure changes
#### ❌ Against (Arguments Against Keeping Prefect)
1. **Infrastructure Complexity**
- 6 services to manage and monitor
- Complex dependencies (Postgres, Redis, Registry)
- High operational overhead
2. **Scaling Uncertainty**
- Unclear how to scale beyond single host
- Registry becomes bottleneck at scale
- No clear multi-host story
3. **Resource Inefficiency**
- ~2GB idle, ~8GB under load
- Registry storage overhead
- Redundant service layers
4. **Technical Debt Accumulation**
- Complexity will only increase
- Harder to migrate later (more workflows)
- Missing modern features (durable execution)
5. **Prefect Ecosystem Concerns**
- Prefect 3.x changes from 2.x
- Community split (Cloud vs self-hosted)
- Uncertain long-term roadmap
### Option B: Migrate to Temporal (Recommended)
#### ✅ For (Arguments to Migrate)
1. **Dramatic Simplification**
- 6 services → 1 service (83% reduction)
- No registry needed (local images)
- Simpler volume mounting
2. **Better Reliability**
- Durable execution (workflows survive crashes)
- Built-in state persistence
- Proven at massive scale (Netflix, Uber, Snap)
3. **Clear Scaling Path**
- Single host → Multi-host → Nomad cluster
- Architecture designed for scale
- Storage abstraction enables seamless transition
4. **Superior Workflow Engine**
- True durable execution vs task queue
- Better state management
- Handles long-running workflows (fuzzing campaigns)
- Activity timeouts and retries built-in
5. **Operational Benefits**
- Better Web UI for debugging
- Comprehensive workflow history
- Query workflow state at any time
- Simpler deployment (single service)
6. **Future-Proof Architecture**
- Easy Nomad migration path (18+ months)
- Multi-tenancy ready (namespaces)
- Auto-scaling capable
- Industry momentum (growing adoption)
7. **Module Preservation**
- Zero changes to BaseModule interface
- Module discovery unchanged
- Workflows adapt easily (@flow → @workflow)
8. **Resource Efficiency**
- ~1GB idle, ~4.5GB under load
- 44% reduction in resource usage
- No registry storage overhead
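
The reduction percentages quoted in this list follow from a simple relative-change calculation, reproduced here as a small check using the service counts and memory figures from this document.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


for label, before, after in [
    ("Services", 6, 1),
    ("Idle memory (GB)", 2, 1),
    ("Load memory (GB)", 8, 4.5),
]:
    print(f"{label}: {percent_reduction(before, after):.0f}% reduction")
```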
#### ❌ Against (Arguments Against Migration)
1. **Migration Effort**
- 8 weeks of focused work
- Team capacity diverted from features
- Testing and validation required
2. **Learning Curve**
- New concepts (workflows vs activities)
- Different debugging approach
- Team training needed
3. **Migration Risk**
- Potential for workflow disruption
- Bugs in migration code
- Temporary performance issues
4. **Unknown Unknowns**
- May discover edge cases
- Performance characteristics differ
- Integration challenges possible
5. **Temporal Limitations**
- Less mature than Prefect in some areas
- Smaller community (growing)
- Fewer pre-built integrations
### Scoring Matrix
| Criteria | Weight | Prefect | Temporal | Winner |
|----------|--------|---------|----------|--------|
| **Infrastructure Complexity** | 25% | 3/10 | 9/10 | Temporal |
| **Scalability** | 20% | 4/10 | 9/10 | Temporal |
| **Reliability** | 20% | 7/10 | 10/10 | Temporal |
| **Migration Effort** | 15% | 10/10 | 4/10 | Prefect |
| **Team Familiarity** | 10% | 9/10 | 3/10 | Prefect |
| **Resource Efficiency** | 10% | 5/10 | 8/10 | Temporal |
| **Total** | 100% | **5.85/10** | **7.75/10** | **Temporal** |
**Conclusion:** Temporal wins on technical merit despite migration costs.
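
The totals are weighted averages of the per-criterion scores. The sketch below recomputes them from the weights and scores in the matrix; with those values it yields 5.85 for Prefect and 7.75 for Temporal.

```python
# (weight in percent, Prefect score, Temporal score), copied from the matrix.
CRITERIA = {
    "Infrastructure Complexity": (25, 3, 9),
    "Scalability": (20, 4, 9),
    "Reliability": (20, 7, 10),
    "Migration Effort": (15, 10, 4),
    "Team Familiarity": (10, 9, 3),
    "Resource Efficiency": (10, 5, 8),
}


def weighted_total(column: int) -> float:
    """Weighted average of one score column (1 = Prefect, 2 = Temporal)."""
    total_weight = sum(row[0] for row in CRITERIA.values())
    weighted_sum = sum(row[0] * row[column] for row in CRITERIA.values())
    return weighted_sum / total_weight


prefect_total = weighted_total(1)
temporal_total = weighted_total(2)
print(f"Prefect: {prefect_total:.2f}, Temporal: {temporal_total:.2f}")
```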
---
## For & Against: Long-Lived vs Ephemeral Workers
### Context
Workers can spawn ephemeral containers per workflow (like Prefect) or run as long-lived containers processing multiple workflows.
### Option A: Ephemeral Containers
#### ✅ For
1. **Complete Isolation**
- Each workflow in fresh container
- No state leakage between workflows
- Maximum security
2. **Automatic Cleanup**
- Containers destroyed after workflow
- No resource leaks
- Clean slate every time
3. **Matches Current Behavior**
- Similar to Prefect approach
- Easier mental model
- Less architecture change
4. **Simple Development**
- Test with `docker run`
- No complex lifecycle management
- Easy to debug
#### ❌ Against
1. **Performance Overhead**
- 5 second startup per container
- At 450 workflows/hour: 37.5 minutes of cumulative startup time per hour (450 × 5 s)
- Unacceptable at production scale
2. **Resource Churn**
- Constant container creation/destruction
- Docker daemon overhead
- Network/volume setup repeated
3. **Scaling Limitations**
- Can't handle high-volume workloads
- Startup overhead compounds
- Poor resource utilization
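
The cumulative startup cost can be estimated from the two figures above (5 seconds per container, 450 workflows per hour). The function name below is illustrative only.

```python
def startup_overhead_minutes(workflows_per_hour: int, startup_seconds: float) -> float:
    """Cumulative container startup time, in minutes, for one hour of workload."""
    return workflows_per_hour * startup_seconds / 60


per_hour = startup_overhead_minutes(450, 5)
per_day = per_hour * 24
print(f"{per_hour} min/hour, {per_day} min/day")
```

This counts startup time summed across all containers, so it is overhead spread over concurrent workflows rather than wall-clock delay.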
### Option B: Long-Lived Workers (Recommended)
#### ✅ For
1. **Zero Startup Overhead**
- Containers already running
- Immediate workflow execution
- Critical for high-volume production
2. **Resource Efficiency**
- Fixed 4.5GB RAM handles 15 concurrent workflows
- vs ~76GB for ephemeral approach
- 10-20× better resource utilization
3. **Predictable Performance**
- Consistent response times
- No container startup jitter
- Better SLA capability
4. **Horizontal Scaling**
- Add more workers linearly
- Each worker handles N concurrent workflows
- Clear capacity planning
5. **Production-Ready**
- Proven pattern (Uber, Airbnb)
- Handles thousands of workflows/day
- Industry standard for scale
#### ❌ Against
1. **Volume Mounting Complexity**
- Must mount parent directories
- Or implement S3 storage backend
- More sophisticated configuration
2. **Shared Container State**
- Workers reused across workflows
- Potential for subtle bugs
- Requires careful module design
3. **Lifecycle Management**
- Must handle worker restarts
- Graceful shutdown needed
- More complex monitoring
4. **Memory Management**
- Workers accumulate memory over time
- Need periodic restarts
- Requires memory limits
### Decision Matrix
| Scenario | Ephemeral | Long-Lived | Winner |
|----------|-----------|------------|--------|
| **Development** | ✅ Simpler | ⚠️ Complex | Ephemeral |
| **Low Volume (<10/hour)** | ✅ Acceptable | ⚠️ Overkill | Ephemeral |
| **Medium Volume (10-100/hour)** | ⚠️ Wasteful | ✅ Efficient | Long-Lived |
| **High Volume (>100/hour)** | ❌ Unusable | ✅ Required | Long-Lived |
| **Production Scale** | ❌ No | ✅ Yes | Long-Lived |
**Recommendation:** Long-lived workers for production deployment.
**Compromise:** Can start with ephemeral for Phase 1 (proof of concept), migrate to long-lived for Phase 2 (production).
---
## Future Consideration: Nomad vs Kubernetes vs Docker Compose
### When to Consider Orchestration Beyond Docker Compose?
**Trigger Points:**
- ✅ Managing 10+ hosts manually
- ✅ Need multi-tenancy (customer isolation)
- ✅ Require auto-scaling based on metrics
- ✅ Want sophisticated scheduling (bin-packing, constraints)
**Timeline Estimate:** 18-24 months from now
### Option A: Docker Compose (Recommended for Phase 1-2)
#### ✅ For
1. **Simplicity**
- Single YAML file
- No cluster setup
- Easy to understand and debug
2. **Zero Learning Curve**
- Team already knows Docker
- Familiar commands
- Abundant documentation
3. **Sufficient for 1-5 Hosts**
- Deploy same compose file to each host
- Manual but manageable
- Works for current scale
4. **Development Friendly**
- Same config dev and prod
- Fast iteration cycle
- Easy local testing
5. **No Lock-In**
- Easy to migrate to Nomad/K8s later
- Workers portable by design
- Clean exit strategy
#### ❌ Against
1. **Manual Coordination**
- No automatic scheduling
- Manual load balancing
- No health-based rescheduling
2. **Limited Scaling**
- Practical limit ~5-10 hosts
- No auto-scaling
- Manual capacity planning
3. **No Multi-Tenancy**
- Can't isolate customers
- No resource quotas
- Shared infrastructure
4. **Basic Monitoring**
- No cluster-wide metrics
- Per-host monitoring only
- Limited observability
**Verdict:** Perfect for Phase 1 (single host) and Phase 2 (3-5 hosts). Transition to Nomad/K8s at Phase 3.
### Option B: Nomad (Recommended for Phase 3)
#### ✅ For
1. **Operational Simplicity**
- Single binary (vs K8s complexity)
- Easy to install and maintain
- Lower operational overhead
2. **Perfect Fit for Use Case**
- Batch workload focus
- Resource management built-in
- Namespace support for multi-tenancy
3. **Multi-Workload Support**
- Containers (Docker)
- VMs (QEMU)
- Bare processes
- Java JARs
- All in one scheduler
4. **Scheduling Intelligence**
- Bin-packing for efficiency
- Constraint-based placement
- Affinity/anti-affinity rules
- Resource quotas per namespace
5. **Easy Migration from Docker Compose**
- Similar concepts
- `compose-to-nomad` converter tool
- Workers unchanged
- 1-2 week migration
6. **HashiCorp Ecosystem**
- Integrates with Consul (service discovery)
- Integrates with Vault (secrets)
- Proven at scale (Cloudflare, CircleCI)
7. **Auto-Scaling**
- Built-in scaling policies
- Prometheus integration
- Queue-depth based scaling
- Horizontal scaling automatic
#### ❌ Against
1. **Learning Curve**
- HCL syntax to learn
- New concepts (allocations, deployments)
- Consul integration complexity
2. **Smaller Ecosystem**
- Fewer tools than Kubernetes
- Smaller community
- Fewer third-party integrations
3. **Network Isolation**
- Less sophisticated than K8s
- Requires Consul Connect for service mesh
- Weaker network policies
4. **Maturity**
- Less mature than Kubernetes
- Fewer production battle stories
- Evolving feature set
**Verdict:** Excellent choice once FuzzForge outgrows Docker Compose. Simpler than K8s and well matched to FuzzForge's scale.
### Option C: Kubernetes
#### ✅ For
1. **Industry Standard**
- Largest ecosystem
- Most third-party integrations
- Abundant expertise available
2. **Feature Richness**
- Sophisticated networking (Network Policies)
- Advanced scheduling
- Rich operator ecosystem
- Helm charts for everything
3. **Multi-Tenancy**
- Strong namespace isolation
- Fine-grained RBAC
- Network policies
- Pod Security Standards (the successor to Pod Security Policies, which were removed in Kubernetes 1.25)
4. **Massive Scale**
- Proven to 5,000+ nodes
- Google-scale reliability
- Battle-tested
5. **Cloud Integration**
- Native on all clouds (EKS, GKE, AKS)
- Managed offerings reduce complexity
- Auto-scaling (HPA, Cluster Autoscaler)
#### ❌ Against
1. **Operational Complexity**
- High learning curve
- Complex to set up and maintain
- Requires dedicated ops team
2. **Resource Overhead**
- Control plane resource usage
- etcd cluster management
- More moving parts
3. **Overkill for Use Case**
- FuzzForge is batch workload, not microservices
- Don't need K8s networking complexity
- Simpler alternatives sufficient
4. **Container-Only**
- Can't run VMs easily
- Can't run bare processes
- Nomad more flexible
5. **Cost**
- Higher operational cost
- More infrastructure required
- Steeper learning investment
**Verdict:** Overkill for FuzzForge. Choose it only if planning for 500+ hosts or if the extensive ecosystem is needed.
### Comparison Matrix
| Feature | Docker Compose | Nomad | Kubernetes |
|---------|---------------|-------|------------|
| **Operational Complexity** | ★☆☆☆☆ (Lowest) | ★★☆☆☆ (Low) | ★★★★☆ (High) |
| **Learning Curve** | ★☆☆☆☆ (Easy) | ★★★☆☆ (Medium) | ★★★★★ (Steep) |
| **Setup Time** | Minutes | 1 day | 1-2 weeks |
| **Best For** | 1-5 hosts | 10-500 hosts | 500+ hosts |
| **Auto-Scaling** | ❌ No | ✅ Yes | ✅ Yes |
| **Multi-Tenancy** | ❌ No | ✅ Yes (Namespaces) | ✅ Yes (Advanced) |
| **Workload Types** | Containers | Containers + VMs + Processes | Containers (mainly) |
| **Service Mesh** | ❌ No | ⚠️ Via Consul Connect | ✅ Istio/Linkerd |
| **Ecosystem Size** | Medium | Small | Huge |
| **Resource Efficiency** | High | High | Medium |
| **FuzzForge Fit** | ✅ Phase 1-2 | ✅ Phase 3+ | ⚠️ Unnecessary |
### Recommendation Timeline
```
Months 0-6: Docker Compose (Single Host)
└─ Simplest, fastest to implement
Months 6-18: Docker Compose (Multi-Host)
└─ Scale to 3-5 hosts manually
Months 18+: Nomad (if needed)
└─ Add when 10+ hosts or auto-scaling required
Never: Kubernetes
└─ Unless scale exceeds 500 hosts
```
---
## Benefits Summary
### Infrastructure Benefits
| Metric | Current (Prefect) | Future (Temporal) | Improvement |
|--------|-------------------|-------------------|-------------|
| **Services to Manage** | 6 | 1 | 83% reduction |
| **Idle Memory Usage** | ~2GB | ~1GB | 50% reduction |
| **Load Memory Usage** | ~8GB | ~4.5GB | 44% reduction |
| **Docker Registry** | Required | Not needed | Eliminated |
| **Configuration Files** | 6 service configs | 1 config | 83% simpler |
| **Deployment Complexity** | High | Low | Significant |
### Operational Benefits
1. **Simpler Monitoring**
- 1 service vs 6
- Single Web UI (Temporal)
- Fewer alerts to configure
2. **Easier Debugging**
- Complete workflow history in Temporal
- Query workflow state at any time
- Better error visibility
3. **Faster Deployments**
- No registry push/pull
- Restart 1 service vs 6
- Quicker iteration cycles
4. **Better Reliability**
- Durable execution (workflows survive crashes)
- Automatic retries built-in
- State persistence guaranteed
5. **Clear Scaling Path**
- Phase 1: Single host (now)
- Phase 2: Multi-host (6-18 months)
- Phase 3: Nomad cluster (18+ months)
### Developer Experience Benefits
1. **Local Development**
- Simpler docker-compose
- Faster startup (fewer services)
- Easier to reason about
2. **Module Development**
- No changes to BaseModule
- Same discovery mechanism
- Same testing approach
3. **Workflow Development**
- Better debugging tools (Temporal Web UI)
- Workflow history visualization
- Easier to test retry logic
4. **Onboarding**
- 1 service to understand vs 6
- Clearer architecture
- Less to learn
---
## Risks & Mitigations
### Risk 1: Migration Introduces Bugs
**Likelihood:** Medium
**Impact:** High
**Risk Score:** 6/10
**Mitigation:**
- Phased migration (one workflow at a time)
- Parallel run (Prefect + Temporal) during transition
- Comprehensive testing before cutover
- Rollback plan documented
### Risk 2: Performance Degradation
**Likelihood:** Low
**Impact:** Medium
**Risk Score:** 3/10
**Mitigation:**
- Load testing before production
- Monitor key metrics during migration
- Temporal is proven at far higher scale than the current workload
- Easy to tune worker concurrency
### Risk 3: Team Learning Curve
**Likelihood:** High
**Impact:** Low
**Risk Score:** 4/10
**Mitigation:**
- Training sessions on Temporal concepts
- Pair programming during migration
- Comprehensive documentation
- Temporal has excellent docs
### Risk 4: Unknown Edge Cases
**Likelihood:** Medium
**Impact:** Medium
**Risk Score:** 5/10
**Mitigation:**
- Thorough testing with real workflows
- Gradual rollout (dev → staging → production)
- Keep Prefect running initially
- Community support available
### Risk 5: Module System Incompatibility
**Likelihood:** Very Low
**Impact:** High
**Risk Score:** 2/10
**Mitigation:**
- Module interface preserved (BaseModule unchanged)
- Only orchestration changes
- Modules are decoupled from Prefect
- Test suite validates module behavior
### Risk 6: Long-Lived Worker Stability
**Likelihood:** Low
**Impact:** Medium
**Risk Score:** 3/10
**Mitigation:**
- Proper resource limits (memory, CPU)
- Periodic worker restarts (daily)
- Monitoring for memory leaks
- Health checks and auto-restart
### Overall Risk Assessment
**Total Risk Score:** 23/60 (38%) - **Medium-Low Risk**
**Conclusion:** Risks are manageable with proper planning and mitigation strategies.
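The total can be re-derived from the six per-risk scores listed above. The following is a minimal sketch of that arithmetic. The dictionary labels are shortened forms of the risk headings, and the 10-point maximum per risk is taken from the "/10" notation.

```python
# Risk scores as listed in this section (each out of 10).
risk_scores = {
    "Migration introduces bugs": 6,
    "Performance degradation": 3,
    "Team learning curve": 4,
    "Unknown edge cases": 5,
    "Module system incompatibility": 2,
    "Long-lived worker stability": 3,
}
MAX_PER_RISK = 10

total = sum(risk_scores.values())
maximum = MAX_PER_RISK * len(risk_scores)
percent = round(100 * total / maximum)

print(f"Total risk score: {total}/{maximum} ({percent}%)")
```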
---
## Cost Analysis
### Current Costs (Prefect)
**Infrastructure:**
```
Single Host (8GB RAM, 4 CPU):
- Cloud VM: $80-120/month
- Or bare metal amortized: ~$50/month

Services Running:
- Prefect Server: ~500MB
- Postgres: ~200MB
- Redis: ~100MB
- Registry: ~500MB
- Docker Proxy: ~50MB
- Worker: ~500MB
- Workflows: ~6GB (peak)
Total: ~8GB

Development Time:
- Maintenance: ~2 hours/week
- Debugging: ~3 hours/week
- Deployments: ~1 hour/week
Total: 6 hours/week = $600/month (at $25/hour)
```
**Monthly Total:** ~$700/month
### Future Costs (Temporal)
**Phase 1 - Single Host:**
```
Single Host (6GB RAM, 4 CPU):
- Cloud VM: $60-80/month
- Or bare metal amortized: ~$40/month

Services Running:
- Temporal: ~1GB
- Workers: ~3.5GB
- Workflows: ~1GB (peak)
Total: ~5.5GB

Development Time:
- Maintenance: ~1 hour/week
- Debugging: ~2 hours/week
- Deployments: ~0.5 hour/week
Total: 3.5 hours/week = $350/month
```
**Monthly Total:** ~$430/month
**Phase 2 - Multi-Host (3 hosts):**
```
3 Hosts + S3 Storage:
- Cloud VMs: $180-240/month
- S3 storage (1TB): ~$23/month
- S3 transfer (100GB): ~$9/month

Development Time:
- Maintenance: ~2 hours/week
- Monitoring: ~2 hours/week
Total: 4 hours/week = $400/month
```
**Monthly Total:** ~$670/month (3× capacity)
**Phase 3 - Nomad Cluster (10+ hosts):**
```
Nomad Cluster:
- 3 Nomad servers: $120/month
- 10 worker hosts: $800/month
- S3 storage (5TB): ~$115/month
- Load balancer: ~$20/month

Development Time:
- Nomad maintenance: ~3 hours/week
- Monitoring: ~3 hours/week
Total: 6 hours/week = $600/month
```
**Monthly Total:** ~$1,655/month (10× capacity)
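The monthly totals above combine infrastructure cost with engineer time. The sketch below reproduces them as low/high ranges. It assumes 4 working weeks per month, which is implied by "6 hours/week = $600/month (at $25/hour)", and it uses the cloud VM prices rather than the bare-metal alternative. The function name `monthly_total` is hypothetical.

```python
HOURLY_RATE = 25      # dollars per hour, as stated for the current setup
WEEKS_PER_MONTH = 4   # assumption: implied by "6 hours/week = $600/month"

def monthly_total(infra_low, infra_high, hours_per_week):
    """Return the (low, high) monthly cost: infrastructure plus engineer time."""
    labor = hours_per_week * WEEKS_PER_MONTH * HOURLY_RATE
    return infra_low + labor, infra_high + labor

totals = {
    "Current (Prefect)": monthly_total(80, 120, 6),
    "Phase 1": monthly_total(60, 80, 3.5),
    # Phase 2 infrastructure: VMs + S3 storage + S3 transfer
    "Phase 2": monthly_total(180 + 23 + 9, 240 + 23 + 9, 4),
    # Phase 3 infrastructure: Nomad servers + workers + S3 + load balancer
    "Phase 3": monthly_total(120 + 800 + 115 + 20, 120 + 800 + 115 + 20, 6),
}

for phase, (low, high) in totals.items():
    print(f"{phase}: ${low:,.0f}-${high:,.0f}/month")
```

The rounded totals quoted in this section sit inside or at the upper end of these ranges.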
### Cost Comparison
| Phase | Hosts | Capacity | Monthly Cost | Cost per Workflow |
|-------|-------|----------|--------------|-------------------|
| **Current (Prefect)** | 1 | 10K/day | $700 | $0.0023 |
| **Phase 1 (Temporal)** | 1 | 10K/day | $430 | $0.0014 |
| **Phase 2 (Temporal)** | 3 | 30K/day | $670 | $0.0007 |
| **Phase 3 (Nomad)** | 10 | 100K/day | $1,655 | $0.0006 |
**Savings:**
- Phase 1 vs Current: **$270/month (39% reduction)**
- Better cost efficiency as scale increases
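The "Cost per Workflow" column follows from dividing the monthly cost by the monthly workflow capacity. The sketch below shows that calculation. It assumes a 30-day month and full use of the stated daily capacity, neither of which is spelled out in the table.

```python
DAYS_PER_MONTH = 30  # assumption: the table appears to use a 30-day month

# (monthly cost in dollars, workflow capacity per day), taken from the table.
phases = {
    "Current (Prefect)": (700, 10_000),
    "Phase 1 (Temporal)": (430, 10_000),
    "Phase 2 (Temporal)": (670, 30_000),
    "Phase 3 (Nomad)": (1655, 100_000),
}

def cost_per_workflow(monthly_cost, workflows_per_day):
    """Monthly cost divided by the number of workflows run in a month."""
    return monthly_cost / (workflows_per_day * DAYS_PER_MONTH)

for name, (cost, capacity) in phases.items():
    print(f"{name}: ${cost_per_workflow(cost, capacity):.4f} per workflow")
```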
---
## Timeline & Effort
### Phase 1: Temporal Migration (8 Weeks)
**Week 1-2: Foundation**
- Deploy Temporal server
- Remove Prefect infrastructure
- Implement storage abstraction layer
- Effort: 60-80 hours
**Week 3-4: Workers**
- Create long-lived worker pools
- Implement module auto-discovery
- Configure Docker Compose
- Effort: 60-80 hours
**Week 5-6: Workflows**
- Migrate workflows to Temporal
- Convert `@flow` → `@workflow.defn`
- Test all workflows
- Effort: 60-80 hours
**Week 7: Integration**
- Update backend API
- End-to-end testing
- Load testing
- Effort: 40-60 hours
**Week 8: Documentation & Cleanup**
- Update documentation
- Remove old code
- Training sessions
- Effort: 30-40 hours
**Total Effort:** 250-340 hours (~2 engineers over 2 months, each at roughly half time)
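The total is the sum of the per-stage effort ranges. A minimal sketch of that sum follows, with a conversion to engineer-weeks that assumes a 40-hour week (an assumption, not a figure from this document).

```python
HOURS_PER_WEEK = 40  # assumption: one full-time engineer-week

# (low, high) effort in hours for each stage of Phase 1.
effort_hours = {
    "Week 1-2: Foundation": (60, 80),
    "Week 3-4: Workers": (60, 80),
    "Week 5-6: Workflows": (60, 80),
    "Week 7: Integration": (40, 60),
    "Week 8: Documentation & Cleanup": (30, 40),
}

total_low = sum(low for low, _ in effort_hours.values())
total_high = sum(high for _, high in effort_hours.values())
weeks_low = total_low / HOURS_PER_WEEK
weeks_high = total_high / HOURS_PER_WEEK

print(f"Total effort: {total_low}-{total_high} hours")
print(f"Engineer-weeks: {weeks_low}-{weeks_high}")
```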
### Phase 2: Multi-Host (When Needed)
**Effort:** 40-60 hours
- Set up S3 storage
- Deploy to multiple hosts
- Configure load balancing
- Test and validate
### Phase 3: Nomad (If Needed)
**Effort:** 80-120 hours
- Install Nomad cluster
- Convert jobs to Nomad
- Set up auto-scaling
- Production deployment
---
## Licensing Considerations
### Overview
**Critical Context:** FuzzForge is a **generic platform** where modules and workflows "could be anything" - not limited to fuzzing or security analysis. This significantly impacts the licensing assessment, particularly for Nomad's Business Source License.
### Temporal Licensing: ✅ SAFE
**License:** MIT License
**Status:** Fully open source, no usage restrictions (only the license notice must be preserved)
**Commercial Use:**
- ✅ Use in production
- ✅ Sell services built on Temporal
- ✅ Modify source code
- ✅ Redistribute
- ✅ Sublicense
- ✅ Private use
**Conclusion:** Temporal has **no licensing concerns** for any use case. You can build any type of platform (fuzzing, security, generic workflows, orchestration-as-a-service) without legal risk.
**Reference:** https://github.com/temporalio/temporal/blob/master/LICENSE
---
### Nomad Licensing: ⚠️ REQUIRES CAREFUL EVALUATION
**License:** Business Source License 1.1 (BSL 1.1)
**Status:** Source-available but with restrictions
#### BSL 1.1 Key Terms
**Change Date:** 4 years after each version release
**Change License:** Mozilla Public License 2.0 (MPL 2.0)
**After 4 years:** Each version becomes fully open source under MPL 2.0
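As an illustration of the four-year rule described above, the sketch below computes when a given release would convert to MPL 2.0. The helper name and the example release date are hypothetical. The authoritative Change Date is the one written in the license file shipped with each version.

```python
from datetime import date

CHANGE_AFTER_YEARS = 4  # per the Change Date term described above

def bsl_change_date(release: date) -> date:
    """Return the date a release converts to the Change License."""
    target_year = release.year + CHANGE_AFTER_YEARS
    try:
        return release.replace(year=target_year)
    except ValueError:
        # Feb 29 release landing in a non-leap year: fall back to Feb 28.
        return release.replace(year=target_year, day=28)

# Hypothetical release date, for illustration only.
print(bsl_change_date(date(2025, 3, 1)))
```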
#### The Critical Restriction
```
Additional Use Grant (paraphrased):
You may make use of the Licensed Work, provided that you do not use
the Licensed Work for a Competitive Offering.
A "Competitive Offering" is a commercial product or service that is:
1. Substantially similar to the capabilities of the Licensed Work
2. Offered to third parties on a paid basis (including paid support)
```
#### What This Means for FuzzForge
**The licensing risk depends on how FuzzForge is marketed and positioned:**
##### ✅ LIKELY SAFE: Specific Use Case Platform
If FuzzForge is marketed as a **specialized platform** for specific domains:
**Examples:**
- ✅ "FuzzForge - Security Analysis Platform"
- ✅ "FuzzForge - Automated Fuzzing Service"
- ✅ "FuzzForge - Code Analysis Tooling"
- ✅ "FuzzForge - Vulnerability Assessment Platform"
**Why Safe:**
- Nomad is used **internally** for infrastructure
- Customer is buying **fuzzing/security services**, not orchestration
- Platform's value is the **domain expertise**, not the scheduler
- Not competing with HashiCorp's offerings
##### ⚠️ GRAY AREA: Generic Workflow Platform
If FuzzForge pivots to emphasize **generic workflow capabilities**:
**Examples:**
- ⚠️ "FuzzForge - Workflow Orchestration Platform"
- ⚠️ "FuzzForge - Run any containerized workload"
- ⚠️ "FuzzForge - Generic task scheduler"
- ⚠️ Marketing that emphasizes "powered by Nomad"
**Why Risky:**
- Could be seen as competing with Nomad Enterprise
- Offering similar capabilities to HashiCorp's products
- Customers might use it as a Nomad replacement
##### ❌ CLEARLY VIOLATES: Orchestration-as-a-Service
If FuzzForge becomes primarily an **orchestration product**:
**Examples:**
- ❌ "FuzzForge Orchestrator - Schedule any workload"
- ❌ "Nomad-as-a-Service powered by FuzzForge"
- ❌ "Generic container orchestration platform"
- ❌ Reselling Nomad capabilities with a thin wrapper
**Why Violation:**
- Directly competing with HashiCorp Nomad offerings
- "Substantially similar" to Nomad's capabilities
- Commercial offering of orchestration
#### Real-World Precedents
**HashiCorp has NOT** (as of 2025) aggressively enforced the BSL against companies using its tools internally. The restriction targets:
- Cloud providers offering "managed Nomad" services
- Companies building Nomad competitors
- Vendors reselling HashiCorp functionality
**NOT targeting:**
- Companies using Nomad for internal infrastructure
- SaaS platforms that happen to use Nomad
- Domain-specific platforms (like FuzzForge's security focus)
#### Decision Tree: Should I Use Nomad?
```
┌─────────────────────────────────────┐
│ Is orchestration your core product? │
└─────────────────────────────────────┘
          ┌────────┴────────┐
          │                 │
         YES               NO
          │                 │
     ┌────┴────┐       ┌────┴────┐
     │ DON'T   │       │ What's  │
     │ USE     │       │ your    │
     │ NOMAD   │       │ value   │
     │         │       │ prop?   │
     └─────────┘       └─────┬────┘
                 ┌───────────┴───────────┐
                 │                       │
         Domain Expertise     Orchestration Features
        (Fuzzing, Security)  (Scheduling, Auto-scale)
                 │                       │
            ┌────┴────┐             ┌────┴────┐
            │ SAFE TO │             │ RISKY - │
            │ USE     │             │ CONSULT │
            │ NOMAD   │             │ LAWYER  │
            └─────────┘             └─────────┘
```
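The decision tree can also be written as a small function, which makes the two questions and three outcomes explicit. This is a sketch only. The function name, parameter names, and string values are hypothetical, and the result is guidance rather than a legal opinion.

```python
def nomad_guidance(orchestration_is_core_product: bool, value_prop: str) -> str:
    """Walk the decision tree: core product first, then value proposition."""
    if orchestration_is_core_product:
        return "do not use Nomad"
    if value_prop == "domain_expertise":          # e.g. fuzzing, security
        return "safe to use Nomad"
    if value_prop == "orchestration_features":    # e.g. scheduling, auto-scale
        return "risky: consult a lawyer"
    raise ValueError(f"unknown value proposition: {value_prop!r}")

# FuzzForge's current position as described in this document.
print(nomad_guidance(False, "domain_expertise"))
```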
#### FuzzForge Current Position
**Current Positioning:** Domain-specific security/analysis platform
**Nomad Usage:** Internal infrastructure (not customer-facing)
**Risk Level:** **LOW** (likely safe)
**However**, the user stated: _"modules and workflows could be anything"_, which suggests potential future expansion beyond the security domain.
**If FuzzForge pivots to generic platform:**
- Risk increases from LOW → MEDIUM
- Need legal review before Phase 3 (Nomad migration)
- Consider Kubernetes as alternative
---
### Kubernetes Licensing: ✅ SAFE
**License:** Apache License 2.0
**Status:** Fully open source, no usage restrictions (only attribution and license notices must be preserved)
**Commercial Use:**
- ✅ Use in production
- ✅ Sell services built on Kubernetes
- ✅ Modify source code
- ✅ Offer managed Kubernetes (AWS EKS, GCP GKE do this)
- ✅ Build competitive offerings
**Conclusion:** Kubernetes has **no licensing concerns** whatsoever, even for orchestration-as-a-service offerings.
---
### Docker Licensing: ✅ SAFE
**License:** Apache License 2.0
**Status:** Fully open source
**Note:** Docker Desktop has separate commercial licensing requirements for organizations >250 employees or >$10M revenue, but Docker Engine (which FuzzForge uses) remains free for all uses.
---
### Licensing Recommendation Matrix
| Component | License | FuzzForge Risk | Recommendation |
|-----------|---------|----------------|----------------|
| **Temporal** | MIT | ✅ None | Use freely |
| **Docker Engine** | Apache 2.0 | ✅ None | Use freely |
| **Nomad** | BSL 1.1 | ⚠️ Low-Medium | Safe if domain-specific |
| **Kubernetes** | Apache 2.0 | ✅ None | Safe alternative to Nomad |
---
### Recommendations by Phase
#### Phase 1 & 2: Temporal + Docker Compose
**Licenses:** MIT (Temporal) + Apache 2.0 (Docker)
**Risk:** ✅ **ZERO** - Fully safe for any use case
**Action:** Proceed; no legal review required
---
#### Phase 3: Adding Nomad (18+ months)
**License:** BSL 1.1
**Risk:** ⚠️ **LOW-MEDIUM** - Depends on positioning
**Action Required BEFORE Migration:**
1. **Clarify Product Positioning**
- Will FuzzForge be marketed as a generic platform?
- Or will it remain domain-specific (security/fuzzing)?
2. **Legal Review** (Recommended)
- Consult IP lawyer familiar with BSL
- Show marketing materials, website copy
- Get written opinion on BSL compliance
- Cost: $2,000-5,000 (one-time)
3. **Decision Point:**
```
IF positioning = domain-specific (security/fuzzing)
  THEN proceed with Nomad (low risk)
ELSE IF positioning = generic platform
  THEN consider Kubernetes instead (zero risk)
```
---
#### Alternative: Use Kubernetes Instead of Nomad
**If concerned about Nomad BSL risk:**
**Pros:**
- ✅ Zero licensing risk (Apache 2.0)
- ✅ Can offer orchestration-as-a-service freely
- ✅ Larger ecosystem and community
- ✅ Managed offerings on all clouds
**Cons:**
- ❌ Higher operational complexity than Nomad
- ❌ Overkill for batch workload use case
- ❌ Steeper learning curve
**When to Choose K8s Over Nomad:**
- Planning to market FuzzForge as a generic platform
- Uncomfortable with BSL restrictions
- Need absolute licensing certainty
- Have K8s expertise already
---
### Licensing Risk Summary
| Scenario | Temporal | Docker | Nomad | Kubernetes |
|----------|----------|--------|-------|------------|
| **Security platform (current)** | ✅ Safe | ✅ Safe | ✅ Safe | ✅ Safe |
| **Generic workflow platform** | ✅ Safe | ✅ Safe | ⚠️ Risky | ✅ Safe |
| **Orchestration-as-a-service** | ✅ Safe | ✅ Safe | ❌ Violation | ✅ Safe |
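The summary table can be encoded as a lookup, which is one way to keep the Phase 3 scheduler choice tied to the product positioning. The sketch below mirrors the table. The dictionary keys and function names are hypothetical identifiers chosen for illustration.

```python
from enum import Enum

class Risk(Enum):
    SAFE = "safe"
    RISKY = "risky"
    VIOLATION = "violation"

# Rows and columns copied from the Licensing Risk Summary table.
# The keys are hypothetical identifiers chosen for this sketch.
LICENSING_RISK = {
    "security_platform": {
        "temporal": Risk.SAFE, "docker": Risk.SAFE,
        "nomad": Risk.SAFE, "kubernetes": Risk.SAFE,
    },
    "generic_workflow_platform": {
        "temporal": Risk.SAFE, "docker": Risk.SAFE,
        "nomad": Risk.RISKY, "kubernetes": Risk.SAFE,
    },
    "orchestration_as_a_service": {
        "temporal": Risk.SAFE, "docker": Risk.SAFE,
        "nomad": Risk.VIOLATION, "kubernetes": Risk.SAFE,
    },
}

def flagged_components(scenario: str) -> list[str]:
    """Return the components that are not marked safe for a scenario."""
    row = LICENSING_RISK[scenario]
    return sorted(name for name, risk in row.items() if risk is not Risk.SAFE)

def phase3_scheduler(scenario: str) -> str:
    """Prefer Nomad only when the table marks it safe; otherwise Kubernetes."""
    nomad_risk = LICENSING_RISK[scenario]["nomad"]
    return "nomad" if nomad_risk is Risk.SAFE else "kubernetes"

for scenario in LICENSING_RISK:
    print(scenario, flagged_components(scenario), phase3_scheduler(scenario))
```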
---
### Key Takeaways
1. **Temporal is completely safe** - MIT license has zero restrictions for any use case
2. **Nomad's BSL depends on positioning**:
- ✅ Safe for domain-specific platforms (security, fuzzing)
- ⚠️ Risky for generic workflow platforms
- ❌ Violation for orchestration-as-a-service
3. **User's statement matters**: _"modules could be anything"_ suggests generic platform potential → increases Nomad risk
4. **Mitigation strategies**:
- Keep marketing focused on domain expertise
- Get legal review before Phase 3 (Nomad)
- Alternative: Use Kubernetes (Apache 2.0) instead
5. **Decision timing**: No urgency - Nomad decision is 18+ months away (Phase 3)
6. **Recommended approach**:
```
Now       → Phase 1-2: Temporal + Docker Compose (zero risk)
18 months → Phase 3: Re-evaluate positioning
            → Domain-specific? Use Nomad
            → Generic platform? Use Kubernetes
```
---
## Recommendation
### Primary Recommendation: **PROCEED WITH TEMPORAL MIGRATION**
**Confidence Level:** High (8/10)
### Rationale
1. **Technical Benefits Outweigh Costs**
- 83% infrastructure reduction
- 44% resource savings
- Clear scaling path
- Better reliability
2. **Manageable Risks**
- Low-medium risk profile
- Good mitigation strategies
- Rollback plan exists
- Module system preserved
3. **Right Timing**
- Before production scale makes migration harder
- Team capacity available
- Module architecture stable
- Clear 8-week timeline
4. **Future-Proof**
- Easy Nomad migration when needed
- Multi-host ready (storage abstraction)
- Industry-proven technology
- Growing ecosystem
### Phased Approach
**Immediate (Now):**
- ✅ Approve Temporal migration
- ✅ Allocate 2 engineers for 8 weeks
- ✅ Set Week 1 start date
**Near-Term (Months 1-6):**
- ✅ Complete Temporal migration
- ✅ Validate in production
- ✅ Optimize performance
**Mid-Term (Months 6-18):**
- ⏳ Monitor scaling needs
- ⏳ Implement S3 storage if needed
- ⏳ Expand to multi-host if needed
**Long-Term (Months 18+):**
- ⏳ Evaluate Nomad necessity
- ⏳ Migrate to Nomad if triggers met
- ⏳ Continue scaling horizontally
### Decision Criteria
**Proceed with Migration if:**
- ✅ Team agrees on benefits (CHECK)
- ✅ 8-week timeline acceptable (CHECK)
- ✅ Resources available (CHECK)
- ✅ Risk profile acceptable (CHECK)
**Defer Migration if:**
- ❌ Critical features launching soon (DEPENDS)
- ❌ Team capacity constrained (DEPENDS)
- ❌ Major Prefect improvements announced (UNLIKELY)
### Alternative: Start Smaller
**If full migration seems risky:**
1. **Proof of Concept (2 weeks)**
- Migrate one simple workflow
- Validate Temporal locally
- Assess complexity
- Decision point: Continue or abort
2. **Parallel Run (4 weeks)**
- Run Temporal alongside Prefect
- Duplicate one workflow
- Compare results
- Build confidence
3. **Full Migration (6 weeks)**
- If POC successful, proceed
- Migrate remaining workflows
- Decommission Prefect
**Total:** 12 weeks (vs 8 weeks direct)
---
## Appendix: Quick Reference
### One-Page Summary
**WHAT:** Migrate from Prefect to Temporal
**WHY:** Simpler (6 services → 1), more scalable, more reliable
**WHEN:** Now (8 weeks)
**WHO:** 2 engineers
**COST:** $430/month (vs $700 current) = 39% savings
**RISK:** Medium-Low (manageable)
**OUTCOME:** Production-ready infrastructure with clear scaling path
### Key Metrics
| Metric | Current | Future | Change |
|--------|---------|--------|--------|
| Services | 6 | 1 | -83% |
| Memory | 8GB | 4.5GB | -44% |
| Cost | $700/mo | $430/mo | -39% |
| Capacity | 10K/day | 10K/day | Same (Phase 1) |
| Dev Time | 6h/week | 3.5h/week | -42% |
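The "Change" column is the relative difference between the current and future values, rounded to a whole percent. A minimal sketch of that calculation, using the figures from the table (the function name is hypothetical):

```python
def percent_change(current, future):
    """Relative change from current to future, rounded to a whole percent."""
    return round(100 * (future - current) / current)

# (current, future) pairs taken from the Key Metrics table.
metrics = {
    "Services": (6, 1),
    "Memory (GB)": (8, 4.5),
    "Cost ($/mo)": (700, 430),
    "Dev time (h/week)": (6, 3.5),
}

changes = {name: percent_change(cur, fut) for name, (cur, fut) in metrics.items()}

for name, change in changes.items():
    print(f"{name}: {change:+d}%")
```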
### Decision Checklist
- [ ] Review this document with team
- [ ] Discuss concerns and questions
- [ ] Vote: Proceed / Defer / Reject
- [ ] If proceed: Assign engineers
- [ ] If proceed: Set start date
- [ ] If defer: Set review date (3 months)
- [ ] If reject: Document reasons
---
**Document Version:** 1.0
**Last Updated:** 2025-09-30
**Next Review:** After decision or in 3 months