fuzzforge_ai/MIGRATION_DECISION.md
Commit 60ca088ecf by tduhamel42: CI/CD Integration with Ephemeral Deployment Model (#14)
* feat: Complete migration from Prefect to Temporal

BREAKING CHANGE: Replaces Prefect workflow orchestration with Temporal

## Major Changes
- Replace Prefect with Temporal for workflow orchestration
- Implement vertical worker architecture (rust, android)
- Replace Docker registry with MinIO for unified storage
- Refactor activities to be co-located with workflows
- Update all API endpoints for Temporal compatibility

## Infrastructure
- New: docker-compose.temporal.yaml (Temporal + MinIO + workers)
- New: workers/ directory with rust and android vertical workers
- New: backend/src/temporal/ (manager, discovery)
- New: backend/src/storage/ (S3-cached storage with MinIO)
- New: backend/toolbox/common/ (shared storage activities)
- Deleted: docker-compose.yaml (old Prefect setup)
- Deleted: backend/src/core/prefect_manager.py
- Deleted: backend/src/services/prefect_stats_monitor.py
- Deleted: Docker registry and insecure-registries requirement

## Workflows
- Migrated: security_assessment workflow to Temporal
- New: rust_test workflow (example/test workflow)
- Deleted: secret_detection_scan (Prefect-based, to be reimplemented)
- Activities now co-located with workflows for independent testing

## API Changes
- Updated: backend/src/api/workflows.py (Temporal submission)
- Updated: backend/src/api/runs.py (Temporal status/results)
- Updated: backend/src/main.py (727 lines, TemporalManager integration)
- Updated: All 16 MCP tools to use TemporalManager

## Testing
- All services healthy (Temporal, PostgreSQL, MinIO, workers, backend)
- All API endpoints functional
- End-to-end workflow test passed (72 findings from vulnerable_app)
- MinIO storage integration working (target upload/download, results)
- Worker activity discovery working (6 activities registered)
- Tarball extraction working
- SARIF report generation working

## Documentation
- ARCHITECTURE.md: Complete Temporal architecture documentation
- QUICKSTART_TEMPORAL.md: Getting started guide
- MIGRATION_DECISION.md: Why we chose Temporal over Prefect
- IMPLEMENTATION_STATUS.md: Migration progress tracking
- workers/README.md: Worker development guide

## Dependencies
- Added: temporalio>=1.6.0
- Added: boto3>=1.34.0 (MinIO S3 client)
- Removed: prefect>=3.4.18
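
Because MinIO speaks the standard S3 API, the boto3 client works against it unchanged. A minimal sketch of a target upload, with hypothetical endpoint, bucket name, and credentials:

```python
import boto3

# Hypothetical endpoint and credentials; MinIO is S3-compatible, so the
# standard boto3 client works without modification.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

def upload_target(run_id: str, tarball_path: str) -> str:
    """Upload a target tarball and return its object key."""
    key = f"targets/{run_id}.tar.gz"
    s3.upload_file(tarball_path, "fuzzforge-targets", key)  # bucket name assumed
    return key
```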

* feat: Add Python fuzzing vertical with Atheris integration

This commit implements a complete Python fuzzing workflow using Atheris:

## Python Worker (workers/python/)
- Dockerfile with Python 3.11, Atheris, and build tools
- Generic worker.py for dynamic workflow discovery
- requirements.txt with temporalio, boto3, atheris dependencies
- Added to docker-compose.temporal.yaml with dedicated cache volume
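
The generic worker.py boils down to connecting a Temporal worker to the vertical's task queue with whatever workflows and activities it has discovered. A sketch of the wiring; the import paths are hypothetical, standing in for the dynamic discovery the real worker performs:

```python
import asyncio
from temporalio.client import Client
from temporalio.worker import Worker

# In the real worker.py these are discovered dynamically from the mounted
# workflow directory; fixed imports keep this sketch self-contained.
from toolbox.workflows.atheris_fuzzing.workflow import AtherisFuzzingWorkflow  # hypothetical path
from toolbox.common.storage import download_target, upload_results  # hypothetical activities

async def main() -> None:
    client = await Client.connect("temporal:7233")
    worker = Worker(
        client,
        task_queue="python",  # the vertical's name, used for routing
        workflows=[AtherisFuzzingWorkflow],
        activities=[download_target, upload_results],
    )
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())
```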

## AtherisFuzzer Module (backend/toolbox/modules/fuzzer/)
- Reusable module extending BaseModule
- Auto-discovers fuzz targets (fuzz_*.py, *_fuzz.py, fuzz_target.py)
- Recursive search to find targets in nested directories
- Dynamically loads TestOneInput() function
- Configurable max_iterations and timeout
- Real-time stats callback support for live monitoring
- Returns findings as ModuleFinding objects
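
The discovery step reduces to a recursive glob over the listed filename patterns. A minimal sketch:

```python
from pathlib import Path

# Patterns matched by the auto-discovery described above.
FUZZ_TARGET_PATTERNS = ("fuzz_*.py", "*_fuzz.py", "fuzz_target.py")

def discover_fuzz_targets(root: Path) -> list[Path]:
    """Recursively collect candidate fuzz harnesses under `root`."""
    found: set[Path] = set()
    for pattern in FUZZ_TARGET_PATTERNS:
        found.update(root.rglob(pattern))
    return sorted(found)
```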

## Atheris Fuzzing Workflow (backend/toolbox/workflows/atheris_fuzzing/)
- Temporal workflow for orchestrating fuzzing
- Downloads user code from MinIO
- Executes AtherisFuzzer module
- Uploads results to MinIO
- Cleans up cache after execution
- metadata.yaml with vertical: python for routing

## Test Project (test_projects/python_fuzz_waterfall/)
- Demonstrates stateful waterfall vulnerability
- main.py with check_secret() that leaks progress
- fuzz_target.py with Atheris TestOneInput() harness
- Complete README with usage instructions
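
For reference, a fuzz_target.py harness of the shape AtherisFuzzer discovers might look like this; the check_secret() signature is assumed from the project description:

```python
import sys
import atheris

with atheris.instrument_imports():
    import main  # the module under test from the test project

def TestOneInput(data: bytes) -> None:
    fdp = atheris.FuzzedDataProvider(data)
    # check_secret() signature assumed; the waterfall bug leaks progress
    # as more of the secret prefix matches.
    main.check_secret(fdp.ConsumeUnicodeNoSurrogates(32))

if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()
```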

## Backend Fixes
- Fixed parameter merging in REST API endpoints (workflows.py)
- Changed workflow parameter passing from positional args to kwargs (manager.py)
- Default parameters now properly merged with user parameters

## Testing
- Worker discovered AtherisFuzzingWorkflow
- Workflow executed end-to-end successfully
- Fuzz target auto-discovered in nested directories
- Atheris ran 100,000 iterations
- Results uploaded and cache cleaned

* chore: Complete Temporal migration with updated CLI/SDK/docs

This commit includes all remaining Temporal migration changes:

## CLI Updates (cli/)
- Updated workflow execution commands for Temporal
- Enhanced error handling and exceptions
- Updated dependencies in uv.lock

## SDK Updates (sdk/)
- Client methods updated for Temporal workflows
- Updated models for new workflow execution
- Updated dependencies in uv.lock

## Documentation Updates (docs/)
- Architecture documentation for Temporal
- Workflow concept documentation
- Resource management documentation (new)
- Debugging guide (new)
- Updated tutorials and how-to guides
- Troubleshooting updates

## README Updates
- Main README with Temporal instructions
- Backend README
- CLI README
- SDK README

## Other
- Updated IMPLEMENTATION_STATUS.md
- Removed old vulnerable_app.tar.gz

These changes complete the Temporal migration and ensure the
CLI/SDK work correctly with the new backend.

* fix: Use positional args instead of kwargs for Temporal workflows

The Temporal Python SDK's start_workflow() method doesn't accept
a 'kwargs' parameter. Workflows must receive parameters as positional
arguments via the 'args' parameter.

Changed from passing a 'kwargs' parameter to:
  args=workflow_args  # Positional arguments

This fixes the error:
  TypeError: Client.start_workflow() got an unexpected keyword argument 'kwargs'

Workflows now correctly receive parameters in order:
- security_assessment: [target_id, scanner_config, analyzer_config, reporter_config]
- atheris_fuzzing: [target_id, target_file, max_iterations, timeout_seconds]
- rust_test: [target_id, test_message]
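
In temporalio terms, the fix looks roughly like this; the workflow ID, queue name, and surrounding function are illustrative:

```python
from temporalio.client import Client

async def submit_security_assessment(params: dict) -> str:
    client = await Client.connect("temporal:7233")
    # Multi-argument workflows receive parameters through the `args`
    # sequence; Client.start_workflow() has no `kwargs` parameter.
    handle = await client.start_workflow(
        "SecurityAssessmentWorkflow",
        args=[
            params["target_id"],
            params.get("scanner_config", {}),
            params.get("analyzer_config", {}),
            params.get("reporter_config", {}),
        ],
        id=f"security-assessment-{params['target_id']}",
        task_queue="python",  # illustrative queue name
    )
    return handle.id
```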

* fix: Filter metadata-only parameters from workflow arguments

SecurityAssessmentWorkflow was receiving 7 arguments instead of 2-5.
The issue was that target_path and volume_mode from default_parameters
were being passed to the workflow, when they should only be used by
the system for configuration.

Now filters out metadata-only parameters (target_path, volume_mode)
before passing arguments to workflow execution.
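
A sketch of the filtering, with the metadata key set taken from the commit description:

```python
# Keys consumed by the system itself and never passed to workflow code.
METADATA_ONLY_PARAMS = {"target_path", "volume_mode"}

def build_workflow_params(defaults: dict, user_params: dict) -> dict:
    """Merge defaults with user parameters, then drop system-only keys."""
    merged = {**defaults, **user_params}
    return {k: v for k, v in merged.items() if k not in METADATA_ONLY_PARAMS}
```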

* refactor: Remove Prefect leftovers and volume mounting legacy

Complete cleanup of Prefect migration artifacts:

Backend:
- Delete registry.py and workflow_discovery.py (Prefect-specific files)
- Remove Docker validation from setup.py (no longer needed)
- Remove ResourceLimits and VolumeMount models
- Remove target_path and volume_mode from WorkflowSubmission
- Remove supported_volume_modes from API and discovery
- Clean up metadata.yaml files (remove volume/path fields)
- Simplify parameter filtering in manager.py

SDK:
- Remove volume_mode parameter from client methods
- Remove ResourceLimits and VolumeMount models
- Remove Prefect error patterns from docker_logs.py
- Clean up WorkflowSubmission and WorkflowMetadata models

CLI:
- Remove Volume Modes display from workflow info

All removed features are Prefect-specific or Docker volume mounting
artifacts. Temporal workflows use MinIO storage exclusively.

* feat: Add comprehensive test suite and benchmark infrastructure

- Add 68 unit tests for fuzzer, scanner, and analyzer modules
- Implement pytest-based test infrastructure with fixtures
- Add 6 performance benchmarks with category-specific thresholds
- Configure GitHub Actions for automated testing and benchmarking
- Add test and benchmark documentation

Test coverage:
- AtherisFuzzer: 8 tests
- CargoFuzzer: 14 tests
- FileScanner: 22 tests
- SecurityAnalyzer: 24 tests

All tests passing (68/68)
All benchmarks passing (6/6)

* fix: Resolve all ruff linting violations across codebase

Fixed 27 ruff violations in 12 files:
- Removed unused imports (Depends, Dict, Any, Optional, etc.)
- Fixed undefined workflow_info variable in workflows.py
- Removed dead code with undefined variables in atheris_fuzzer.py
- Changed f-string to regular string where no placeholders used

All files now pass ruff checks for CI/CD compliance.

* fix: Configure CI for unit tests only

- Renamed docker-compose.temporal.yaml → docker-compose.yml for CI compatibility
- Commented out integration-tests job (no integration tests yet)
- Updated test-summary to only depend on lint and unit-tests

CI will now run successfully with 68 unit tests. Integration tests can be added later.

* feat: Add CI/CD integration with ephemeral deployment model

Implements comprehensive CI/CD support for FuzzForge with on-demand worker management:

**Worker Management (v0.7.0)**
- Add WorkerManager for automatic worker lifecycle control
- Auto-start workers from stopped state when workflows execute
- Auto-stop workers after workflow completion
- Health checks and startup timeout handling (90s default)

**CI/CD Features**
- `--fail-on` flag: Fail builds based on SARIF severity levels (error/warning/note/info)
- `--export-sarif` flag: Export findings in SARIF 2.1.0 format
- `--auto-start`/`--auto-stop` flags: Control worker lifecycle
- Exit code propagation: Returns 1 on blocking findings, 0 on success
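
The severity gate can be expressed as a simple threshold over SARIF result levels. A sketch of the logic, not the CLI's actual implementation:

```python
import json

# Ordered from least to most severe, matching the --fail-on choices above.
LEVELS = ["info", "note", "warning", "error"]

def has_blocking_findings(sarif_path: str, fail_on: str) -> bool:
    """Return True if any result is at or above the --fail-on threshold."""
    threshold = LEVELS.index(fail_on)
    with open(sarif_path) as f:
        report = json.load(f)
    for run in report.get("runs", []):
        for result in run.get("results", []):
            level = result.get("level", "warning")  # SARIF's default for failures
            if level in LEVELS and LEVELS.index(level) >= threshold:
                return True
    return False
```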

**Exit Code Fix**
- Add `except typer.Exit: raise` handlers at 3 critical locations
- Move worker cleanup to finally block for guaranteed execution
- Exit codes now propagate correctly even when build fails
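
The pattern behind the fix: typer.Exit is an exception, so a broad `except Exception` handler would swallow it and the process would exit 0. A sketch, where start_worker(), run_scan(), and stop_worker() are hypothetical helpers:

```python
import typer

def scan_command(auto_stop: bool = True) -> None:
    worker_started = False
    try:
        worker_started = start_worker()    # hypothetical auto-start helper
        blocking = run_scan()              # hypothetical scan + --fail-on gate
        if blocking:
            raise typer.Exit(code=1)
    except typer.Exit:
        raise  # re-raise so the exit code survives the broad handler below
    except Exception as exc:
        typer.echo(f"error: {exc}", err=True)
        raise typer.Exit(code=1)
    finally:
        if worker_started and auto_stop:
            stop_worker()  # hypothetical cleanup; finally guarantees it runs
```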

**CI Scripts & Examples**
- ci-start.sh: Start FuzzForge services with health checks
- ci-stop.sh: Clean shutdown with volume preservation option
- GitHub Actions workflow example (security-scan.yml)
- GitLab CI pipeline example (.gitlab-ci.example.yml)
- docker-compose.ci.yml: CI-optimized compose file with profiles

**OSS-Fuzz Integration**
- New ossfuzz_campaign workflow for running OSS-Fuzz projects
- OSS-Fuzz worker with Docker-in-Docker support
- Configurable campaign duration and project selection

**Documentation**
- Comprehensive CI/CD integration guide (docs/how-to/cicd-integration.md)
- Updated architecture docs with worker lifecycle details
- Updated workspace isolation documentation
- CLI README with worker management examples

**SDK Enhancements**
- Add get_workflow_worker_info() endpoint
- Worker vertical metadata in workflow responses

**Testing**
- All workflows tested: security_assessment, atheris_fuzzing, secret_detection, cargo_fuzzing
- All monitoring commands tested: stats, crashes, status, finding
- Full CI pipeline simulation verified
- Exit codes verified for success/failure scenarios

Ephemeral CI/CD model: ~3-4GB RAM, ~60-90s startup, runs entirely in CI containers.

* fix: Resolve ruff linting violations in CI/CD code

- Remove unused variables (run_id, defaults, result)
- Remove unused imports
- Fix f-string without placeholders

All CI/CD integration files now pass ruff checks.

FuzzForge AI: Migration Decision Document

Date: 2025-10-01 (Updated)
Status: Architecture Revised - Ready for Implementation
Decision Makers: FuzzingLabs Team
Recommendation: Migrate to Temporal with Vertical Workers + MinIO


🔄 CRITICAL UPDATE (2025-10-01)

Initial analysis was incomplete. The original architecture document missed a critical requirement:

"Workflows are dynamic and have to be created without modifying the codebase"

What Changed

The original plan proposed "no registry needed" with long-lived workers, but failed to address how dynamic workflows with custom dependencies would work. This created a fundamental contradiction.

Revised Architecture

New approach: Vertical Workers + MinIO

| Aspect | Original Plan | Revised Plan |
|---|---|---|
| Workers | Generic long-lived | Vertical-specific (Android, Rust, Web, iOS, etc.) |
| Toolchains | Install per workflow | Pre-built per vertical |
| Workflows | Unclear | Mounted as volume (no rebuild) |
| Storage | LocalVolumeStorage (dev) / S3 (prod) | MinIO everywhere (unified) |
| Target Access | Host filesystem mounts | Upload to MinIO (secure) |
| Registry | Eliminated | Eliminated (workflows in volume, not images) |
| Services | 1 (Temporal only) | 6 (Temporal + MinIO + 3+ vertical workers) |
| Memory | "~4.5GB" | ~2.3GB (realistic calculation) |
Key Insights

  1. Dynamic workflows ARE compatible with long-lived workers via volume mounting
  2. Verticals solve the toolchain problem (pre-built, no per-workflow installs)
  3. MinIO is lightweight (256MB with CI_CD=true) and provides unified storage
  4. No registry overhead (workflow code mounted, not built into images)
  5. Better marketing (sell "security verticals", not "orchestration platform")

What This Means

  • Migration still recommended
  • Timeline extended to 10 weeks (from 8)
  • More services but better architecture
  • Addresses all original pain points
  • Supports dynamic workflows correctly

See ARCHITECTURE.md v2.0 for full details.


Table of Contents

  1. Executive Summary
  2. Current State Analysis
  3. Proposed Solution: Temporal Migration
  4. For & Against: Temporal vs Prefect
  5. For & Against: Long-Lived vs Ephemeral Workers
  6. Future Consideration: Nomad vs Kubernetes vs Docker Compose
  7. Benefits Summary
  8. Risks & Mitigations
  9. Cost Analysis
  10. Timeline & Effort
  11. Licensing Considerations
  12. Recommendation

Executive Summary

The Proposal

Migrate from Prefect to Temporal for workflow orchestration, simplifying infrastructure from 6 services to 1 while maintaining module architecture and preparing for future scale.

Why Consider This?

Current Prefect setup has grown complex with:

  • 6 services to manage (Prefect, Postgres, Redis, Registry, Docker-proxy, Worker)
  • Unclear scaling path for high-volume production
  • Registry overhead for module isolation
  • Complex volume mounting configuration

Key Decision Points

| Decision | Recommendation | Timeline |
|---|---|---|
| Replace Prefect? | Yes - with Temporal | Now (Weeks 1-8) |
| Worker Strategy? | Long-lived containers | Now (Weeks 3-4) |
| Storage Strategy? | Abstract layer (Local→S3) | Now (Week 3) |
| Add Nomad? | Later - when 10+ hosts | 18-24 months |
| Add Kubernetes? | No - unnecessary complexity | N/A |

Bottom Line

Recommended: Proceed with Temporal migration.

  • Effort: 8 weeks, Medium complexity
  • Risk: Low (rollback possible, modules unchanged)
  • Benefit: 83% infrastructure reduction, clear scaling path, better reliability

Current State Analysis

Prefect Architecture (Current)

Infrastructure:
├─ Prefect Server (orchestration)
├─ Postgres (metadata storage)
├─ Redis (task queue)
├─ Docker Registry (image sharing)
├─ Docker Proxy (container isolation)
└─ Prefect Worker (execution)

Total: 6 services

Strengths of Current Setup

| Aspect | Rating | Notes |
|---|---|---|
| Familiarity | High | Team knows Prefect well |
| Functionality | Good | Workflows execute successfully |
| Module System | Excellent | BaseModule interface is solid |
| Documentation | Good | Internal docs exist |

Pain Points

| Issue | Impact | Frequency | Severity |
|---|---|---|---|
| Infrastructure Complexity | Managing 6 services | Continuous | High |
| Registry Overhead | Push/pull for every deployment | Every change | Medium |
| Unclear Scaling | How to go multi-host? | Future planning | High |
| Resource Usage | ~8GB under load | Continuous | Medium |
| Volume Mounting | Complex job_variables config | Every workflow | Medium |

Why Change Now?

  1. Planning for Scale: Need clear path from 1 host → multi-host → cluster
  2. Infrastructure Debt: 6 services growing harder to maintain
  3. Better Options Available: Temporal provides simpler, more scalable solution
  4. Module System Stable: Can migrate orchestration without touching modules
  5. Right Time: Before production scale makes migration harder

Proposed Solution: Temporal Migration

Target Architecture

Infrastructure:
├─ Temporal Server (orchestration + storage)
└─ Worker Pools (3 types, auto-discover modules)

Total: 1 service (+ workers)

Migration Phases

Phase 1: Single Host (Weeks 1-8)

  • Replace Prefect with Temporal
  • Long-lived worker pools
  • LocalVolumeStorage (volume mounts)
  • Capacity: 15-50 concurrent workflows

Phase 2: Multi-Host (Months 6-18)

  • Same architecture, multiple hosts
  • Switch to S3CachedStorage
  • Capacity: 3× Phase 1

Phase 3: Nomad Cluster (Months 18+, if needed)

  • Add Nomad for advanced orchestration
  • Auto-scaling, multi-tenancy
  • Capacity: Unlimited horizontal scaling

For & Against: Temporal vs Prefect

Option A: Keep Prefect (Status Quo)

For (Arguments to Keep Prefect)

  1. No Migration Effort

    • Zero weeks of migration work
    • No learning curve
    • No risk of migration issues
  2. Team Familiarity

    • Team knows Prefect well
    • Existing operational runbooks
    • Established debugging patterns
  3. Working System

    • Current workflows function correctly
    • No immediate technical blocker
    • "If it ain't broke, don't fix it"
  4. Deferred Complexity

    • Can delay architecture decisions
    • Focus on feature development
    • Postpone infrastructure changes

Against (Arguments Against Keeping Prefect)

  1. Infrastructure Complexity

    • 6 services to manage and monitor
    • Complex dependencies (Postgres, Redis, Registry)
    • High operational overhead
  2. Scaling Uncertainty

    • Unclear how to scale beyond single host
    • Registry becomes bottleneck at scale
    • No clear multi-host story
  3. Resource Inefficiency

    • ~2GB idle, ~8GB under load
    • Registry storage overhead
    • Redundant service layers
  4. Technical Debt Accumulation

    • Complexity will only increase
    • Harder to migrate later (more workflows)
    • Missing modern features (durable execution)
  5. Prefect Ecosystem Concerns

    • Prefect 3.x changes from 2.x
    • Community split (Cloud vs self-hosted)
    • Uncertain long-term roadmap

Option B: Migrate to Temporal

For (Arguments to Migrate)

  1. Dramatic Simplification

    • 6 services → 1 service (83% reduction)
    • No registry needed (local images)
    • Simpler volume mounting
  2. Better Reliability

    • Durable execution (workflows survive crashes)
    • Built-in state persistence
    • Proven at massive scale (Netflix, Uber, Snap)
  3. Clear Scaling Path

    • Single host → Multi-host → Nomad cluster
    • Architecture designed for scale
    • Storage abstraction enables seamless transition
  4. Superior Workflow Engine

    • True durable execution vs task queue
    • Better state management
    • Handles long-running workflows (fuzzing campaigns)
    • Activity timeouts and retries built-in
  5. Operational Benefits

    • Better Web UI for debugging
    • Comprehensive workflow history
    • Query workflow state at any time
    • Simpler deployment (single service)
  6. Future-Proof Architecture

    • Easy Nomad migration path (18+ months)
    • Multi-tenancy ready (namespaces)
    • Auto-scaling capable
    • Industry momentum (growing adoption)
  7. Module Preservation

    • Zero changes to BaseModule interface
    • Module discovery unchanged
    • Workflows adapt easily (@flow → @workflow)
  8. Resource Efficiency

    • ~1GB idle, ~4.5GB under load
    • 44% reduction in resource usage
    • No registry storage overhead

Against (Arguments Against Migration)

  1. Migration Effort

    • 8 weeks of focused work
    • Team capacity diverted from features
    • Testing and validation required
  2. Learning Curve

    • New concepts (workflows vs activities)
    • Different debugging approach
    • Team training needed
  3. Migration Risk

    • Potential for workflow disruption
    • Bugs in migration code
    • Temporary performance issues
  4. Unknown Unknowns

    • May discover edge cases
    • Performance characteristics differ
    • Integration challenges possible
  5. Temporal Limitations

    • Less mature than Prefect in some areas
    • Smaller community (growing)
    • Fewer pre-built integrations

Scoring Matrix

| Criteria | Weight | Prefect | Temporal | Winner |
|---|---|---|---|---|
| Infrastructure Complexity | 25% | 3/10 | 9/10 | Temporal |
| Scalability | 20% | 4/10 | 9/10 | Temporal |
| Reliability | 20% | 7/10 | 10/10 | Temporal |
| Migration Effort | 15% | 10/10 | 4/10 | Prefect |
| Team Familiarity | 10% | 9/10 | 3/10 | Prefect |
| Resource Efficiency | 10% | 5/10 | 8/10 | Temporal |
| Total | 100% | 5.85/10 | 7.75/10 | Temporal |
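
Working the weighted sums: Prefect = 0.25×3 + 0.20×4 + 0.20×7 + 0.15×10 + 0.10×9 + 0.10×5 = 5.85; Temporal = 0.25×9 + 0.20×9 + 0.20×10 + 0.15×4 + 0.10×3 + 0.10×8 = 7.75.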

Conclusion: Temporal wins on technical merit despite migration costs.


For & Against: Long-Lived vs Ephemeral Workers

Context

Workers can spawn ephemeral containers per workflow (like Prefect) or run as long-lived containers processing multiple workflows.

Option A: Ephemeral Containers

For

  1. Complete Isolation

    • Each workflow in fresh container
    • No state leakage between workflows
    • Maximum security
  2. Automatic Cleanup

    • Containers destroyed after workflow
    • No resource leaks
    • Clean slate every time
  3. Matches Current Behavior

    • Similar to Prefect approach
    • Easier mental model
    • Less architecture change
  4. Simple Development

    • Test with docker run
    • No complex lifecycle management
    • Easy to debug

Against

  1. Performance Overhead

    • 5 second startup per container
    • At 450 workflows/hour: ~37.5 minutes of startup overhead wasted per hour (450 × 5s)
    • Unacceptable at production scale
  2. Resource Churn

    • Constant container creation/destruction
    • Docker daemon overhead
    • Network/volume setup repeated
  3. Scaling Limitations

    • Can't handle high-volume workloads
    • Startup overhead compounds
    • Poor resource utilization

Option B: Long-Lived Containers

For

  1. Zero Startup Overhead

    • Containers already running
    • Immediate workflow execution
    • Critical for high-volume production
  2. Resource Efficiency

    • Fixed 4.5GB RAM handles 15 concurrent workflows
    • vs ~76GB for ephemeral approach
    • 10-20× better resource utilization
  3. Predictable Performance

    • Consistent response times
    • No container startup jitter
    • Better SLA capability
  4. Horizontal Scaling

    • Add more workers linearly
    • Each worker handles N concurrent
    • Clear capacity planning
  5. Production-Ready

    • Proven pattern (Uber, Airbnb)
    • Handles thousands of workflows/day
    • Industry standard for scale

Against

  1. Volume Mounting Complexity

    • Must mount parent directories
    • Or implement S3 storage backend
    • More sophisticated configuration
  2. Shared Container State

    • Workers reused across workflows
    • Potential for subtle bugs
    • Requires careful module design
  3. Lifecycle Management

    • Must handle worker restarts
    • Graceful shutdown needed
    • More complex monitoring
  4. Memory Management

    • Workers accumulate memory over time
    • Need periodic restarts
    • Requires memory limits

Decision Matrix

| Scenario | Ephemeral | Long-Lived | Winner |
|---|---|---|---|
| Development | Simpler | ⚠️ Complex | Ephemeral |
| Low Volume (<10/hour) | Acceptable | Overkill | Ephemeral |
| Medium Volume (10-100/hour) | ⚠️ Wasteful | Efficient | Long-Lived |
| High Volume (>100/hour) | Unusable | Required | Long-Lived |
| Production Scale | No | Yes | Long-Lived |

Recommendation: Long-lived workers for production deployment.

Compromise: Can start with ephemeral for Phase 1 (proof of concept), migrate to long-lived for Phase 2 (production).


Future Consideration: Nomad vs Kubernetes vs Docker Compose

When to Consider Orchestration Beyond Docker Compose?

Trigger Points:

  • Managing 10+ hosts manually
  • Need multi-tenancy (customer isolation)
  • Require auto-scaling based on metrics
  • Want sophisticated scheduling (bin-packing, constraints)

Timeline Estimate: 18-24 months from now

Option A: Docker Compose

For

  1. Simplicity

    • Single YAML file
    • No cluster setup
    • Easy to understand and debug
  2. Zero Learning Curve

    • Team already knows Docker
    • Familiar commands
    • Abundant documentation
  3. Sufficient for 1-5 Hosts

    • Deploy same compose file to each host
    • Manual but manageable
    • Works for current scale
  4. Development Friendly

    • Same config dev and prod
    • Fast iteration cycle
    • Easy local testing
  5. No Lock-In

    • Easy to migrate to Nomad/K8s later
    • Workers portable by design
    • Clean exit strategy

Against

  1. Manual Coordination

    • No automatic scheduling
    • Manual load balancing
    • No health-based rescheduling
  2. Limited Scaling

    • Practical limit ~5-10 hosts
    • No auto-scaling
    • Manual capacity planning
  3. No Multi-Tenancy

    • Can't isolate customers
    • No resource quotas
    • Shared infrastructure
  4. Basic Monitoring

    • No cluster-wide metrics
    • Per-host monitoring only
    • Limited observability

Verdict: Perfect for Phase 1 (single host) and Phase 2 (3-5 hosts). Transition to Nomad/K8s at Phase 3.

Option B: Nomad

For

  1. Operational Simplicity

    • Single binary (vs K8s complexity)
    • Easy to install and maintain
    • Lower operational overhead
  2. Perfect Fit for Use Case

    • Batch workload focus
    • Resource management built-in
    • Namespace support for multi-tenancy
  3. Multi-Workload Support

    • Containers (Docker)
    • VMs (QEMU)
    • Bare processes
    • Java JARs
    • All in one scheduler
  4. Scheduling Intelligence

    • Bin-packing for efficiency
    • Constraint-based placement
    • Affinity/anti-affinity rules
    • Resource quotas per namespace
  5. Easy Migration from Docker Compose

    • Similar concepts
    • compose-to-nomad converter tool
    • Workers unchanged
    • 1-2 week migration
  6. HashiCorp Ecosystem

    • Integrates with Consul (service discovery)
    • Integrates with Vault (secrets)
    • Proven at scale (Cloudflare, CircleCI)
  7. Auto-Scaling

    • Built-in scaling policies
    • Prometheus integration
    • Queue-depth based scaling
    • Horizontal scaling automatic

Against

  1. Learning Curve

    • HCL syntax to learn
    • New concepts (allocations, deployments)
    • Consul integration complexity
  2. Smaller Ecosystem

    • Fewer tools than Kubernetes
    • Smaller community
    • Less third-party integrations
  3. Network Isolation

    • Less sophisticated than K8s
    • Requires Consul Connect for service mesh
    • Weaker network policies
  4. Maturity

    • Less mature than Kubernetes
    • Fewer production battle stories
    • Evolving feature set

Verdict: Excellent choice once you outgrow Docker Compose. Simpler than K8s, and a good fit for FuzzForge's scale.

Option C: Kubernetes

For

  1. Industry Standard

    • Largest ecosystem
    • Most third-party integrations
    • Abundant expertise available
  2. Feature Richness

    • Sophisticated networking (Network Policies)
    • Advanced scheduling
    • Rich operator ecosystem
    • Helm charts for everything
  3. Multi-Tenancy

    • Strong namespace isolation
    • RBAC fine-grained
    • Network policies
    • Pod Security Policies
  4. Massive Scale

    • Proven to 5,000+ nodes
    • Google-scale reliability
    • Battle-tested
  5. Cloud Integration

    • Native on all clouds (EKS, GKE, AKS)
    • Managed offerings reduce complexity
    • Auto-scaling (HPA, Cluster Autoscaler)

Against

  1. Operational Complexity

    • High learning curve
    • Complex to set up and maintain
    • Requires dedicated ops team
  2. Resource Overhead

    • Control plane resource usage
    • etcd cluster management
    • More moving parts
  3. Overkill for Use Case

    • FuzzForge is batch workload, not microservices
    • Don't need K8s networking complexity
    • Simpler alternatives sufficient
  4. Container-Only

    • Can't run VMs easily
    • Can't run bare processes
    • Nomad more flexible
  5. Cost

    • Higher operational cost
    • More infrastructure required
    • Steeper learning investment

Verdict: Overkill for FuzzForge. Choose only if planning 1,000+ hosts or need extensive ecosystem.

Comparison Matrix

| Feature | Docker Compose | Nomad | Kubernetes |
|---|---|---|---|
| Operational Complexity | ★☆☆☆☆ (Lowest) | ★★☆☆☆ (Low) | ★★★★☆ (High) |
| Learning Curve | ★☆☆☆☆ (Easy) | ★★★☆☆ (Medium) | ★★★★★ (Steep) |
| Setup Time | Minutes | 1 day | 1-2 weeks |
| Best For | 1-5 hosts | 10-500 hosts | 500+ hosts |
| Auto-Scaling | No | Yes | Yes |
| Multi-Tenancy | No | Yes (Namespaces) | Yes (Advanced) |
| Workload Types | Containers | Containers + VMs + Processes | Containers (mainly) |
| Service Mesh | No | ⚠️ Via Consul Connect | Istio/Linkerd |
| Ecosystem Size | Medium | Small | Huge |
| Resource Efficiency | High | High | Medium |
| FuzzForge Fit | Phase 1-2 | Phase 3+ | ⚠️ Unnecessary |

Recommendation Timeline

Months 0-6:   Docker Compose (Single Host)
               └─ Simplest, fastest to implement

Months 6-18:  Docker Compose (Multi-Host)
               └─ Scale to 3-5 hosts manually

Months 18+:   Nomad (if needed)
               └─ Add when 10+ hosts or auto-scaling required

Never:        Kubernetes
               └─ Unless scale exceeds 500+ hosts

Benefits Summary

Infrastructure Benefits

| Metric | Current (Prefect) | Future (Temporal) | Improvement |
|---|---|---|---|
| Services to Manage | 6 | 1 | 83% reduction |
| Idle Memory Usage | ~2GB | ~1GB | 50% reduction |
| Load Memory Usage | ~8GB | ~4.5GB | 44% reduction |
| Docker Registry | Required | Not needed | Eliminated |
| Configuration Files | 6 service configs | 1 config | 83% simpler |
| Deployment Complexity | High | Low | Significant |

Operational Benefits

  1. Simpler Monitoring

    • 1 service vs 6
    • Single Web UI (Temporal)
    • Fewer alerts to configure
  2. Easier Debugging

    • Complete workflow history in Temporal
    • Query workflow state at any time
    • Better error visibility
  3. Faster Deployments

    • No registry push/pull
    • Restart 1 service vs 6
    • Quicker iteration cycles
  4. Better Reliability

    • Durable execution (workflows survive crashes)
    • Automatic retries built-in
    • State persistence guaranteed
  5. Clear Scaling Path

    • Phase 1: Single host (now)
    • Phase 2: Multi-host (6-18 months)
    • Phase 3: Nomad cluster (18+ months)

Developer Experience Benefits

  1. Local Development

    • Simpler docker-compose
    • Faster startup (fewer services)
    • Easier to reason about
  2. Module Development

    • No changes to BaseModule
    • Same discovery mechanism
    • Same testing approach
  3. Workflow Development

    • Better debugging tools (Temporal Web UI)
    • Workflow history visualization
    • Easier to test retry logic
  4. Onboarding

    • 1 service to understand vs 6
    • Clearer architecture
    • Less to learn

Risks & Mitigations

Risk 1: Migration Introduces Bugs

Likelihood: Medium · Impact: High · Risk Score: 6/10

Mitigation:

  • Phased migration (one workflow at a time)
  • Parallel run (Prefect + Temporal) during transition
  • Comprehensive testing before cutover
  • Rollback plan documented

Risk 2: Performance Degradation

Likelihood: Low · Impact: Medium · Risk Score: 3/10

Mitigation:

  • Load testing before production
  • Monitor key metrics during migration
  • Temporal proven at higher scale than current
  • Easy to tune worker concurrency

Risk 3: Team Learning Curve

Likelihood: High · Impact: Low · Risk Score: 4/10

Mitigation:

  • Training sessions on Temporal concepts
  • Pair programming during migration
  • Comprehensive documentation
  • Temporal has excellent docs

Risk 4: Unknown Edge Cases

Likelihood: Medium · Impact: Medium · Risk Score: 5/10

Mitigation:

  • Thorough testing with real workflows
  • Gradual rollout (dev → staging → production)
  • Keep Prefect running initially
  • Community support available

Risk 5: Module System Incompatibility

Likelihood: Very Low · Impact: High · Risk Score: 2/10

Mitigation:

  • Module interface preserved (BaseModule unchanged)
  • Only orchestration changes
  • Modules are decoupled from Prefect
  • Test suite validates module behavior

Risk 6: Long-Lived Worker Stability

Likelihood: Low · Impact: Medium · Risk Score: 3/10

Mitigation:

  • Proper resource limits (memory, CPU)
  • Periodic worker restarts (daily)
  • Monitoring for memory leaks
  • Health checks and auto-restart

Overall Risk Assessment

Total Risk Score: 23/60 (38%) - Medium-Low Risk

Conclusion: Risks are manageable with proper planning and mitigation strategies.


Cost Analysis

Current Costs (Prefect)

Infrastructure:

Single Host (8GB RAM, 4 CPU):
  - Cloud VM: $80-120/month
  - Or bare metal amortized: ~$50/month

Services Running:
  - Prefect Server: ~500MB
  - Postgres: ~200MB
  - Redis: ~100MB
  - Registry: ~500MB
  - Docker Proxy: ~50MB
  - Worker: ~500MB
  - Workflows: ~6GB (peak)
  Total: ~8GB

Development Time:
  - Maintenance: ~2 hours/week
  - Debugging: ~3 hours/week
  - Deployments: ~1 hour/week
  Total: 6 hours/week = $600/month (at $25/hour)

Monthly Total: ~$700/month

Future Costs (Temporal)

Phase 1 - Single Host:

Single Host (6GB RAM, 4 CPU):
  - Cloud VM: $60-80/month
  - Or bare metal amortized: ~$40/month

Services Running:
  - Temporal: ~1GB
  - Workers: ~3.5GB
  - Workflows: ~1GB (peak)
  Total: ~5.5GB

Development Time:
  - Maintenance: ~1 hour/week
  - Debugging: ~2 hours/week
  - Deployments: ~0.5 hour/week
  Total: 3.5 hours/week = $350/month

Monthly Total: ~$430/month

Phase 2 - Multi-Host (3 hosts):

3 Hosts + S3 Storage:
  - Cloud VMs: $180-240/month
  - S3 storage (1TB): ~$23/month
  - S3 transfer (100GB): ~$9/month

Development Time:
  - Maintenance: ~2 hours/week
  - Monitoring: ~2 hours/week
  Total: 4 hours/week = $400/month

Monthly Total: ~$670/month (3× capacity)

Phase 3 - Nomad Cluster (10+ hosts):

Nomad Cluster:
  - 3 Nomad servers: $120/month
  - 10 worker hosts: $800/month
  - S3 storage (5TB): ~$115/month
  - Load balancer: ~$20/month

Development Time:
  - Nomad maintenance: ~3 hours/week
  - Monitoring: ~3 hours/week
  Total: 6 hours/week = $600/month

Monthly Total: ~$1,655/month (10× capacity)

Cost Comparison

| Phase | Hosts | Capacity | Monthly Cost | Cost per Workflow |
|---|---|---|---|---|
| Current (Prefect) | 1 | 10K/day | $700 | $0.0023 |
| Phase 1 (Temporal) | 1 | 10K/day | $430 | $0.0014 |
| Phase 2 (Temporal) | 3 | 30K/day | $670 | $0.0007 |
| Phase 3 (Nomad) | 10 | 100K/day | $1,655 | $0.0005 |
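
Cost per workflow here is monthly cost divided by monthly volume (daily capacity × 30 days); e.g. Phase 1: $430 ÷ 300,000 ≈ $0.0014.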

Savings:

  • Phase 1 vs Current: $270/month (39% reduction)
  • Better cost efficiency as scale increases

Timeline & Effort

Phase 1: Temporal Migration (8 Weeks)

Week 1-2: Foundation

  • Deploy Temporal server
  • Remove Prefect infrastructure
  • Implement storage abstraction layer
  • Effort: 60-80 hours

Week 3-4: Workers

  • Create long-lived worker pools
  • Implement module auto-discovery
  • Configure Docker Compose
  • Effort: 60-80 hours

Week 5-6: Workflows

  • Migrate workflows to Temporal
  • Convert @flow → @workflow.defn
  • Test all workflows
  • Effort: 60-80 hours
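
The @flow → @workflow.defn conversion planned above is mostly mechanical. A sketch of the target shape, with an assumed activity name:

```python
from datetime import timedelta
from temporalio import workflow

# Prefect (before):            Temporal (after):
#   @flow                        @workflow.defn class + @workflow.run method
#   async def assess(...): ...

@workflow.defn
class SecurityAssessmentWorkflow:
    @workflow.run
    async def run(self, target_id: str, scanner_config: dict) -> dict:
        # Module execution moves into activities; the workflow only orchestrates.
        return await workflow.execute_activity(
            "run_security_scan",  # assumed activity name
            args=[target_id, scanner_config],
            start_to_close_timeout=timedelta(minutes=30),
        )
```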

Week 7: Integration

  • Update backend API
  • End-to-end testing
  • Load testing
  • Effort: 40-60 hours

Week 8: Documentation & Cleanup

  • Update documentation
  • Remove old code
  • Training sessions
  • Effort: 30-40 hours

Total Effort: 250-340 hours (~2 engineers for 2 months)

Phase 2: Multi-Host (When Needed)

Effort: 40-60 hours

  • Set up S3 storage
  • Deploy to multiple hosts
  • Configure load balancing
  • Test and validate

Phase 3: Nomad (If Needed)

Effort: 80-120 hours

  • Install Nomad cluster
  • Convert jobs to Nomad
  • Set up auto-scaling
  • Production deployment

Licensing Considerations

Overview

Critical Context: FuzzForge is a generic platform where modules and workflows "could be anything" - not limited to fuzzing or security analysis. This significantly impacts the licensing assessment, particularly for Nomad's Business Source License.

Temporal Licensing: SAFE

License: MIT License

Status: Fully open source, zero restrictions

Commercial Use:

  • Use in production
  • Sell services built on Temporal
  • Modify source code
  • Redistribute
  • Sublicense
  • Private use

Conclusion: Temporal has no licensing concerns for any use case. You can build any type of platform (fuzzing, security, generic workflows, orchestration-as-a-service) without legal risk.

Reference: https://github.com/temporalio/temporal/blob/master/LICENSE


Nomad Licensing: ⚠️ REQUIRES CAREFUL EVALUATION

License: Business Source License 1.1 (BSL 1.1)

Status: Source-available but with restrictions

BSL 1.1 Key Terms

Change Date: 4 years after each version release
Change License: Mozilla Public License 2.0 (MPL 2.0)

After 4 years: Each version becomes fully open source under MPL 2.0

The Critical Restriction

Additional Use Grant:
You may make use of the Licensed Work, provided that you do not use
the Licensed Work for a Competitive Offering.

A "Competitive Offering" is a commercial product or service that is:
1. Substantially similar to the capabilities of the Licensed Work
2. Offered to third parties on a paid or free basis

What This Means for FuzzForge

The licensing risk depends on how FuzzForge is marketed and positioned:

LIKELY SAFE: Specific Use Case Platform

If FuzzForge is marketed as a specialized platform for specific domains:

Examples:

  • "FuzzForge - Security Analysis Platform"
  • "FuzzForge - Automated Fuzzing Service"
  • "FuzzForge - Code Analysis Tooling"
  • "FuzzForge - Vulnerability Assessment Platform"

Why Safe:

  • Nomad is used internally for infrastructure
  • Customer is buying fuzzing/security services, not orchestration
  • Platform's value is the domain expertise, not the scheduler
  • Not competing with HashiCorp's offerings

⚠️ GRAY AREA: Generic Workflow Platform

If FuzzForge pivots to emphasize generic workflow capabilities:

Examples:

  • ⚠️ "FuzzForge - Workflow Orchestration Platform"
  • ⚠️ "FuzzForge - Run any containerized workload"
  • ⚠️ "FuzzForge - Generic task scheduler"
  • ⚠️ Marketing that emphasizes "powered by Nomad"

Why Risky:

  • Could be seen as competing with Nomad Enterprise
  • Offering similar capabilities to HashiCorp's products
  • Customer might use it as Nomad replacement

CLEARLY VIOLATES: Orchestration-as-a-Service

If FuzzForge becomes primarily an orchestration product:

Examples:

  • "FuzzForge Orchestrator - Schedule any workload"
  • "Nomad-as-a-Service powered by FuzzForge"
  • "Generic container orchestration platform"
  • Reselling Nomad capabilities with thin wrapper

Why Violation:

  • Directly competing with HashiCorp Nomad offerings
  • "Substantially similar" to Nomad's capabilities
  • Commercial offering of orchestration

Real-World Precedents

HashiCorp has NOT (as of 2025) aggressively enforced BSL against companies using their tools internally. The restriction targets:

  • Cloud providers offering "managed Nomad" services
  • Companies building Nomad competitors
  • Vendors reselling HashiCorp functionality

NOT targeting:

  • Companies using Nomad for internal infrastructure
  • SaaS platforms that happen to use Nomad
  • Domain-specific platforms (like FuzzForge's security focus)

Decision Tree: Should I Use Nomad?

┌─────────────────────────────────────┐
│ Is orchestration your core product? │
└─────────────────────────────────────┘
                  │
         ┌────────┴────────┐
         │                 │
        YES               NO
         │                 │
    ┌────┴────┐       ┌────┴────┐
    │ DON'T   │       │ What's   │
    │ USE     │       │ your     │
    │ NOMAD   │       │ value    │
    │         │       │ prop?    │
    └─────────┘       └─────┬────┘
                            │
                ┌───────────┴───────────┐
                │                       │
          Domain Expertise      Orchestration Features
          (Fuzzing, Security)   (Scheduling, Auto-scale)
                │                       │
           ┌────┴────┐            ┌────┴────┐
           │ SAFE TO │            │ RISKY - │
           │ USE     │            │ CONSULT │
           │ NOMAD   │            │ LAWYER  │
           └─────────┘            └─────────┘

FuzzForge Current Position

Current Positioning: Domain-specific security/analysis platform
Nomad Usage: Internal infrastructure (not customer-facing)
Risk Level: LOW (likely safe)

However, the team has stated that "modules and workflows could be anything", which suggests potential future expansion beyond the security domain.

If FuzzForge pivots to generic platform:

  • Risk increases from LOW → MEDIUM
  • Need legal review before Phase 3 (Nomad migration)
  • Consider Kubernetes as alternative

Kubernetes Licensing: SAFE

License: Apache License 2.0

Status: Fully open source, zero restrictions

Commercial Use:

  • Use in production
  • Sell services built on Kubernetes
  • Modify source code
  • Offer managed Kubernetes (AWS EKS, GCP GKE do this)
  • Build competitive offerings

Conclusion: Kubernetes has no licensing concerns whatsoever, even for orchestration-as-a-service offerings.


Docker Licensing: SAFE

License: Apache License 2.0

Status: Fully open source

Note: Docker Desktop has separate commercial licensing requirements for organizations >250 employees or >$10M revenue, but Docker Engine (which FuzzForge uses) remains free for all uses.


Licensing Recommendation Matrix

| Component | License | FuzzForge Risk | Recommendation |
|---|---|---|---|
| Temporal | MIT | None | Use freely |
| Docker Engine | Apache 2.0 | None | Use freely |
| Nomad | BSL 1.1 | ⚠️ Low-Medium | Safe if domain-specific |
| Kubernetes | Apache 2.0 | None | Safe alternative to Nomad |

Recommendations by Phase

Phase 1 & 2: Temporal + Docker Compose

Licenses: MIT (Temporal) + Apache 2.0 (Docker)
Risk: ZERO - Fully safe for any use case

Action: Proceed without legal review required


Phase 3: Adding Nomad (18+ months)

License: BSL 1.1
Risk: ⚠️ LOW-MEDIUM - Depends on positioning

Action Required BEFORE Migration:

  1. Clarify Product Positioning

    • Will FuzzForge market as generic platform?
    • Or remain domain-specific (security/fuzzing)?
  2. Legal Review (Recommended)

    • Consult IP lawyer familiar with BSL
    • Show marketing materials, website copy
    • Get written opinion on BSL compliance
    • Cost: $2,000-5,000 (one-time)
  3. Decision Point:

    IF positioning = domain-specific (security/fuzzing)
    THEN proceed with Nomad (low risk)
    
    ELSE IF positioning = generic platform
    THEN consider Kubernetes instead (zero risk)
    

Alternative: Use Kubernetes Instead of Nomad

If concerned about Nomad BSL risk:

Pros:

  • Zero licensing risk (Apache 2.0)
  • Can offer orchestration-as-a-service freely
  • Larger ecosystem and community
  • Managed offerings on all clouds

Cons:

  • Higher operational complexity than Nomad
  • Overkill for batch workload use case
  • Steeper learning curve

When to Choose K8s Over Nomad:

  • Planning to market as generic platform
  • Uncomfortable with BSL restrictions
  • Need absolute licensing certainty
  • Have K8s expertise already

Licensing Risk Summary

| Scenario | Temporal | Docker | Nomad | Kubernetes |
|---|---|---|---|---|
| Security platform (current) | Safe | Safe | Safe | Safe |
| Generic workflow platform | Safe | Safe | ⚠️ Risky | Safe |
| Orchestration-as-a-service | Safe | Safe | Violation | Safe |

Key Takeaways

  1. Temporal is completely safe - MIT license has zero restrictions for any use case

  2. Nomad's BSL depends on positioning:

    • Safe for domain-specific platforms (security, fuzzing)
    • ⚠️ Risky for generic workflow platforms
    • Violation for orchestration-as-a-service
  3. The team's statement matters: "modules could be anything" suggests generic-platform potential → increases Nomad risk

  4. Mitigation strategies:

    • Keep marketing focused on domain expertise
    • Get legal review before Phase 3 (Nomad)
    • Alternative: Use Kubernetes (Apache 2.0) instead
  5. Decision timing: No urgency - Nomad decision is 18+ months away (Phase 3)

  6. Recommended approach:

    Now → Phase 1-2:    Temporal + Docker Compose (zero risk)
    18 months → Phase 3: Re-evaluate positioning
                         → Domain-specific? Use Nomad
                         → Generic platform? Use Kubernetes
    

Recommendation

Primary Recommendation: PROCEED WITH TEMPORAL MIGRATION

Confidence Level: High (8/10)

Rationale

  1. Technical Benefits Outweigh Costs

    • 83% infrastructure reduction
    • 44% resource savings
    • Clear scaling path
    • Better reliability
  2. Manageable Risks

    • Low-medium risk profile
    • Good mitigation strategies
    • Rollback plan exists
    • Module system preserved
  3. Right Timing

    • Before production scale makes migration harder
    • Team capacity available
    • Module architecture stable
    • Clear 8-week timeline
  4. Future-Proof

    • Easy Nomad migration when needed
    • Multi-host ready (storage abstraction)
    • Industry-proven technology
    • Growing ecosystem

Phased Approach

Immediate (Now):

  • Approve Temporal migration
  • Allocate 2 engineers for 8 weeks
  • Set Week 1 start date

Near-Term (Months 1-6):

  • Complete Temporal migration
  • Validate in production
  • Optimize performance

Mid-Term (Months 6-18):

  • Monitor scaling needs
  • Implement S3 storage if needed
  • Expand to multi-host if needed

Long-Term (Months 18+):

  • Evaluate Nomad necessity
  • Migrate to Nomad if triggers met
  • Continue scaling horizontally

Decision Criteria

Proceed with Migration if:

  • Team agrees on benefits (CHECK)
  • 8-week timeline acceptable (CHECK)
  • Resources available (CHECK)
  • Risk profile acceptable (CHECK)

Defer Migration if:

  • Critical features launching soon (DEPENDS)
  • Team capacity constrained (DEPENDS)
  • Major Prefect improvements announced (UNLIKELY)

Alternative: Start Smaller

If full migration seems risky:

  1. Proof of Concept (2 weeks)

    • Migrate one simple workflow
    • Validate Temporal locally
    • Assess complexity
    • Decision point: Continue or abort
  2. Parallel Run (4 weeks)

    • Run Temporal alongside Prefect
    • Duplicate one workflow
    • Compare results
    • Build confidence
  3. Full Migration (6 weeks)

    • If POC successful, proceed
    • Migrate remaining workflows
    • Decommission Prefect

Total: 12 weeks (vs 8 weeks direct)


Appendix: Quick Reference

One-Page Summary

WHAT: Migrate from Prefect to Temporal
WHY: Simpler (6 services → 1), more scalable, better reliability
WHEN: Now (8 weeks)
WHO: 2 engineers
COST: $430/month (vs $700 current) = 39% savings
RISK: Medium-Low (manageable)
OUTCOME: Production-ready infrastructure with clear scaling path

Key Metrics

| Metric | Current | Future | Change |
|---|---|---|---|
| Services | 6 | 1 | -83% |
| Memory | 8GB | 4.5GB | -44% |
| Cost | $700/mo | $430/mo | -39% |
| Capacity | 10K/day | 10K/day | Same (Phase 1) |
| Dev Time | 6h/week | 3.5h/week | -42% |

Decision Checklist

  • Review this document with team
  • Discuss concerns and questions
  • Vote: Proceed / Defer / Reject
  • If proceed: Assign engineers
  • If proceed: Set start date
  • If defer: Set review date (3 months)
  • If reject: Document reasons

Document Version: 1.0
Last Updated: 2025-09-30
Next Review: After decision or in 3 months