Add event-driven Cognee ingestion pipeline

@@ -10,9 +10,9 @@ fuzzforge ai server

Run the command from a project directory that already contains `.fuzzforge/`. The server reads the project configuration and reuses the same environment variables as the CLI shell.

**Default directories**
**Default directories / services**
- Logs: `.fuzzforge/logs/cognee.log`
- Cognee datasets: `.fuzzforge/cognee/project_<id>/{data,system}`
- Cognee datasets: hosted by the shared Cognee service (`COGNEE_SERVICE_URL`) inside the configured MinIO/S3 bucket. Local mode falls back to `.fuzzforge/cognee/project_<id>/{data,system}`. Uploads dropped into `s3://cognee/projects/<project-id>/...` are ingested automatically via RabbitMQ + the dispatcher.
- Artifact cache: `.fuzzforge/artifacts`
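If ingestion misbehaves, the per-project log in the list above is the first place to look; a minimal check from the project root:

```bash
# Follow the Cognee log for the current project.
tail -f .fuzzforge/logs/cognee.log
```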
## HTTP Endpoints

@@ -73,7 +73,8 @@ sequenceDiagram
- **Remote agent registry** (`ai/src/fuzzforge_ai/remote_agent.py`) holds metadata for downstream agents and handles capability discovery over HTTP. Auto-registration is configured by `ConfigManager` so known agents attach on startup.
- **Memory services**:
- `FuzzForgeMemoryService` and `HybridMemoryManager` (`ai/src/fuzzforge_ai/memory_service.py`) provide conversation recall and bridge to Cognee datasets when configured.
- Cognee bootstrap (`ai/src/fuzzforge_ai/cognee_service.py`) ensures ingestion and knowledge queries stay scoped to the current project.
- Cognee bootstrap (`ai/src/fuzzforge_ai/cognee_service.py`) ensures ingestion and knowledge queries stay scoped to the current project and forwards them to the shared Cognee service (`COGNEE_SERVICE_URL`). Datasets live inside the configured MinIO/S3 bucket, with `.fuzzforge/cognee/` available only when `COGNEE_STORAGE_BACKEND=local`.
- MinIO bucket notifications push object-created events into RabbitMQ. The `ingestion-dispatcher` container listens on `cognee-ingest`, downloads the object, and invokes Cognee’s REST API on behalf of the project’s tenant so uploads become datasets without a manual CLI hop.

## Workflow Automation

@@ -91,7 +92,7 @@ The CLI surface mirrors these helpers as natural-language prompts (`You> run fuz

## Knowledge & Ingestion

- The `fuzzforge ingest` and `fuzzforge rag ingest` commands call into `ai/src/fuzzforge_ai/ingest_utils.py`, which filters file types, ignores caches, and populates Cognee datasets under `.fuzzforge/cognee/project_<id>/`.
- The `fuzzforge ingest` and `fuzzforge rag ingest` commands call into `ai/src/fuzzforge_ai/ingest_utils.py`, which filters file types, ignores caches, and streams files to the Cognee service where they are stored under `s3://<bucket>/<prefix>/project_<id>/`. When files land directly in `s3://cognee/projects/<project-id>/<category>/...`, the dispatcher performs the same workflow automatically via RabbitMQ events.
- Runtime queries hit `query_project_knowledge_api` on the executor, which defers to `cognee_service` for dataset lookup and semantic search. When Cognee credentials are absent the tools return a friendly "not configured" response.
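For example, either of the ingestion commands can be invoked from the project root (no flags shown; options vary by CLI version):

```bash
# Either command feeds the same ingestion pipeline (ingest_utils.py) for the current project.
fuzzforge ingest
fuzzforge rag ingest
```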
## Artifact Pipeline

@@ -140,7 +141,7 @@ graph LR
- **Session persistence** is controlled by `SESSION_PERSISTENCE`. When set to `sqlite`, ADK’s `DatabaseSessionService` writes transcripts to the path configured by `SESSION_DB_PATH` (defaults to `./fuzzforge_sessions.db`). With `inmemory`, the context is scoped to the current process (see the env sketch after this list).
- **Semantic recall** stores vector embeddings so `/recall` queries can surface earlier prompts, even after restarts when using SQLite.
- **Hybrid memory manager** (`HybridMemoryManager`) stitches Cognee results into the ADK session. When a knowledge query hits Cognee, the relevant nodes are appended back into the session context so follow-up prompts can reference them naturally.
- **Cognee datasets** are unique per project. Ingestion runs populate `<project>_codebase` while custom calls to `ingest_to_dataset` let you maintain dedicated buckets (e.g., `insights`). Data is persisted inside `.fuzzforge/cognee/project_<id>/` and shared across CLI and A2A modes.
- **Cognee datasets** are unique per project. Ingestion runs populate `<project>_codebase` while custom calls to `ingest_to_dataset` let you maintain dedicated buckets (e.g., `insights`). Data is persisted inside the Cognee service’s bucket/prefix and is shared across CLI, HTTP server, and MCP integrations.
- **Task metadata** (workflow runs, artifact descriptors) lives in the executor’s in-memory caches but is also mirrored through A2A task events so remote agents can resubscribe if the CLI restarts.
- **Operational check**: Run `/recall <keyword>` or `You> search project knowledge for "topic" using INSIGHTS` after ingestion to confirm both ADK session recall and Cognee graph access are active.
- **CLI quick check**: `/memory status` summarises the current memory type, session persistence, and Cognee dataset directories from inside the agent shell.
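A minimal `.env` sketch wiring up SQLite-backed sessions (the `SESSION_DB_PATH` value is the documented default; switch to `inmemory` to keep context process-local):

```env
SESSION_PERSISTENCE=sqlite
SESSION_DB_PATH=./fuzzforge_sessions.db
```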
@@ -81,6 +81,33 @@ LLM_COGNEE_API_KEY=sk-your-key

If the Cognee variables are omitted, graph-specific tools remain available but return a friendly "not configured" response.

### Cognee Storage Backend

Cognee defaults to local storage under `.fuzzforge/cognee/`, but you can mirror datasets to MinIO/S3 for multi-tenant or containerised deployments:

```env
COGNEE_STORAGE_BACKEND=s3
COGNEE_S3_BUCKET=cognee
COGNEE_S3_PREFIX=project_${PROJECT_ID}
COGNEE_S3_ENDPOINT=http://localhost:9000
COGNEE_S3_REGION=us-east-1
COGNEE_S3_ACCESS_KEY=fuzzforge
COGNEE_S3_SECRET_KEY=fuzzforge123
COGNEE_S3_ALLOW_HTTP=1
```

Set the values to match your MinIO/S3 endpoint; the docker compose stack seeds a `cognee` bucket automatically. When S3 mode is active, ingestion and search work exactly the same but Cognee writes metadata to `s3://<bucket>/<prefix>/project_<id>/{data,system}`.
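To confirm objects are landing in the bucket you can list it with any S3 client; a sketch using the AWS CLI with the MinIO defaults above (adjust credentials and endpoint for your deployment):

```bash
export AWS_ACCESS_KEY_ID=fuzzforge
export AWS_SECRET_ACCESS_KEY=fuzzforge123

# List everything the Cognee service has written to the bucket.
aws s3 ls s3://cognee/ --recursive --endpoint-url http://localhost:9000
```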
### Cognee Service URL

The CLI and workers talk to Cognee over HTTP. Point `COGNEE_SERVICE_URL` at the service (defaults to `http://localhost:18000` when you run `docker/docker-compose.cognee.yml`) and provide `COGNEE_API_KEY` if you protect the API behind LiteLLM.

Every project gets its own Cognee login so datasets stay isolated. The CLI auto-derives an email/password pair (e.g., `project_<id>@fuzzforge.dev`) and registers it the first time you run `fuzzforge ingest`. Override those defaults by setting `COGNEE_SERVICE_EMAIL` / `COGNEE_SERVICE_PASSWORD` in `.fuzzforge/.env` before running ingestion if you need to reuse an existing account.
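For example, to reuse an existing account, add something like the following to `.fuzzforge/.env` before the first ingest (the values here are placeholders):

```env
COGNEE_SERVICE_EMAIL=existing-user@example.com
COGNEE_SERVICE_PASSWORD=change-me
```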
### MinIO Event Mapping

The ingestion dispatcher converts S3 prefixes to datasets using `DATASET_CATEGORY_MAP` (default `files:codebase,findings:findings,docs:docs`). Adjust it in `docker-compose.yml` if you want to add more categories or rename datasets.
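For example, to route a hypothetical `reports/` prefix into its own dataset, extend the variable in the `ingestion-dispatcher` service environment:

```env
DATASET_CATEGORY_MAP=files:codebase,findings:findings,docs:docs,reports:reports
```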
## MCP / Backend Integration

```env

@@ -38,12 +38,13 @@ All runs automatically skip `.fuzzforge/**` and `.git/**` to avoid recursive ing
- Primary dataset: `<project>_codebase`
- Additional datasets: create ad-hoc buckets such as `insights` via the `ingest_to_dataset` tool
- Storage location: `.fuzzforge/cognee/project_<id>/`
- Storage location (service default): `s3://<bucket>/<prefix>/project_<id>/{data,system}` as defined by the Cognee service (the docker compose stack seeds a `cognee` bucket automatically).
- Local mode (opt-in): set `COGNEE_STORAGE_BACKEND=local` to fall back to `.fuzzforge/cognee/project_<id>/` when developing without MinIO.

### Persistence Details

- Every dataset lives under `.fuzzforge/cognee/project_<id>/{data,system}`. These directories are safe to commit to long-lived storage (they only contain embeddings and metadata).
- Cognee assigns deterministic IDs per project; if you move the repository, copy the entire `.fuzzforge/cognee/` tree to retain graph history.
- The Cognee service keeps datasets inside the configured bucket/prefix (`s3://<bucket>/<prefix>/project_<id>/{data,system}`) so every project has its own Ladybug + LanceDB pair. Local mode mirrors the same layout under `.fuzzforge/cognee/project_<id>/`.
- Cognee assigns deterministic IDs per project; copy the entire prefix (local or S3) if you migrate repositories to retain graph history.
- `HybridMemoryManager` ensures answers from Cognee are written back into the ADK session store so future prompts can refer to the same nodes without repeating the query.
- All Cognee processing runs locally against the files you ingest. No external service calls are made unless you configure a remote Cognee endpoint.

@@ -77,10 +78,40 @@ FUZZFORGE_MCP_URL=http://localhost:8010/mcp
LLM_COGNEE_PROVIDER=openai
LLM_COGNEE_MODEL=gpt-5-mini
LLM_COGNEE_API_KEY=sk-your-key
COGNEE_SERVICE_URL=http://localhost:18000
COGNEE_API_KEY=
```

The CLI auto-registers a dedicated Cognee account per project the first time you ingest (email pattern `project_<id>@cognee.local`). Set `COGNEE_SERVICE_EMAIL` / `COGNEE_SERVICE_PASSWORD` in `.fuzzforge/.env` if you prefer to reuse an existing login.

Switch the knowledge graph storage to S3/MinIO by adding:

```env
COGNEE_STORAGE_BACKEND=s3
COGNEE_S3_BUCKET=cognee
COGNEE_S3_PREFIX=project_${PROJECT_ID}
COGNEE_S3_ENDPOINT=http://localhost:9000
COGNEE_S3_ACCESS_KEY=fuzzforge
COGNEE_S3_SECRET_KEY=fuzzforge123
COGNEE_S3_ALLOW_HTTP=1
```
The default `docker-compose` stack already seeds a `cognee` bucket inside MinIO, so these values work out of the box. Point `COGNEE_SERVICE_URL` at the Cognee container (included in `docker/docker-compose.cognee.yml`) so `fuzzforge ingest` sends all requests to the shared service instead of importing Cognee locally.

Add comments or project-specific overrides as needed; the agent reads these variables on startup.
## Event-Driven Ingestion

Uploading files directly into MinIO triggers Cognee automatically. The dispatcher watches `s3://cognee/projects/<project-id>/...` and translates the top-level folder into a dataset:

| Prefix | Dataset name |
|-----------|---------------------------------------|
| `files/` | `<project-id>_codebase` |
| `findings/` | `<project-id>_findings` |
| `docs/` | `<project-id>_docs` |

Under the hood MinIO publishes a `PUT` event → RabbitMQ (`cognee-ingest` exchange) → the `ingestion-dispatcher` container downloads the object and calls `/api/v1/add` + `/api/v1/cognify` using the deterministic project credentials (`project_<id>@fuzzforge.dev`). That means rsync, `aws s3 cp`, GitHub Actions, or any other tool that writes to the bucket can seed Cognee without touching the CLI.
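A sketch of that manual-upload path with the AWS CLI and the compose MinIO defaults (replace `<project-id>` with your project ID; any S3-capable tool works the same way):

```bash
export AWS_ACCESS_KEY_ID=fuzzforge
export AWS_SECRET_ACCESS_KEY=fuzzforge123

# The object lands under findings/, so the dispatcher ingests it into <project-id>_findings.
aws s3 cp report.md \
  s3://cognee/projects/<project-id>/findings/report.md \
  --endpoint-url http://localhost:9000
```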
## Tips

- Re-run ingestion after significant code changes to keep the knowledge graph fresh.

@@ -231,6 +231,20 @@ nano volumes/env/.env

See [Getting Started](../tutorial/getting-started.md) for detailed environment setup.

### Cognee Service Stack

Cognee now runs as its own container so every project shares the same multi-tenant backend (Ladybug + LanceDB sitting on MinIO). After the core stack is running, bring the service online with:

```bash
docker compose -f docker/docker-compose.cognee.yml up -d
```

This spins up the Cognee API on `http://localhost:18000`, publishes it to the host, and stores knowledge graphs in the `cognee` bucket that the main compose file seeds. Point the CLI at it by setting `COGNEE_SERVICE_URL=http://localhost:18000` (already included in `.env.template`).
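A quick way to confirm the service is reachable once the container is up (the `curl` probe only checks that the published port answers; it does not assume a specific route):

```bash
docker compose -f docker/docker-compose.cognee.yml ps

# Expect an HTTP status code here rather than a connection error.
curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:18000
```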
### RabbitMQ + Dispatcher

`docker-compose.yml` also launches RabbitMQ (`http://localhost:15672`, ingest/ingest) and the `ingestion-dispatcher` container. MinIO publishes `PUT` events from `s3://cognee/projects/<project-id>/...` to the `cognee-ingest` exchange, and the dispatcher downloads the object and calls Cognee’s REST API. That means any rsync/upload into the projects bucket automatically becomes a dataset.
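To sanity-check the event path, confirm the dispatcher container is running and that the `cognee-ingest` exchange exists via RabbitMQ's management API (using the `ingest`/`ingest` credentials noted above):

```bash
docker compose ps ingestion-dispatcher

# The cognee-ingest exchange should appear in the list once the stack is up.
curl -s -u ingest:ingest http://localhost:15672/api/exchanges
```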
---

## Troubleshooting