Store Cognee datasets in the projects bucket

This commit is contained in:
Songbird
2025-11-07 17:29:37 +01:00
parent 3ef6bf2437
commit 44d61ad4bf
9 changed files with 23 additions and 18 deletions

View File

@@ -10,7 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### ✨ Enhancements
- Added Ladybug-backed Cognee integration with optional MinIO/S3 storage. Projects can now set `COGNEE_STORAGE_BACKEND=s3` (plus the `COGNEE_S3_*` settings) to keep knowledge graphs in the shared MinIO bucket seeded by `docker-compose`, enabling multi-tenant ingestion across workers and containers.
- Introduced a dedicated Cognee service (`docker/docker-compose.cognee.yml`) and HTTP client so `fuzzforge ingest` streams data to the shared backend (`COGNEE_SERVICE_URL`) instead of importing Cognee locally. Each project now auto-provisions its own Cognee account/tenant and authenticates via the REST API, keeping datasets isolated even though the service is shared.
- Added an event-driven ingestion pipeline: MinIO publishes `PUT` events from `s3://cognee/projects/<project-id>/...` to RabbitMQ, and the new `ingestion-dispatcher` container downloads the file, logs into Cognee as that project's tenant, and invokes `/api/v1/add` + `/api/v1/cognify`. Uploading files (rsync, CI, etc.) now keeps datasets fresh without touching the CLI.
- Added an event-driven ingestion pipeline: MinIO publishes `PUT` events from `s3://projects/<project-id>/...` to RabbitMQ, and the new `ingestion-dispatcher` container downloads the file, logs into Cognee as that project's tenant, and invokes `/api/v1/add` + `/api/v1/cognify`. Uploading files (rsync, CI, etc.) now keeps datasets fresh without touching the CLI.
### 📝 Documentation
- Added comprehensive worker startup documentation across all guides

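The last enhancement above is easiest to see as code. Below is a minimal, non-authoritative sketch of the dispatcher hand-off in Python with `requests`: only the `/api/v1/add` and `/api/v1/cognify` paths (and the `http://localhost:18000` service URL used elsewhere in this diff) come from the source; the login route, form and payload field names, and the dataset name are assumptions.

```python
import requests

COGNEE_URL = "http://localhost:18000"  # service URL from the docs below; adjust per deployment

def ingest_object(local_path: str, project_id: str, email: str, password: str) -> None:
    """Sketch of what the dispatcher does once it has downloaded an object."""
    session = requests.Session()

    # Log in as the per-project tenant (route and field names are assumptions).
    session.post(f"{COGNEE_URL}/api/v1/auth/login",
                 data={"username": email, "password": password}).raise_for_status()

    # Add the file to a project-scoped dataset (dataset naming is an assumption here).
    dataset = f"{project_id}_codebase"
    with open(local_path, "rb") as fh:
        session.post(f"{COGNEE_URL}/api/v1/add",
                     files={"data": fh},
                     data={"datasetName": dataset}).raise_for_status()

    # Kick off graph building for that dataset.
    session.post(f"{COGNEE_URL}/api/v1/cognify",
                 json={"datasets": [dataset]}).raise_for_status()
```
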
View File

@@ -197,7 +197,7 @@ ff workflow run security_assessment . # Start workflow - CLI uploads files au
Uploading files into MinIO automatically streams them into Cognee:
```
s3://cognee/projects/<project-id>/
s3://projects/<project-id>/
  files/... # → <project-id>_codebase dataset
  findings/... # → <project-id>_findings dataset
  docs/... # → <project-id>_docs dataset

View File

@@ -138,7 +138,7 @@ class CogneeConfig(BaseModel):
    service_password: Optional[str] = None
    storage_backend: Literal["local", "s3"] = "s3"
    s3_bucket: Optional[str] = None
    s3_prefix: Optional[str] = "projects"
    s3_prefix: Optional[str] = None
    s3_endpoint_url: Optional[str] = None
    s3_region: Optional[str] = None
    s3_access_key: Optional[str] = None
@@ -217,8 +217,12 @@ class FuzzForgeConfig(BaseModel):
            cognee.service_url = "http://localhost:18000"
            changed = True
        if not cognee.s3_prefix:
            cognee.s3_prefix = "projects"
        if not cognee.s3_bucket:
            cognee.s3_bucket = "projects"
            changed = True
        if cognee.s3_prefix is None:
            cognee.s3_prefix = ""
            changed = True
        default_email = f"project_{self.project.id}@fuzzforge.dev"
@@ -234,9 +238,13 @@ class FuzzForgeConfig(BaseModel):
            changed = True
        if cognee.storage_backend.lower() == "s3":
            bucket = cognee.s3_bucket or "cognee"
            prefix = (cognee.s3_prefix or "projects").strip("/")
            base_uri = f"s3://{bucket}/{prefix}/{self.project.id}"
            bucket = cognee.s3_bucket or "projects"
            prefix = (cognee.s3_prefix or "").strip("/")
            path_parts = [f"s3://{bucket}"]
            if prefix:
                path_parts.append(prefix)
            path_parts.append(self.project.id)
            base_uri = "/".join(path_parts)
            data_dir = f"{base_uri}/files"
            system_dir = f"{base_uri}/graph"
        else:

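For reference, the new path building in the hunk above reduces to a small helper. With the new defaults (bucket `projects`, empty prefix) it yields `s3://projects/<project-id>/files` and `s3://projects/<project-id>/graph`; the old defaults produced `s3://cognee/projects/<project-id>/...`. A standalone sketch, not the actual config class:

```python
def cognee_dirs(project_id: str, bucket: str = "projects", prefix: str = "") -> tuple[str, str]:
    """Standalone copy of the path building shown in the hunk above."""
    parts = [f"s3://{bucket}"]
    if prefix.strip("/"):
        parts.append(prefix.strip("/"))
    parts.append(project_id)
    base_uri = "/".join(parts)
    return f"{base_uri}/files", f"{base_uri}/graph"

print(cognee_dirs("acme-api"))
# -> ('s3://projects/acme-api/files', 's3://projects/acme-api/graph')
print(cognee_dirs("acme-api", bucket="cognee", prefix="projects"))
# -> ('s3://cognee/projects/acme-api/files', 's3://cognee/projects/acme-api/graph')
```
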
View File

@@ -12,10 +12,7 @@ services:
GRAPH_DATABASE_PROVIDER: ladybug
ENABLE_BACKEND_ACCESS_CONTROL: "true"
STORAGE_BACKEND: s3
STORAGE_BUCKET_NAME: ${COGNEE_S3_BUCKET:-cognee}
DATA_ROOT_DIRECTORY: s3://${COGNEE_S3_BUCKET:-cognee}/${COGNEE_S3_PREFIX:-projects}
SYSTEM_ROOT_DIRECTORY: s3://${COGNEE_S3_BUCKET:-cognee}/${COGNEE_S3_PREFIX:-projects}
CACHE_ROOT_DIRECTORY: s3://${COGNEE_S3_BUCKET:-cognee}/${COGNEE_S3_PREFIX:-projects}/cache
STORAGE_BUCKET_NAME: ${COGNEE_S3_BUCKET:-projects}
DB_PROVIDER: sqlite
DB_PATH: /data/relational
DB_NAME: cognee.db

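The compose file now assumes a single `projects` bucket exists in MinIO (the changelog above says docker-compose seeds it). If you need to verify or seed it by hand, a hedged `boto3` sketch would look roughly like this, reusing the endpoint and access key from the `.env` excerpt at the end of this diff; the secret-key variable name is an assumption:

```python
import os
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("COGNEE_S3_ENDPOINT", "http://localhost:9000"),
    aws_access_key_id=os.environ.get("COGNEE_S3_ACCESS_KEY", "fuzzforge"),
    aws_secret_access_key=os.environ["COGNEE_S3_SECRET_KEY"],  # variable name assumed
    region_name=os.environ.get("COGNEE_S3_REGION", "us-east-1"),
)

bucket = os.environ.get("COGNEE_S3_BUCKET", "projects")
try:
    s3.head_bucket(Bucket=bucket)    # succeeds if the bucket is already there
except ClientError:
    s3.create_bucket(Bucket=bucket)  # otherwise create it
```
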
View File

@@ -12,7 +12,7 @@ Run the command from a project directory that already contains `.fuzzforge/`. Th
**Default directories / services**
- Logs: `.fuzzforge/logs/cognee.log`
- Cognee datasets: hosted by the shared Cognee service (`COGNEE_SERVICE_URL`) inside the configured MinIO/S3 bucket. Local mode falls back to `.fuzzforge/cognee/project_<id>/{data,system}`. Uploads dropped into `s3://cognee/projects/<project-id>/...` are ingested automatically via RabbitMQ + the dispatcher.
- Cognee datasets: hosted by the shared Cognee service (`COGNEE_SERVICE_URL`) inside the configured MinIO/S3 bucket. Local mode falls back to `.fuzzforge/cognee/project_<id>/{data,system}`. Uploads dropped into `s3://projects/<project-id>/...` are ingested automatically via RabbitMQ + the dispatcher.
- Artifact cache: `.fuzzforge/artifacts`
## HTTP Endpoints

View File

@@ -92,7 +92,7 @@ The CLI surface mirrors these helpers as natural-language prompts (`You> run fuz
## Knowledge & Ingestion
- The `fuzzforge ingest` and `fuzzforge rag ingest` commands call into `ai/src/fuzzforge_ai/ingest_utils.py`, which filters file types, ignores caches, and streams files to the Cognee service where they are stored under `s3://<bucket>/<prefix>/project_<id>/`. When files land directly in `s3://cognee/projects/<project-id>/<category>/...`, the dispatcher performs the same workflow automatically via RabbitMQ events.
- The `fuzzforge ingest` and `fuzzforge rag ingest` commands call into `ai/src/fuzzforge_ai/ingest_utils.py`, which filters file types, ignores caches, and streams files to the Cognee service where they are stored under `s3://projects/<project-id>/`. When files land directly in `s3://projects/<project-id>/<category>/...`, the dispatcher performs the same workflow automatically via RabbitMQ events.
- Runtime queries hit `query_project_knowledge_api` on the executor, which defers to `cognee_service` for dataset lookup and semantic search. When Cognee credentials are absent the tools return a friendly "not configured" response.
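As a rough illustration of the filtering step described in the ingest bullet above (the real logic lives in `ai/src/fuzzforge_ai/ingest_utils.py`; the skip list and extension list here are assumptions, not the actual values):

```python
from pathlib import Path
from typing import Iterator

SKIP_DIRS = {".git", "__pycache__", "node_modules", ".fuzzforge"}       # assumed
KEEP_SUFFIXES = {".py", ".md", ".rs", ".go", ".js", ".ts", ".c", ".h"}  # assumed

def iter_ingestable(root: Path) -> Iterator[Path]:
    """Yield files worth streaming to the Cognee service, skipping caches."""
    for path in root.rglob("*"):
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if path.is_file() and path.suffix in KEEP_SUFFIXES:
            yield path
```
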
## Artifact Pipeline

View File

@@ -102,7 +102,7 @@ Add comments or project-specific overrides as needed; the agent reads these vari
## Event-Driven Ingestion
Uploading files directly into MinIO triggers Cognee automatically. The dispatcher watches `s3://cognee/projects/<project-id>/...` and translates the top-level folder into a dataset:
Uploading files directly into MinIO triggers Cognee automatically. The dispatcher watches `s3://projects/<project-id>/...` and translates the top-level folder into a dataset:
| Prefix | Dataset name |
|-----------|---------------------------------------|

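The table rows fall outside this hunk, but combined with the bucket layout shown earlier in the diff (`files/`, `findings/`, `docs/`), the translation the dispatcher performs is roughly the following sketch; the helper name and `None` fallback are illustrative:

```python
def dataset_for_object(key: str) -> str | None:
    """Map an object key like '<project-id>/files/src/main.py' (inside the
    `projects` bucket) to a dataset name, per the layout shown earlier."""
    parts = key.strip("/").split("/")
    if len(parts) < 3:
        return None  # not under <project-id>/<category>/...
    project_id, category = parts[0], parts[1]
    suffix = {"files": "codebase", "findings": "findings", "docs": "docs"}.get(category)
    return f"{project_id}_{suffix}" if suffix else None

assert dataset_for_object("acme-api/files/src/main.py") == "acme-api_codebase"
```
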
View File

@@ -243,7 +243,7 @@ This spins up the Cognee API on `http://localhost:18000`, publishes it to the ho
### RabbitMQ + Dispatcher
`docker-compose.yml` also launches RabbitMQ (`http://localhost:15672`, ingest/ingest) and the `ingestion-dispatcher` container. MinIO publishes `PUT` events from `s3://cognee/projects/<project-id>/...` to the `cognee-ingest` exchange, and the dispatcher downloads the object and calls Cognee's REST API. That means any rsync/upload into the projects bucket automatically becomes a dataset.
`docker-compose.yml` also launches RabbitMQ (`http://localhost:15672`, ingest/ingest) and the `ingestion-dispatcher` container. MinIO publishes `PUT` events from `s3://projects/<project-id>/...` to the `cognee-ingest` exchange, and the dispatcher downloads the object and calls Cognee's REST API. That means any rsync/upload into the projects bucket automatically becomes a dataset.
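A hedged sketch of the consumer side of that flow with `pika`: the `cognee-ingest` exchange and the `ingest`/`ingest` credentials come from the paragraph above, while the queue, routing key, and the assumption that MinIO emits standard S3-style event JSON are not from this commit.

```python
import json
import urllib.parse

import pika

def on_event(channel, method, properties, body) -> None:
    # MinIO bucket notifications follow the S3 event layout: Records[].s3.{bucket,object}.
    for record in json.loads(body).get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"would fetch s3://{bucket}/{key}, then call /api/v1/add and /api/v1/cognify")

connection = pika.BlockingConnection(pika.ConnectionParameters(
    host="localhost", credentials=pika.PlainCredentials("ingest", "ingest")))
channel = connection.channel()
queue = channel.queue_declare(queue="", exclusive=True).method.queue  # transient queue
channel.queue_bind(exchange="cognee-ingest", queue=queue, routing_key="#")
channel.basic_consume(queue=queue, on_message_callback=on_event, auto_ack=True)
channel.start_consuming()
```
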
---

View File

@@ -69,8 +69,8 @@ COGNEE_MCP_URL=
COGNEE_SERVICE_URL=http://localhost:18000
COGNEE_API_KEY=
COGNEE_STORAGE_BACKEND=s3
COGNEE_S3_BUCKET=cognee
COGNEE_S3_PREFIX=projects
COGNEE_S3_BUCKET=projects
COGNEE_S3_PREFIX=
COGNEE_S3_ENDPOINT=http://localhost:9000
COGNEE_S3_REGION=us-east-1
COGNEE_S3_ACCESS_KEY=fuzzforge