Commit Graph

12 Commits

Author SHA1 Message Date
Garry Tan 7b99b8c0eb fix: run eval container as non-root — claude CLI rejects --dangerously-skip-permissions as root
For security, the Claude Code CLI rejects --dangerously-skip-permissions
when running as uid=0. Add a 'runner' user to the Docker image and
set --user runner on the container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 07:18:30 -07:00
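
The fix in 7b99b8c0eb adds a 'runner' user and passes --user runner to the container. As a complementary sketch (not taken from the commit), a container entrypoint could also fail fast when it is started as root. The function name `require_non_root` and the error text are hypothetical.

```shell
# Hypothetical guard: succeed only when the given uid is not root (0).
# With no argument it checks the current user's uid via `id -u`.
require_non_root() {
  uid="${1:-$(id -u)}"
  if [ "$uid" -eq 0 ]; then
    echo "refusing to run as uid=0; start the container with a non-root user" >&2
    return 1
  fi
  return 0
}

# Example use at the top of an entrypoint script:
#   require_non_root || exit 1
```

A guard like this would surface the problem before the CLI is invoked, rather than as an opaque exit code 1 from a later step.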
Garry Tan 4d3f8b4cb5 debug: more verbose claude CLI diagnostics 2026-03-23 07:15:24 -07:00
Garry Tan 50681f9fd9 ci: add workflow_dispatch trigger for manual runs 2026-03-23 07:11:48 -07:00
Garry Tan 57ca0107bc debug: add claude CLI smoke test step to diagnose exit_code_1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 07:05:55 -07:00
Garry Tan fe574f1cd8 fix: symlink node_modules instead of hardlink (cross-device)
Docker image layers and workspace are on different filesystems,
so cp -al (hardlink) fails. Use ln -s (symlink) instead — zero
copy overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 06:56:32 -07:00
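
The hardlink-to-symlink change in fe574f1cd8 can be sketched as a small helper. This is a minimal illustration, assuming the cached node_modules lives at an absolute path inside the image. The function name `link_node_modules` and the example paths are hypothetical and do not come from the repository.

```shell
# Hypothetical helper: expose a cached dependency directory inside a workspace.
# src should be an absolute path; a relative link target would be resolved
# against the directory that contains dst.
link_node_modules() {
  src="$1"
  dst="$2"
  if [ ! -d "$src" ]; then
    echo "source directory not found: $src" >&2
    return 1
  fi
  if [ -e "$dst" ] || [ -L "$dst" ]; then
    echo "destination already exists: $dst" >&2
    return 1
  fi
  # A symlink can point across filesystem boundaries; a hardlink
  # (as created by cp -al) cannot, which is the failure described above.
  ln -s "$src" "$dst"
}

# Example (paths are assumptions):
#   link_node_modules /opt/cache/node_modules "$PWD/node_modules"
```

The helper refuses to overwrite an existing destination, so a stale directory from a previous step shows up as an error instead of a nested link.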
Garry Tan ce31cb1853 fix: bun.lock is gitignored — use package.json only for Docker cache
bun.lock is in .gitignore so it doesn't exist after checkout.
Dockerfile and workflows now use package.json only for deps caching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 06:52:27 -07:00
Garry Tan c44d8d486b fix: bun.lockb → bun.lock + auth before manifest check
This project uses bun.lock (text format), not bun.lockb (binary).
Also move Docker login before manifest inspect so GHCR auth works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 06:48:25 -07:00
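
The ordering fix in c44d8d486b (authenticate first, then inspect the manifest) might look roughly like the sketch below. The helper name `image_exists` is hypothetical, and the environment variable names in the commented usage are assumptions about how credentials reach the job. `docker login --password-stdin` and `docker manifest inspect` are standard Docker CLI commands.

```shell
# Hypothetical helper: returns 0 when the tag is already in the registry.
# Any failure, including an auth error, is reported as "not found", which is
# why the login has to happen before this probe.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# Example ordering inside a CI step (variable names are assumptions):
#   echo "$GITHUB_TOKEN" | docker login ghcr.io -u "$GITHUB_ACTOR" --password-stdin
#   if image_exists "$IMAGE"; then
#     echo "image already published, skipping build"
#   fi
```

Without the login step, a private GHCR image would look missing on every run and the cache would never be hit.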
Garry Tan f1697f6678 chore: self-bootstrapping CI — build Docker image inline, cache by content hash
Move the Docker image build into the evals workflow as a dependency job.
The image tag is keyed on a hash of Dockerfile+lockfile+package.json, so the
image only rebuilds when those change. Eliminates the chicken-and-egg problem
where the image must exist before the first PR run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 06:45:24 -07:00
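
The content-hash tagging described in f1697f6678 can be illustrated with a short sketch. It assumes `sha256sum` from GNU coreutils is available on the runner. The function name `image_tag` and the 12-character truncation are hypothetical choices, and the log does not say how the workflow actually computes the hash.

```shell
# Hypothetical helper: derive a short image tag from the contents of the
# given files. Fails if any input is missing instead of silently hashing
# a partial set of files.
image_tag() {
  if [ "$#" -eq 0 ]; then
    echo "usage: image_tag FILE..." >&2
    return 1
  fi
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing input file: $f" >&2
      return 1
    fi
  done
  # Hash the concatenated contents in argument order; keep 12 hex characters.
  cat "$@" | sha256sum | cut -c1-12
}

# Example (file list is an assumption):
#   tag=$(image_tag Dockerfile package.json) || exit 1
#   echo "ghcr.io/OWNER/IMAGE:$tag"
```

The explicit missing-file check is a design choice for this sketch: an input that is gitignored and absent after checkout then fails loudly instead of producing a tag that never matches.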
Garry Tan a5c70977f1 feat: parallelize CI evals — 12 runners (1 per suite) for ~3min wall clock
Switch eval workflow to use Docker container image with pre-baked
toolchain. Each of 12 matrix runners pulls the image, hardlinks
cached node_modules, builds browse, and runs one test suite.
Setup drops from ~70s to ~19s per runner. Wall clock is dominated
by the slowest individual test, not sequential sum.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 05:37:59 -07:00
Garry Tan 1d7e79f7c3 feat: parallelize CI evals — 12 runners (1 per suite) for ~3min wall clock
Matrix strategy spins up 12 ubicloud-standard-2 runners simultaneously,
one per test file. Separate report job aggregates all artifacts into a
single PR comment. Bun dependency cache cuts install from ~30s to ~3s.

Runner cost: ~$0.048 (from $0.024) — negligible vs $3-4 API costs.
Wall clock: ~3-4min (from ~8min).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 05:25:13 -07:00
Garry Tan 9195b671e8 chore: optimize CI eval PR comment — aggregate all suites, update-not-duplicate
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 05:21:43 -07:00
Garry Tan e589a95116 feat: add CI eval workflow on Ubicloud runners
Single-job GitHub Actions workflow that runs E2E evals on every PR using
Ubicloud runners ($0.006/run — 10x cheaper than GitHub standard). Uses
EVALS_CONCURRENCY=40 with the new within-file concurrency for ~6min
wall clock. Downloads previous eval artifact from main for comparison,
uploads results, and posts a PR comment with pass/fail + cost.

Ubicloud setup required: connect GitHub repo via ubicloud.com dashboard,
add ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY as repo secrets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 23:06:13 -07:00