Claude Code CLI blocks --dangerously-skip-permissions when running
as uid=0 for security. Add a 'runner' user to the Docker image and
set --user runner on the container.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Docker image layers and workspace are on different filesystems,
so cp -al (hardlink) fails. Use ln -s (symlink) instead — zero
copy overhead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bun.lock is in .gitignore so it doesn't exist after checkout.
Dockerfile and workflows now use package.json only for deps caching.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This project uses bun.lock (text format), not bun.lockb (binary).
Also move Docker login before manifest inspect so GHCR auth works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move Docker image build into the evals workflow as a dependency job.
Image tag is keyed on hash of Dockerfile+lockfile+package.json — only
rebuilds when those change. Eliminates chicken-and-egg problem where
the image must exist before the first PR run.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch eval workflow to use Docker container image with pre-baked
toolchain. Each of 12 matrix runners pulls the image, hardlinks
cached node_modules, builds browse, and runs one test suite.
Setup drops from ~70s to ~19s per runner. Wall clock is dominated
by the slowest individual test, not sequential sum.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Matrix strategy spins up 12 ubicloud-standard-2 runners simultaneously,
one per test file. Separate report job aggregates all artifacts into a
single PR comment. Bun dependency cache cuts install from ~30s to ~3s.
Runner cost: ~$0.048 (from $0.024) — negligible vs $3-4 API costs.
Wall clock: ~3-4min (from ~8min).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-job GitHub Actions workflow that runs E2E evals on every PR using
Ubicloud runners ($0.006/run — 10x cheaper than GitHub standard). Uses
EVALS_CONCURRENCY=40 with the new within-file concurrency for ~6min
wall clock. Downloads previous eval artifact from main for comparison,
uploads results, and posts a PR comment with pass/fail + cost.
Ubicloud setup required: connect GitHub repo via ubicloud.com dashboard,
add ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY as repo secrets.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>