feat(windows): curated windows-free-tests CI job + test-free-shards curation

Codex's v1.18.0.0 review flagged that a windows-latest matrix entry on the
existing Linux-container evals.yml workflow can't work as a drop-in, and that
the free test suite has POSIX-bound dependencies a sharded runner doesn't fix
on its own. This commit takes McGluut's test-free-shards.ts (190 LOC), adds a
Windows-fragility scan, and runs the curated subset on a separate non-container
windows-latest job.

scripts/test-free-shards.ts:
- Enumeration + paid-eval filtering + stable-hash sharding (FNV-1a). Adapted
  from McGluut/gstack fork.
- Upstream-original: --windows-only filter scans each test's content for
  POSIX-bound patterns: hardcoded /bin/sh, spawn('sh', ...), bash -c, raw
  /tmp/, chmod, xargs, which claude. Files matching are excluded with the
  reason logged. Currently filters 25 of 128 free tests; remaining 103 run
  on windows-latest.

.github/workflows/windows-free-tests.yml:
- Separate non-container job (NOT a matrix entry on evals.yml). Runs:
    bun run test:windows                       # curated subset
    bun test browse/test/claude-bin.test.ts    # PATHEXT+overrides on Windows
    bun test test/gstack-paths.test.ts         # state-root resolution

package.json: new test:free + test:windows scripts.

Honest about scope (codex-flagged): this does NOT make the full free suite
Windows-safe. The 25 excluded tests need POSIX-only surfaces ported off shell
primitives (test/ship-version-sync.test.ts:72 hardcodes /bin/bash, etc).
Tracked as a P4 follow-up TODO. Full Windows parity is the next wave; this
release ships the curated lane.

Tests: test/test-free-shards.test.ts has 14 unit tests covering enumeration,
paid-eval filtering, Windows-fragility detection (POSIX patterns + safe code),
and stable sharding determinism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-04-27 23:02:08 -07:00
parent 87ce4c696f
commit 8745f89ad4
4 changed files with 471 additions and 1 deletions
+57
View File
@@ -0,0 +1,57 @@
name: Windows Free Tests
# Curated subset of the free test suite that runs on windows-latest.
#
# Codex's v1.18.0.0 review flagged that the existing evals.yml workflow uses
# a Linux container, so a windows-latest matrix entry there isn't a drop-in.
# This workflow is non-container, runs the curated Windows-safe subset, plus
# targeted resolver tests that exercise the Bun.which-based claude binary
# resolution + the GSTACK_CLAUDE_BIN override path on Windows.
#
# What this DOES NOT do (out of scope for v1.18.0.0):
# - Run the full free suite on Windows. The 24 tests that hardcode /bin/sh,
# spawn('sh',...), or raw /tmp/ paths are excluded by scripts/test-free-shards.ts
# --windows-only. They need POSIX-bound surfaces to be ported off shell
# primitives before they can run on Windows. Tracked as a follow-up TODO.
# - Run Playwright/browser-backed tests. Browse server bring-up on Windows is
# a separate concern (PR #1238 windows-pty-bun-pty-fix is in flight).
on:
pull_request:
branches: [main]
workflow_dispatch:
concurrency:
group: windows-free-${{ github.head_ref }}
cancel-in-progress: true
jobs:
windows-free-tests:
runs-on: windows-latest
timeout-minutes: 15
steps:
- uses: actions/checkout@v4
- uses: oven-sh/setup-bun@v1
with:
bun-version: latest
- name: Install dependencies
run: bun install --frozen-lockfile
- name: Show curated subset (for build log audit trail)
run: bun run scripts/test-free-shards.ts --windows-only --list
shell: bash
- name: Run curated Windows-safe subset
run: bun run test:windows
shell: bash
- name: Targeted Claude resolver tests (real PATHEXT coverage on Windows)
run: bun test browse/test/claude-bin.test.ts
shell: bash
- name: gstack-paths helper test (resolves $GSTACK_STATE_ROOT etc. on Windows)
run: bun test test/gstack-paths.test.ts
shell: bash