feat: test coverage catalog + failure triage (merged branches) (#285)

* feat: add bin/gstack-repo-mode — solo vs collaborative detection with caching

Detects whether a repo is solo-dev (one person does 80%+ of recent commits)
or collaborative. Uses 90-day git shortlog window with 7-day cache in
~/.gstack/projects/{SLUG}/repo-mode.json. Config override via
`gstack-config set repo_mode solo|collaborative` takes precedence over
the heuristic. Minimum 5 commits required to classify (otherwise unknown).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: test failure ownership triage — see something say something

Adds two new preamble sections to all gstack skills:
- Repo Ownership Mode: explains solo vs collaborative behavior
- See Something, Say Something: proactive issue flagging principle

Adds {{TEST_FAILURE_TRIAGE}} template variable (opt-in, used by /ship):
- Classifies test failures as in-branch vs pre-existing
- Solo mode defaults to "investigate and fix now"
- Collaborative mode offers "blame + assign GitHub issue" option
- Also offers P0 TODO and skip options

/ship Step 3 now triages test failures instead of hard-stopping on all
failures. In-branch failures still block shipping. Pre-existing failures
get user-directed triage based on repo mode.

Adds P2 TODO for gstack notes system (deferred lightweight reminder).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files for Claude and Codex hosts

All 22 Claude skills and 21 Codex skills regenerated with new preamble
sections (Repo Ownership Mode, See Something Say Something) and
{{TEST_FAILURE_TRIAGE}} resolved in ship/SKILL.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: validate repo mode values to prevent shell injection

Codex adversarial review found that unvalidated config/cache values
could be injected into shell via source <(gstack-repo-mode). Added
validate_mode() that only allows solo|collaborative|unknown — anything
else becomes "unknown". Prevents persistent code execution through
malicious config.yaml or tampered cache JSON.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: shell injection via branch names + feature-branch sampling bias

Codex code review found two issues:

P1: eval $(gstack-slug) in gstack-repo-mode executes branch names as
shell. Branch names like foo$(touch${IFS}pwned) are valid git refs and
would execute arbitrary commands. Fix: compute SLUG directly with sed
instead of eval'ing gstack-slug output.

P2: git shortlog HEAD only sees current branch history. On feature
branches that haven't merged main recently, other contributors disappear
from the sample. Fix: use git shortlog on the default branch
(origin/main) instead of HEAD.

Also improved blame lookup in collaborative triage to check both the
test file and the production code it covers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: broaden codex-host stripping test to accommodate triage section

"Investigate and fix" now appears in TEST_FAILURE_TRIAGE (not just the
Codex review step). Use CODEX_REVIEWS config string as a more specific
marker for detecting the Codex review step in Codex-hosted skills.

* fix: replace template placeholder in TODOS.md with readable text

{{TEST_FAILURE_TRIAGE}} is template syntax but TODOS.md is not processed
by gen-skill-docs — replaced with human-readable reference.

* chore: bump version and changelog (v0.9.5.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add bin/ directory to project structure in CLAUDE.md

* test: add triage resolver unit tests, plan-eng coverage audit E2E, and triage E2E

- TEST_FAILURE_TRIAGE resolver: 6 unit tests verifying all triage steps (T1-T4),
  REPO_MODE branching, and safety default for ambiguous failures
- plan-eng-coverage-audit E2E: tests /plan-eng-review coverage audit codepath
  (gap identified during eng review — existed on neither branch)
- ship-triage E2E: planted-bug fixture with in-branch (truncate null) and
  pre-existing (divide-by-zero) failures; verifies correct classification
- Touchfile entries for diff-based test selection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate stale Codex SKILL.md for retro

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-03-21 09:36:17 -07:00
committed by GitHub
parent 624c7793de
commit f954f3449a
43 changed files with 1232 additions and 18 deletions
+42 -2
View File
@@ -520,6 +520,46 @@ describe('TEST_COVERAGE_AUDIT placeholders', () => {
});
});
// --- {{TEST_FAILURE_TRIAGE}} resolver tests ---
describe('TEST_FAILURE_TRIAGE resolver', () => {
const shipSkill = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
test('contains all 4 triage steps', () => {
expect(shipSkill).toContain('Step T1: Classify each failure');
expect(shipSkill).toContain('Step T2: Handle in-branch failures');
expect(shipSkill).toContain('Step T3: Handle pre-existing failures');
expect(shipSkill).toContain('Step T4: Execute the chosen action');
});
test('T1 includes classification criteria (in-branch vs pre-existing)', () => {
expect(shipSkill).toContain('In-branch');
expect(shipSkill).toContain('Likely pre-existing');
expect(shipSkill).toContain('git diff origin/');
});
test('T3 branches on REPO_MODE (solo vs collaborative)', () => {
expect(shipSkill).toContain('REPO_MODE');
expect(shipSkill).toContain('solo');
expect(shipSkill).toContain('collaborative');
});
test('solo mode offers fix-now, TODO, and skip options', () => {
expect(shipSkill).toContain('Investigate and fix now');
expect(shipSkill).toContain('Add as P0 TODO');
expect(shipSkill).toContain('Skip');
});
test('collaborative mode offers blame + assign option', () => {
expect(shipSkill).toContain('Blame + assign GitHub issue');
expect(shipSkill).toContain('gh issue create');
});
test('defaults ambiguous failures to in-branch (safety)', () => {
expect(shipSkill).toContain('When ambiguous, default to in-branch');
});
});
// --- {{SPEC_REVIEW_LOOP}} resolver tests ---
describe('SPEC_REVIEW_LOOP resolver', () => {
@@ -691,11 +731,11 @@ describe('Codex generation (--host codex)', () => {
test('Codex review step stripped from Codex-host ship and review', () => {
const shipContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-ship', 'SKILL.md'), 'utf-8');
expect(shipContent).not.toContain('codex review --base');
expect(shipContent).not.toContain('Investigate and fix');
expect(shipContent).not.toContain('CODEX_REVIEWS');
const reviewContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-review', 'SKILL.md'), 'utf-8');
expect(reviewContent).not.toContain('codex review --base');
expect(reviewContent).not.toContain('Investigate and fix');
expect(reviewContent).not.toContain('CODEX_REVIEWS');
});
test('--host codex --dry-run freshness', () => {
+5 -3
View File
@@ -68,7 +68,7 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
'plan-eng-review-artifact': ['plan-eng-review/**'],
// Ship
'ship-base-branch': ['ship/**'],
'ship-base-branch': ['ship/**', 'bin/gstack-repo-mode'],
// Retro
'retro': ['retro/**'],
@@ -91,9 +91,11 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
// QA bootstrap
'qa-bootstrap': ['qa/**', 'browse/src/**', 'ship/**'],
// Coverage audit (shared fixture)
'ship-coverage-audit': ['ship/**', 'test/fixtures/coverage-audit-fixture.ts'],
// Coverage audit (shared fixture) + triage
'ship-coverage-audit': ['ship/**', 'test/fixtures/coverage-audit-fixture.ts', 'bin/gstack-repo-mode'],
'review-coverage-audit': ['review/**', 'test/fixtures/coverage-audit-fixture.ts'],
'plan-eng-coverage-audit': ['plan-eng-review/**', 'test/fixtures/coverage-audit-fixture.ts'],
'ship-triage': ['ship/**', 'bin/gstack-repo-mode'],
// Design
'design-consultation-core': ['design-consultation/**'],
+243
View File
@@ -2858,6 +2858,249 @@ Output the diagram directly.`,
}, 180_000);
});
// --- Plan Eng Review Coverage Audit E2E ---
describeIfSelected('Plan Eng Review Coverage Audit E2E', ['plan-eng-coverage-audit'], () => {
let planCoverageDir: string;
beforeAll(() => {
planCoverageDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-coverage-'));
// Copy plan-eng-review skill files
copyDirSync(path.join(ROOT, 'plan-eng-review'), path.join(planCoverageDir, 'plan-eng-review'));
// Use shared fixture for billing project with coverage gaps
const { createCoverageAuditFixture } = require('./fixtures/coverage-audit-fixture');
createCoverageAuditFixture(planCoverageDir);
});
afterAll(() => {
try { fs.rmSync(planCoverageDir, { recursive: true, force: true }); } catch {}
});
test('/plan-eng-review coverage audit traces plan codepaths', async () => {
const result = await runSkillTest({
prompt: `Read the file plan-eng-review/SKILL.md for the plan review workflow instructions.
You are on the feature/billing branch. The base branch is main.
This is a test project there is no remote, no PR to create.
ONLY run the Test Coverage Audit section from the plan review workflow.
Skip all other steps (architecture, code quality, performance, etc.).
The source code is in ${planCoverageDir}/src/billing.ts.
Existing tests are in ${planCoverageDir}/test/billing.test.ts.
Produce the ASCII coverage diagram showing which code paths are tested and which have gaps.
Output the diagram directly.`,
workingDirectory: planCoverageDir,
maxTurns: 15,
allowedTools: ['Bash', 'Read', 'Write', 'Edit', 'Glob', 'Grep'],
timeout: 120_000,
testName: 'plan-eng-coverage-audit',
runId,
});
logCost('/plan-eng-review coverage audit', result);
recordE2E('/plan-eng-review coverage audit', 'Plan Eng Review Coverage Audit E2E', result, {
passed: result.exitReason === 'success',
});
expect(result.exitReason).toBe('success');
// Check output contains coverage diagram elements
const output = result.output || '';
const outputLower = output.toLowerCase();
const hasGap = outputLower.includes('gap') || outputLower.includes('no test');
const hasTested = outputLower.includes('tested') || output.includes('✓') || output.includes('★');
const hasCoverage = outputLower.includes('coverage') || outputLower.includes('paths tested');
console.log(`Output has GAP markers: ${hasGap}`);
console.log(`Output has TESTED markers: ${hasTested}`);
console.log(`Output has coverage summary: ${hasCoverage}`);
// The agent MUST produce a coverage diagram with gap and tested markers
expect(hasGap || hasTested).toBe(true);
// At minimum, the agent should have read the source and test files
const readCalls = result.toolCalls.filter(tc => tc.tool === 'Read');
expect(readCalls.length).toBeGreaterThan(0);
}, 180_000);
});
// --- Triage E2E ---
describeIfSelected('Test Failure Triage E2E', ['ship-triage'], () => {
let triageDir: string;
beforeAll(() => {
triageDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-triage-'));
// Copy ship skill files
copyDirSync(path.join(ROOT, 'ship'), path.join(triageDir, 'ship'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: triageDir, stdio: 'pipe', timeout: 5000 });
// Init git repo
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create a project with a pre-existing test failure on main
fs.writeFileSync(path.join(triageDir, 'package.json'), JSON.stringify({
name: 'triage-test-app',
version: '1.0.0',
scripts: { test: 'node test/run.js' },
}, null, 2));
fs.mkdirSync(path.join(triageDir, 'src'), { recursive: true });
fs.mkdirSync(path.join(triageDir, 'test'), { recursive: true });
// Source with a bug that exists on main (pre-existing)
fs.writeFileSync(path.join(triageDir, 'src', 'math.js'), `
module.exports = {
add: (a, b) => a + b,
divide: (a, b) => a / b, // BUG: no zero-division check (pre-existing)
};
`);
// Test file that catches the pre-existing bug
fs.writeFileSync(path.join(triageDir, 'test', 'math.test.js'), `
const { add, divide } = require('../src/math');
// This test passes
if (add(2, 3) !== 5) { console.error('FAIL: add(2,3) should be 5'); process.exit(1); }
console.log('PASS: add');
// This test FAILS — pre-existing bug (divide by zero returns Infinity, not an error)
try {
const result = divide(10, 0);
if (result === Infinity) { console.error('FAIL: divide(10,0) should throw, got Infinity'); process.exit(1); }
} catch(e) {
console.log('PASS: divide zero check');
}
`);
// Test runner
fs.writeFileSync(path.join(triageDir, 'test', 'run.js'), `
require('./math.test.js');
require('./string.test.js');
`);
// Commit on main with the pre-existing bug
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial: math utils with tests']);
// Create feature branch
run('git', ['checkout', '-b', 'feature/string-utils']);
// Add new code with a new bug (in-branch)
fs.writeFileSync(path.join(triageDir, 'src', 'string.js'), `
module.exports = {
capitalize: (s) => s.charAt(0).toUpperCase() + s.slice(1),
reverse: (s) => s.split('').reverse().join(''),
truncate: (s, len) => s.substring(0, len), // BUG: no null check (in-branch)
};
`);
// Add test that catches the in-branch bug
fs.writeFileSync(path.join(triageDir, 'test', 'string.test.js'), `
const { capitalize, reverse, truncate } = require('../src/string');
if (capitalize('hello') !== 'Hello') { console.error('FAIL: capitalize'); process.exit(1); }
console.log('PASS: capitalize');
if (reverse('abc') !== 'cba') { console.error('FAIL: reverse'); process.exit(1); }
console.log('PASS: reverse');
// This test FAILS — in-branch bug (null input causes TypeError)
try {
truncate(null, 5);
console.log('PASS: truncate null');
} catch(e) {
console.error('FAIL: truncate(null, 5) threw: ' + e.message);
process.exit(1);
}
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'feat: add string utilities']);
});
afterAll(() => {
try { fs.rmSync(triageDir, { recursive: true, force: true }); } catch {}
});
test('/ship triage correctly classifies in-branch vs pre-existing failures', async () => {
const result = await runSkillTest({
prompt: `Read the file ship/SKILL.md for the ship workflow instructions.
You are on the feature/string-utils branch. The base branch is main.
This is a test project there is no remote, no PR to create.
Run the tests first:
\`\`\`bash
cd ${triageDir} && node test/run.js
\`\`\`
The tests will fail. Now run ONLY the Test Failure Ownership Triage (Steps T1-T4) from the ship workflow.
For each failing test, classify it as:
- **In-branch**: caused by changes on this branch (feature/string-utils)
- **Pre-existing**: existed before this branch (present on main)
Use git diff origin/main...HEAD (or git diff main...HEAD since there's no remote) to determine which files changed on this branch.
Output your classification for each failure clearly, labeling each as "IN-BRANCH" or "PRE-EXISTING" with your reasoning.
This is a solo repo (REPO_MODE=solo). For pre-existing failures, recommend fixing now.`,
workingDirectory: triageDir,
maxTurns: 20,
allowedTools: ['Bash', 'Read', 'Write', 'Edit', 'Glob', 'Grep'],
timeout: 180_000,
testName: 'ship-triage',
runId,
});
logCost('/ship triage', result);
const output = result.output || '';
const outputLower = output.toLowerCase();
// The triage should identify the string/truncate failure as in-branch
const hasInBranch = outputLower.includes('in-branch') || outputLower.includes('in branch') || outputLower.includes('introduced');
// The triage should identify the math/divide failure as pre-existing
const hasPreExisting = outputLower.includes('pre-existing') || outputLower.includes('pre existing') || outputLower.includes('existed before');
console.log(`Output identifies IN-BRANCH failures: ${hasInBranch}`);
console.log(`Output identifies PRE-EXISTING failures: ${hasPreExisting}`);
// Check that the string/truncate bug is classified as in-branch
const mentionsTruncate = outputLower.includes('truncate') || outputLower.includes('string');
const mentionsDivide = outputLower.includes('divide') || outputLower.includes('math');
console.log(`Mentions truncate/string (in-branch bug): ${mentionsTruncate}`);
console.log(`Mentions divide/math (pre-existing bug): ${mentionsDivide}`);
recordE2E('/ship triage', 'Test Failure Triage E2E', result, {
passed: result.exitReason === 'success' && hasInBranch && hasPreExisting,
has_in_branch_classification: hasInBranch,
has_pre_existing_classification: hasPreExisting,
mentions_truncate: mentionsTruncate,
mentions_divide: mentionsDivide,
});
expect(result.exitReason).toBe('success');
// Must classify at least one failure as in-branch AND one as pre-existing
expect(hasInBranch).toBe(true);
expect(hasPreExisting).toBe(true);
// Must mention the specific bugs
expect(mentionsTruncate).toBe(true);
expect(mentionsDivide).toBe(true);
}, 240_000);
});
// --- Codex skill E2E ---
describeIfSelected('Codex skill E2E', ['codex-review'], () => {
+57 -2
View File
@@ -1280,12 +1280,12 @@ describe('Codex skill', () => {
test('codex-host ship/review do NOT contain codex review step', () => {
const shipContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-ship', 'SKILL.md'), 'utf-8');
expect(shipContent).not.toContain('codex review --base');
expect(shipContent).not.toContain('Investigate and fix');
expect(shipContent).not.toContain('CODEX_REVIEWS');
const reviewContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-review', 'SKILL.md'), 'utf-8');
expect(reviewContent).not.toContain('codex review --base');
expect(reviewContent).not.toContain('codex_reviews');
expect(reviewContent).not.toContain('Investigate and fix');
expect(reviewContent).not.toContain('CODEX_REVIEWS');
});
test('codex integration in /plan-eng-review offers plan critique', () => {
@@ -1411,3 +1411,58 @@ describe('Codex skill validation', () => {
}
});
});
// --- Repo mode and test failure triage validation ---
describe('Repo mode preamble validation', () => {
test('generated SKILL.md preamble contains REPO_MODE output', () => {
const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
expect(content).toContain('REPO_MODE:');
expect(content).toContain('gstack-repo-mode');
});
test('generated SKILL.md contains See Something Say Something section', () => {
const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
expect(content).toContain('See Something, Say Something');
expect(content).toContain('REPO_MODE');
expect(content).toContain('solo');
expect(content).toContain('collaborative');
});
});
describe('Test failure triage in ship skill', () => {
test('ship/SKILL.md contains Test Failure Ownership Triage', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Test Failure Ownership Triage');
});
test('ship/SKILL.md triage uses git diff for classification', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('git diff origin/<base>...HEAD --name-only');
});
test('ship/SKILL.md triage has solo and collaborative paths', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('REPO_MODE');
expect(content).toContain('solo');
expect(content).toContain('collaborative');
expect(content).toContain('Investigate and fix now');
expect(content).toContain('Add as P0 TODO');
});
test('ship/SKILL.md triage has GitHub issue assignment for collaborative mode', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('gh issue create');
expect(content).toContain('--assignee');
});
test('{{TEST_FAILURE_TRIAGE}} placeholder is fully resolved in ship/SKILL.md', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).not.toContain('{{TEST_FAILURE_TRIAGE}}');
});
test('ship/SKILL.md uses in-branch language for stop condition', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('In-branch test failures');
});
});