gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-05-02 11:45:20 +02:00


Garry Tan 9dffb1ed16 feat: LLM-as-judge evals for SKILL.md documentation quality

4 eval tests using Anthropic API (claude-haiku, ~$0.01-0.03/run):
- Command reference table: clarity/completeness/actionability >= 4/5
- Snapshot flags section: same thresholds
- browse/SKILL.md overall quality
- Regression: generated version must score >= hand-maintained baseline
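The threshold and regression checks listed above could be sketched roughly as follows. This is a minimal illustration, not the code in test/skill-llm-eval.test.ts: the `JudgeScore` shape and the helper names `meetsThreshold` and `noRegression` are hypothetical, and comparing the regression per criterion (rather than on a total or average) is an assumption.

```typescript
// Hypothetical shape of a judge verdict. The three fields mirror the criteria
// named in the commit message; the real test file may structure this differently.
interface JudgeScore {
  clarity: number;
  completeness: number;
  actionability: number;
}

// Threshold from the commit message: each criterion must be >= 4 out of 5.
const MIN_SCORE = 4;

// True when every criterion meets the minimum score.
function meetsThreshold(score: JudgeScore, min: number = MIN_SCORE): boolean {
  return (
    score.clarity >= min &&
    score.completeness >= min &&
    score.actionability >= min
  );
}

// Regression check (assumed to be per criterion): the generated document
// must score at least as high as the hand-maintained baseline on each axis.
function noRegression(generated: JudgeScore, baseline: JudgeScore): boolean {
  return (
    generated.clarity >= baseline.clarity &&
    generated.completeness >= baseline.completeness &&
    generated.actionability >= baseline.actionability
  );
}
```

In a test, the judge's parsed response would be passed to `meetsThreshold`, and the generated and baseline scores to `noRegression`, with each result asserted to be true.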

Requires ANTHROPIC_API_KEY. Auto-skips without it.
Run: bun run test:eval (or ANTHROPIC_API_KEY=sk-... bun test test/skill-llm-eval.test.ts)
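The auto-skip behavior might be gated on a check like the one below. This is a sketch under stated assumptions: `hasApiKey` is a hypothetical helper, and treating an empty or whitespace-only key as missing is an assumption, not something the commit message specifies.

```typescript
// Hypothetical helper: decides whether the paid eval tests should run.
// Takes the environment as a parameter so it can be checked without
// touching the real process environment.
function hasApiKey(env: Record<string, string | undefined>): boolean {
  const key = env["ANTHROPIC_API_KEY"];
  // Assumption: an empty or whitespace-only value counts as "not set".
  return typeof key === "string" && key.trim().length > 0;
}

// Example calls with literal environments.
const withKey = hasApiKey({ ANTHROPIC_API_KEY: "sk-example" });
const withoutKey = hasApiKey({});
```

A test file could evaluate `hasApiKey(process.env)` once at load time and skip the eval suite when it returns false, so a run without a key reports skipped tests instead of failures.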

2026-03-13 20:23:31 -05:00

helpers

feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding

2026-03-13 20:23:18 -05:00

gen-skill-docs.test.ts

test: quality evals for generated SKILL.md descriptions

2026-03-13 20:23:30 -05:00

skill-e2e.test.ts

feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding

2026-03-13 20:23:18 -05:00

skill-llm-eval.test.ts

feat: LLM-as-judge evals for SKILL.md documentation quality

2026-03-13 20:23:31 -05:00

skill-parser.test.ts

test: Tier 1 static validation — 34 tests for SKILL.md command correctness

2026-03-13 20:23:18 -05:00

skill-validation.test.ts

test: Tier 1 static validation — 34 tests for SKILL.md command correctness

2026-03-13 20:23:18 -05:00