gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-06-18 15:50:11 +02:00

Garry Tan 06f25ca387 feat: LLM-as-judge evals for SKILL.md documentation quality

4 eval tests using the Anthropic API (claude-haiku, ~$0.01-0.03 per run):
- Command reference table: clarity/completeness/actionability >= 4/5
- Snapshot flags section: same thresholds
- browse/SKILL.md overall quality
- Regression: generated version must score >= hand-maintained baseline

Requires ANTHROPIC_API_KEY. Auto-skips without it.
Run: bun run test:eval (or ANTHROPIC_API_KEY=sk-... bun test test/skill-llm-eval.test.ts)

2026-03-13 15:59:11 -07:00
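The pass/fail rules in this commit message (every dimension >= 4/5, generated docs no worse than the hand-maintained baseline, auto-skip without ANTHROPIC_API_KEY) can be sketched as plain TypeScript predicates. This is a minimal sketch, not the code in test/skill-llm-eval.test.ts: the `JudgeScores` shape and the `meetsThreshold`, `noRegression`, and `shouldRunEvals` helpers are hypothetical names, and comparing against the baseline dimension by dimension (rather than on one overall score) is an assumption.

```typescript
// Hypothetical shape of one LLM-judge verdict; the real test file may differ.
interface JudgeScores {
  clarity: number; // 1-5
  completeness: number; // 1-5
  actionability: number; // 1-5
}

// Threshold from the commit message: each dimension must be >= 4 out of 5.
const MIN_SCORE = 4;

// True when every judged dimension reaches the minimum score.
function meetsThreshold(s: JudgeScores, min: number = MIN_SCORE): boolean {
  return s.clarity >= min && s.completeness >= min && s.actionability >= min;
}

// Regression rule (assumed per-dimension): the generated doc must score at
// least as high as the hand-maintained baseline on every dimension.
function noRegression(generated: JudgeScores, baseline: JudgeScores): boolean {
  return (
    generated.clarity >= baseline.clarity &&
    generated.completeness >= baseline.completeness &&
    generated.actionability >= baseline.actionability
  );
}

// Auto-skip guard: evals run only when a non-empty API key is present.
function shouldRunEvals(env: Record<string, string | undefined>): boolean {
  return Boolean(env.ANTHROPIC_API_KEY);
}
```

In a Bun test file the guard would be called as `shouldRunEvals(process.env)`, and the scores would come from parsing the judge model's reply; the API call itself is deliberately left out of this sketch.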

helpers

feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding

2026-03-13 15:43:41 -07:00

gen-skill-docs.test.ts

test: quality evals for generated SKILL.md descriptions

2026-03-13 15:56:24 -07:00

skill-e2e.test.ts

feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding

2026-03-13 15:43:41 -07:00

skill-llm-eval.test.ts

feat: LLM-as-judge evals for SKILL.md documentation quality

2026-03-13 15:59:11 -07:00

skill-parser.test.ts

test: Tier 1 static validation — 34 tests for SKILL.md command correctness

2026-03-13 15:43:34 -07:00

skill-validation.test.ts

test: Tier 1 static validation — 34 tests for SKILL.md command correctness

2026-03-13 15:43:34 -07:00
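The Tier 1 static checks in skill-parser.test.ts and skill-validation.test.ts are described only as validating "SKILL.md command correctness". One way such a check could work is sketched here, assuming commands appear on lines of the form `$B <command> ...` and that a set of known commands is available. Both the `$B` invocation syntax and the `KNOWN_COMMANDS` contents are illustrative assumptions, not taken from the gstack source, and `findUnknownCommands` is a hypothetical helper.

```typescript
// Hypothetical registry of valid commands; the real list would come from gstack's source.
const KNOWN_COMMANDS = new Set<string>(["goto", "click", "snapshot"]);

// Assumed invocation syntax: a line such as "$B goto https://example.com".
const INVOCATION = /^\s*\$B\s+([a-z][a-z0-9-]*)/;

// Scans every line of a SKILL.md body and returns command names that are not
// in the known set, in the order they appear.
function findUnknownCommands(
  markdown: string,
  known: Set<string> = KNOWN_COMMANDS,
): string[] {
  const unknown: string[] = [];
  for (const line of markdown.split("\n")) {
    const name = INVOCATION.exec(line)?.[1];
    if (name !== undefined && !known.has(name)) unknown.push(name);
  }
  return unknown;
}
```

A test would then assert that `findUnknownCommands(skillMd)` returns an empty array, so a typo such as `$B clik` in the docs fails the suite without any browser or API call.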