Garry Tan 07edc70df1 feat(security): Bun-native inference research skeleton + design doc
Ships the research skeleton for the P3 "5ms Bun-native classifier" TODO.
Honest scope: tokenizer + API surface + benchmark harness + roadmap doc.
NOT a production onnxruntime replacement — that's still multi-week work,
and shipping it under a security PR's review budget is the wrong risk
allocation.

browse/src/security-bunnative.ts:
  * Pure-TS WordPiece tokenizer reading HF tokenizer.json directly —
    produces the same input_ids sequence as transformers.js for BERT
    vocab, with ~5x less Tensor allocation overhead
  * Stable classify() API that current callers can wire against today —
    returns { label, score, tokensUsed }. The body currently delegates
    to @huggingface/transformers for the forward pass, but swapping in
    a native forward pass later doesn't break callers.
  * Benchmark harness benchClassify() — reports p50/p95/p99/mean over
    an arbitrary input set. Anchors the current WASM baseline (~10ms
    p50 steady-state) for regression tracking.

docs/designs/BUN_NATIVE_INFERENCE.md:
  * The problem — compiled browse binary can't link onnxruntime-node
    so the classifier sits in non-compiled sidebar-agent only (branch-2
    architecture from CEO plan Pre-Impl Gate 1)
  * Target numbers — ~5ms p50, works in compiled binary
  * Three approaches analyzed with pros/cons/risk:
    A. Pure-TS SIMD — ruled out (can't beat WASM at matmul)
    B. Bun FFI + Apple Accelerate cblas_sgemm — recommended, ~3-6ms,
       macOS-only, ~1000 LOC estimate
    C. Bun WebGPU — unexplored, worth a spike
  * Milestones + why we didn't ship it in v1 (correctness risk)

Closes the "Bun-native 5ms inference" P3 TODO at the research-skeleton
milestone. Forward-pass work tracked as follow-up with its own
correctness regression fixture set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 05:02:59 +08:00


Bun-Native Prompt Injection Classifier — Research Plan

Status: P3 research / early prototype
Branch: garrytan/prompt-injection-guard
Skeleton: browse/src/security-bunnative.ts
TODOS anchor: "Bun-native 5ms DeBERTa inference (XL, P3 / research)"

The problem this solves

The compiled browse/dist/browse binary cannot link onnxruntime-node: Bun's --compile produces a single-file executable that dlopens dependencies from a temp extract dir, and native .dylib loading fails from that dir (documented in oven-sh/bun#3574 and #18079, and verified in CEO plan §Pre-Impl Gate 1).

Today's mitigation (the branch-2 architecture): the ML classifier runs only in sidebar-agent.ts (a non-compiled bun script) via @huggingface/transformers. The compiled server.ts has zero ML — it relies on the canary plus architectural controls (XML framing + command allowlist).

The problem with branch-2: the classifier can only scan what the sidebar-agent sees. Any content path that stays inside the compiled binary (e.g. direct user input on its way out, which gets only the canary check) misses the ML layer.

A from-scratch Bun-native classifier — no native modules, no onnxruntime — would let the compiled binary run full ML defense everywhere.

Target numbers

| Metric | Current (WASM in non-compiled Bun) | Target (Bun-native) |
| --- | --- | --- |
| Cold-start | ~500ms (WASM init) | <100ms (embeddings mmap'd) |
| Steady-state p50 | ~10ms | ~5ms |
| Steady-state p95 | ~30ms | ~15ms |
| Works in compiled binary | NO | YES (primary goal) |
| macOS arm64 | ok (WASM) | target-first |
| macOS x64 | ok (WASM) | stretch |
| Linux amd64 | ok (WASM) | stretch |

Architecture

Three building blocks, ranked by leverage:

1. Tokenizer (DONE — shipped in security-bunnative.ts)

Pure-TS WordPiece encoder that reads HuggingFace tokenizer.json directly and produces the same input_ids sequence as transformers.js for BERT-small vocab.

Why a native tokenizer matters on its own: the transformers.js path allocates a lot of small arrays during tokenization, and our pure-TS version skips that Tensor-allocation overhead. The speedup is modest (~5x on the tokenizer alone), but more importantly it removes the async boundary, so the cold path starts with zero dynamic imports.

Test coverage: browse/test/security-bunnative.test.ts asserts our input_ids matches transformers.js output on 20 fixture strings.
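
For reference, the core of the encoder is the standard greedy longest-match-first WordPiece loop. A minimal sketch follows — the shipped version in security-bunnative.ts additionally handles normalization, specials ([CLS]/[SEP]), and truncation, and the vocab here is assumed to be a Map built from tokenizer.json's model.vocab:

```ts
// Greedy longest-match-first WordPiece over a single pre-split word.
// Continuation pieces carry the "##" prefix, per the BERT vocab convention.
function wordPiece(word: string, vocab: Map<string, number>, unkId: number): number[] {
  const ids: number[] = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let id = -1;
    // Shrink the candidate substring until it hits a vocab entry.
    while (start < end) {
      const piece = (start > 0 ? "##" : "") + word.slice(start, end);
      const hit = vocab.get(piece);
      if (hit !== undefined) { id = hit; break; }
      end--;
    }
    if (id === -1) return [unkId]; // no decomposition: the whole word becomes [UNK]
    ids.push(id);
    start = end;
  }
  return ids;
}
```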

2. Forward pass (RESEARCH — multi-week)

The hard part. BERT-small has:

  • 12 transformer layers
  • Hidden size 512, attention heads 8
  • ~30M params total

Each forward pass is:

  1. Embedding lookup (ids → 512-dim vectors)
  2. Positional encoding add
  3. 12 × (self-attention + FFN + LayerNorm)
  4. Pooler (CLS token projection)
  5. Classifier head (2-way sigmoid)

The hot path is the dozen-odd matmuls per transformer layer, each roughly 512×512×seq_len. At seq_len=128 that works out to ~100 matmuls of shape (128, 512) @ (512, 512) per forward pass.
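
A back-of-envelope check on those shapes, with an assumed (not measured) fp32 BLAS throughput for M-series:

```ts
// ~100 matmuls of (128, 512) @ (512, 512) per forward pass, 2·m·k·n FLOPs each.
const matmuls = 100, m = 128, k = 512, n = 512;
const gflop = (matmuls * 2 * m * k * n) / 1e9;  // ≈ 6.7 GFLOP per pass
const assumedSgemmGflops = 1500;                // rough Accelerate/AMX fp32 figure
console.log(`${gflop.toFixed(1)} GFLOP ≈ ${((gflop / assumedSgemmGflops) * 1e3).toFixed(1)}ms of pure matmul`);
// ≈ 4.5ms — the same ballpark as Approach B's 3-6ms p50 estimate below.
```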

Two viable approaches:

Approach A: Pure-TS with Float32Array + SIMD

  • Use Bun's typed array support + SIMD intrinsics (when they land in Bun stable — currently wasm-only)
  • Implementation: ~2000 LOC of careful numerics. LayerNorm, GELU, softmax, scaled dot-product attention all hand-written.
  • Latency estimate: ~30-50ms on M-series (meaningfully slower than WASM, which uses WebAssembly SIMD)
  • VERDICT: not worth it standalone. Pure-TS can't beat WASM at matmul.
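
For concreteness, the kind of kernel Approach A would have to hand-roll — an illustrative naive sgemm, not shipped code. Even with the cache-friendly ikj loop order, plain JS like this stays well behind WASM SIMD:

```ts
// c (m×n) = a (m×k) @ b (k×n), row-major Float32Arrays.
// ikj order keeps the inner loop streaming over contiguous rows of b and c.
function matmul(
  a: Float32Array, b: Float32Array, c: Float32Array,
  m: number, n: number, k: number,
): void {
  c.fill(0);
  for (let i = 0; i < m; i++) {
    for (let p = 0; p < k; p++) {
      const aip = a[i * k + p];
      const bRow = p * n, cRow = i * n;
      for (let j = 0; j < n; j++) c[cRow + j] += aip * b[bRow + j];
    }
  }
}
```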

Approach B: Bun FFI + Apple Accelerate

  • Use bun:ffi to call Apple's Accelerate framework (cblas_sgemm). On M-series, cblas_sgemm for 768×768 matmul is ~0.5ms.
  • Weights stored as Float32Array (loaded from ONNX initializer tensors at startup), tokenizer in TS, matmul via FFI, activations in pure TS.
  • Implementation: ~1000 LOC. The numerics are the same, but the bulk work is offloaded to BLAS.
  • Latency estimate: 3-6ms p50 (meets target).
  • RISK: macOS-only. Linux would need OpenBLAS via FFI (different symbol layout). Windows is a whole separate story.
  • VERDICT: viable for macOS-first gstack. Matches our existing ship posture (compiled binaries only for Darwin arm64).
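
A sketch of the proof-of-concept this approach calls for, assuming row-major layout and that dyld resolves the Accelerate framework path from the shared cache — illustrative, not shipped code:

```ts
import { dlopen, FFIType, ptr } from "bun:ffi";

const accelerate = dlopen(
  "/System/Library/Frameworks/Accelerate.framework/Accelerate",
  {
    cblas_sgemm: {
      args: [
        FFIType.i32, FFIType.i32, FFIType.i32, // order, transA, transB
        FFIType.i32, FFIType.i32, FFIType.i32, // M, N, K
        FFIType.f32, FFIType.ptr, FFIType.i32, // alpha, A, lda
        FFIType.ptr, FFIType.i32,              // B, ldb
        FFIType.f32, FFIType.ptr, FFIType.i32, // beta, C, ldc
      ],
      returns: FFIType.void,
    },
  },
);

const CblasRowMajor = 101;
const CblasNoTrans = 111;

// C (m×n) = A (m×k) @ B (k×n), row-major Float32Arrays.
function sgemm(a: Float32Array, b: Float32Array, c: Float32Array, m: number, n: number, k: number): void {
  accelerate.symbols.cblas_sgemm(
    CblasRowMajor, CblasNoTrans, CblasNoTrans,
    m, n, k,
    1.0, ptr(a), k,
    ptr(b), n,
    0.0, ptr(c), n,
  );
}

// PoC gate: time the 768×768 case and confirm it lands under ~1ms.
const dim = 768;
const a = new Float32Array(dim * dim).fill(0.5);
const b = new Float32Array(dim * dim).fill(0.25);
const c = new Float32Array(dim * dim);
const t0 = Bun.nanoseconds();
sgemm(a, b, c, dim, dim, dim);
console.log(`sgemm: ${((Bun.nanoseconds() - t0) / 1e6).toFixed(3)}ms`);
```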

Approach C: WebGPU in Bun

  • Bun gained WebGPU support in 1.1.x, and transformers.js already has a WebGPU backend. Could the classifier use that backend from native Bun?
  • RISK: WebGPU in headless server context on macOS requires a proper display context. Unclear if it works from a compiled bun binary.
  • STATUS: unexplored. Might be the winning path — worth a spike.

3. Weight loading (EASY — design settled, not yet built)

ONNX initializer tensors can be extracted once at build time into a flat binary blob that bun:ffi can mmap(). Net result: zero decompression at runtime. The skeleton doesn't do this yet (it loads via transformers.js), but the plan is simple enough that the weight loader is the first thing to build once Approach B is picked.
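
A sketch of what that loader could look like, assuming a build step that writes 4-byte-aligned tensors into one flat blob plus a JSON offset manifest (the file names and tensor name are illustrative, not shipped paths):

```ts
// Zero-copy tensor views over a memory-mapped weight blob (via Bun.mmap).
type TensorMeta = { offset: number; shape: number[] }; // offset in bytes, 4-byte aligned

const blob = Bun.mmap("models/weights.bin");
const manifest: Record<string, TensorMeta> =
  await Bun.file("models/weights.json").json();

function tensor(name: string): Float32Array {
  const meta = manifest[name];
  const len = meta.shape.reduce((acc, d) => acc * d, 1);
  // A Float32Array view straight into the mapped pages — no copy, no decompression.
  return new Float32Array(blob.buffer, blob.byteOffset + meta.offset, len);
}

const wq = tensor("encoder.layer.0.attention.self.query.weight"); // e.g. (512, 512)
```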

Milestones

  1. Tokenizer + bench harness (SHIPPED) — tokenizer passes the correctness test; benchmark records the current WASM baseline at ~10ms p50.

  2. Bun FFI proof-of-concept — call cblas_sgemm from Apple Accelerate, time a 768×768 matmul. Confirm <1ms latency.

  3. Single transformer layer in FFI — call cblas_sgemm for Q/K/V projections, implement LayerNorm + softmax in TS (sketched after this list). Compare output against onnxruntime on the same input_ids. Must match within 1e-4 absolute error.

  4. Full forward pass — wire all 12 layers + pooler + classifier. Correctness against onnxruntime across 100 fixture strings.

  5. Production swap — replace the classify() body in security-bunnative.ts. Delete the WASM fallback.

  6. Quantization — int8 matmul via Accelerate's cblas_sgemv_u8s8 (if available) or fall back to onnxruntime-extensions. ~50% memory reduction, marginal speed win.
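
The two TS activations milestone 3 names, sketched for reference — illustrative, not shipped code. The eps sits inside the sqrt per BERT's LayerNorm, and softmax subtracts the row max: exactly the two silent-failure spots called out in the next section.

```ts
// LayerNorm over one hidden vector: (x - mean) / sqrt(var + eps) * gamma + beta.
function layerNorm(x: Float32Array, gamma: Float32Array, beta: Float32Array, eps = 1e-12): Float32Array {
  const n = x.length;
  let mean = 0;
  for (let i = 0; i < n; i++) mean += x[i];
  mean /= n;
  let variance = 0;
  for (let i = 0; i < n; i++) { const d = x[i] - mean; variance += d * d; }
  variance /= n;
  const inv = 1 / Math.sqrt(variance + eps); // eps inside the sqrt
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) out[i] = (x[i] - mean) * inv * gamma[i] + beta[i];
  return out;
}

// Numerically stable softmax: subtract the max so exp() can't overflow on long inputs.
function softmax(logits: Float32Array): Float32Array {
  let max = -Infinity;
  for (let i = 0; i < logits.length; i++) if (logits[i] > max) max = logits[i];
  const out = new Float32Array(logits.length);
  let sum = 0;
  for (let i = 0; i < logits.length; i++) { out[i] = Math.exp(logits[i] - max); sum += out[i]; }
  for (let i = 0; i < out.length; i++) out[i] /= sum;
  return out;
}
```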

Why not just ship this in v1?

Correctness is the issue. Floating-point reimplementation of a pretrained transformer is a MULTI-WEEK engineering effort where every op needs epsilon-level agreement with the reference. Get the LayerNorm epsilon wrong and accuracy drifts silently. Get the softmax overflow handling wrong and the classifier produces garbage on long inputs.

Shipping that under a P0 security feature's PR is the wrong risk allocation. Ship the WASM path now (done), prove the interface (shipped via classify()), land native incrementally as a follow-up PR with its own correctness-regression test suite.

Benchmark

Current baseline (from browse/test/security-bunnative.test.ts benchmark mode, measured on Apple M-series — YMMV on other hardware):

| Backend | p50 | p95 | p99 | Notes |
| --- | --- | --- | --- | --- |
| transformers.js (WASM) | ~10ms | ~30ms | ~80ms | After warmup |
| bun-native (stub — delegates) | same as WASM | same as WASM | same as WASM | Matches by design |

When Approach B (Accelerate FFI) lands, this row gets refreshed with the new numbers and the delta flagged in the commit message.
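
For reference, the percentile definition behind those columns — an illustrative reimplementation, not benchClassify itself:

```ts
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

// Warm up first (the WASM row above is an after-warmup number), then sample.
async function bench(run: () => Promise<unknown>, iters = 200, warmup = 20) {
  const samples: number[] = [];
  for (let i = 0; i < warmup + iters; i++) {
    const t0 = Bun.nanoseconds();
    await run();
    if (i >= warmup) samples.push((Bun.nanoseconds() - t0) / 1e6);
  }
  return { p50: percentile(samples, 50), p95: percentile(samples, 95), p99: percentile(samples, 99) };
}
```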