Commit Graph

24 Commits

Author SHA1 Message Date
tdurieux b37a814f3a improve queue 2026-05-07 14:58:36 +03:00
tdurieux 7a163f2d35 Fix streamer crash and misclassified transient GitHub errors
Add missing error handler on the anonymizer transform stream in the
streamer route — without it, an upstream error tears down the pipe and
the anonymizer emits an unhandled error that crashes the process
(surfacing as ECONNRESET to the main server).

Classify transient network errors (ReadError, ECONNRESET, ETIMEDOUT)
as upstream_error/502 instead of file_not_found/404 so they are
distinguishable in logs and don't cache-poison downstream.

Update handleError tests to match the existing sanitization behavior
that returns internal_error for non-AnonymousError instances.
2026-05-07 07:44:15 +03:00
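
The error classification described in 7a163f2d35 can be sketched roughly as follows. This is a minimal illustration, not the project's code: `classifyUpstreamError`, `ClassifiedError`, and `TRANSIENT_MARKERS` are hypothetical names, and treating every non-transient error as file_not_found/404 is an assumption made only to keep the example small.

```typescript
// Hypothetical names throughout; only the error markers and the
// upstream_error/502 vs file_not_found/404 split come from the commit message.
interface ClassifiedError {
  code: "upstream_error" | "file_not_found";
  status: 502 | 404;
}

// Transient network failures named in the commit message.
const TRANSIENT_MARKERS = ["ReadError", "ECONNRESET", "ETIMEDOUT"];

function classifyUpstreamError(err: { name?: string; code?: string }): ClassifiedError {
  // An error counts as transient if either its name or its code matches a marker.
  const isTransient = TRANSIENT_MARKERS.some(
    (marker) => err.name === marker || err.code === marker
  );
  // Assumption: anything that is not transient keeps the old 404 classification.
  return isTransient
    ? { code: "upstream_error", status: 502 }
    : { code: "file_not_found", status: 404 };
}
```

Keeping transient failures out of the 404 bucket means a cache or client that remembers "not found" answers will not remember a network hiccup as a missing file.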
tdurieux 67cb2538b1 fix old github download repos 2026-05-06 19:37:16 +03:00
tdurieux da78708b7b Improve error handling 2026-05-06 18:43:36 +03:00
tdurieux 3613c895c8 improve logging 2026-05-06 16:31:10 +03:00
tdurieux 371693dc3b fix: end stream for files <= 150 bytes in resolveLfsPointer
The LFS-pointer probe buffered up to 150 bytes before deciding whether
to forward the blob or swap to the raw URL. For blobs that fit entirely
in the probe, decide() ran from the source's end event and attached
data/end listeners to an already-ended stream, so out.end() was never
called. The response hung until upstream timed out and storage.write
left an incomplete cached copy, which then forced a re-fetch on every
subsequent read.

Pass a sourceEnded flag through decide() and end the output directly
when the source has already finished. Also skip the GitHub blob fetch
when the tree size is already over MAX_FILE_SIZE, surfacing
file_too_big instead of a translated 422.
2026-05-06 13:10:25 +03:00
tdurieux 7dd6d872e9 Improve error handling 2026-05-06 12:44:07 +03:00
tdurieux e1cf8e3a00 Improve error handling 2026-05-06 12:26:12 +03:00
tdurieux c2d43164d0 error logging improvement, regex fix 2026-05-06 11:16:12 +03:00
tdurieux 27d6b56da7 fix multiple bugs 2026-05-05 12:34:03 +03:00
tdurieux f413a30313 fix(cache): make Zip-source caches atomic and robust to partial state
Follow-up to the GitHubStream cache fixes. The same poisoned-cache
class existed in the GitHubDownload path and a few related spots:

- GitHubDownload.download: wipe pre-existing state before extracting
  and write a .anon-complete marker only after a successful extract.
  On error, rm the partial cache so a retry starts clean. getFileContent
  and getFiles now gate on the marker instead of "any file/folder
  exists," so a half-extracted tree can never be served as canonical.
- GitHubDownload.getFileContent: validate cached file size against the
  upstream FileModel size (via the new AnonymizedFile.size()), same
  guard as GitHubStream. getFiles filters the marker from the listing.
- FileSystem.listFiles: drop the bogus stats.ino.toString() as sha.
  An inode isn't a content hash; anything comparing it to a Git blob
  sha would silently disagree. Leave undefined.
- S3.write: remove the fire-and-forget data.on("error") -> this.rm(...).
  Multipart Upload doesn't commit partial objects, so there was nothing
  to clean up, and the handler raced retries and could delete a
  previously-good object on a transient source-stream hiccup. The
  size-validated read path recovers from any other undersized objects.
- GitHubStream.resolveLfsPointer: drop the post-decision early-return
  in blobStream.on("error"). The early return is currently redundant
  with the inner listener, but dropping it removes a future-refactor
  footgun.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 08:54:42 +03:00
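
The marker gating described in f413a30313 might look something like the sketch below. It works on a plain list of directory entry names so that it stays self-contained; `isCacheComplete` and `listCachedFiles` are hypothetical helpers, and only the `.anon-complete` marker name is taken from the commit message.

```typescript
// Marker name from the commit message; everything else here is illustrative.
const COMPLETE_MARKER = ".anon-complete";

// A cached tree is trusted only when the marker is present, not merely
// because some file or folder exists.
function isCacheComplete(entries: readonly string[]): boolean {
  return entries.indexOf(COMPLETE_MARKER) !== -1;
}

// Returns the user-visible listing, or null when the cache cannot be trusted
// (assumption: the caller then wipes the directory and downloads again).
function listCachedFiles(entries: readonly string[]): string[] | null {
  if (!isCacheComplete(entries)) {
    return null;
  }
  // The marker is bookkeeping and must not show up in the listing.
  return entries.filter((name) => name !== COMPLETE_MARKER);
}
```

Because the marker is written only after a successful extract, a crash halfway through leaves a directory that this check rejects rather than serves.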
tdurieux 9adff11e74 fix(cache): atomic file writes and size-validated cache reads
A failed/interrupted GitHub fetch could leave a 0-byte or truncated
file in the local cache. Subsequent reads happily streamed the empty
content as the file's body — visible to users as an "Empty file" with
HTTP 200. Reproduced on artifact-70B6/Lethe/configs.py (#694).

- FileSystem.write: stream into a sibling .tmp and rename into place
  only on finish. Stream errors discard the tmp and leave any prior
  cached file untouched. Drop the utf-8 encoding that was silently
  corrupting binary blobs.
- GitHubStream.getFileContentCache: accept an expected size and treat
  cached.size < expected as a poisoned cache (truncated fetch) → rm
  and re-fetch. cached.size >= expected is accepted, which keeps
  Git LFS-resolved files (whose FileModel.size is the pointer size)
  working.
- AnonymizedFile: expose size() and pass it through to the streamer
  alongside sha so the cache check has the upstream size.

Existing poisoned entries self-heal on next access.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 08:47:41 +03:00
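
The size check described in 9adff11e74 reduces to a small decision rule, sketched here under stated assumptions. `checkCachedSize` and `CacheDecision` are hypothetical names, and serving the cached copy when no expected size is known is an assumption rather than something the commit message states.

```typescript
type CacheDecision = "serve" | "refetch";

// cachedSize: bytes on disk; expectedSize: upstream FileModel size, if known.
function checkCachedSize(cachedSize: number, expectedSize?: number): CacheDecision {
  // Assumption: without an upstream size there is nothing to validate against.
  if (expectedSize === undefined) {
    return "serve";
  }
  // A smaller cached file is treated as a truncated fetch (poisoned cache).
  // Equal or larger is accepted; larger happens for Git LFS files, whose
  // FileModel size is the size of the pointer, not of the resolved object.
  return cachedSize < expectedSize ? "refetch" : "serve";
}
```

For example, a 0-byte cached file with an expected size of 120 yields "refetch", while a 5000-byte resolved LFS object with a 130-byte pointer size yields "serve".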
tdurieux f0f6436370 feat: resolve Git LFS pointers via the raw URL endpoint
Files tracked by Git LFS used to come out as the pointer text:

    version https://git-lfs.github.com/spec/v1
    oid sha256:...
    size ...

…because GitHub's blob API returns the pointer, not the resolved
content. Detect that prefix on the first ~150 bytes of the blob stream
and switch to a fresh fetch via the web raw URL
(github.com/<owner>/<repo>/raw/<commit>/<path>), which auto-redirects
to media.githubusercontent.com and resolves the LFS object — auth
header carries through. Non-LFS files are forwarded through the
existing pipeline unchanged.

Fixes #95.
2026-05-04 12:18:55 +02:00
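
The detection step described in f0f6436370 could be sketched as below. The helper names `looksLikeLfsPointer` and `rawUrl` are hypothetical, the probe is modelled as an already-decoded string rather than a stream, and percent-encoding each path segment is an assumption about how the URL would be built.

```typescript
// First line of a Git LFS pointer file, as quoted in the commit message.
const LFS_PREFIX = "version https://git-lfs.github.com/spec/v1";

// probe: the first ~150 bytes of the blob, decoded as text.
function looksLikeLfsPointer(probe: string): boolean {
  return probe.slice(0, LFS_PREFIX.length) === LFS_PREFIX;
}

// Builds the web raw URL in the shape given in the commit message:
// github.com/<owner>/<repo>/raw/<commit>/<path>
function rawUrl(owner: string, repo: string, commit: string, path: string): string {
  // Assumption: encode each segment but keep "/" as the separator.
  const encodedPath = path.split("/").map(encodeURIComponent).join("/");
  return `https://github.com/${owner}/${repo}/raw/${commit}/${encodedPath}`;
}
```

A caller would forward the blob unchanged when `looksLikeLfsPointer` returns false, and otherwise start a fresh fetch against the URL from `rawUrl`.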
tdurieux a5f66d6844 multiple fixes 2026-05-03 15:30:54 +02:00
Thomas Durieux f4209110c7 Fix all 93 ESLint issues (3 errors, 90 warnings) (#666) 2026-04-15 09:04:22 +02:00
Thomas Durieux 655ae92c4c Remove OpenTelemetry tracing infrastructure (#662) 2026-04-15 04:39:08 +02:00
Thomas Durieux f3641c8ce3 Set up CI with ESLint linter and Mocha test runner (#661) 2026-04-15 04:34:03 +02:00
tdurieux dcf483ea03 feat: improve download anonymized repository 2024-05-06 11:52:32 +02:00
tdurieux 17abc47d08 fix: fix webview on root repo 2024-04-28 08:08:39 +01:00
tdurieux a86e050f8b fix: handle empty repository 2024-04-26 13:48:32 +01:00
tdurieux 710f7328e7 feat: flatten file tree for better performance 2024-04-26 10:32:09 +01:00
tdurieux 1d4bab7866 fix: fix webview & improve download progress 2024-04-03 18:25:33 +01:00
tdurieux db67f53b2c fix: fix GitHubDownload 2024-04-03 13:24:34 +01:00
tdurieux 4d12641c7e feat: introduce streamers that handle the stream and anonymization from github 2024-04-03 11:13:01 +01:00