Add missing error handler on the anonymizer transform stream in the
streamer route — without it, an upstream error tears down the pipe and
the anonymizer emits an unhandled error that crashes the process
(surfacing as ECONNRESET to the main server).
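A minimal sketch of the shape of the fix, with all three streams as
stand-ins for the route's actual source, transform, and response:

    import { PassThrough, Transform } from "stream";

    const content = new PassThrough();   // stand-in for the upstream blob stream
    const anonymizer = new Transform({   // stand-in for the anonymizer transform
      transform(chunk, _enc, cb) {
        cb(null, chunk);
      },
    });

    // The fix: give the transform its own error handler. When an upstream
    // error tears down the pipe, it now surfaces here instead of being
    // emitted as an unhandled "error" event that crashes the process.
    anonymizer.on("error", (error) => {
      console.error("anonymizer stream error:", error); // the route would invoke its error handling here
    });

    content.pipe(anonymizer);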
Classify transient network errors (ReadError, ECONNRESET, ETIMEDOUT)
as upstream_error/502 instead of file_not_found/404 so they are
distinguishable in logs and don't cache-poison downstream.
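A sketch of the classification; the returned type/status shape is an
assumption (the real code presumably constructs an AnonymousError):

    const TRANSIENT_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

    function classifyFetchError(err: NodeJS.ErrnoException): {
      type: "upstream_error" | "file_not_found";
      status: 502 | 404;
    } {
      if (err.name === "ReadError" || TRANSIENT_CODES.has(err.code ?? "")) {
        // Transient network failure: report 502 so logs can tell it apart
        // and nothing downstream caches it as a missing file.
        return { type: "upstream_error", status: 502 };
      }
      return { type: "file_not_found", status: 404 };
    }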
Update handleError tests to match the existing sanitization behavior
that returns internal_error for non-AnonymousError instances.
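A hedged sketch of the updated expectation, assuming a Jest-style test;
mockResponse, handleError's signature, and the 500 status are assumptions
(internal_error is the behavior the commit pins down):

    it("sanitizes non-AnonymousError instances to internal_error", () => {
      const res = mockResponse();          // hypothetical express Response mock
      handleError(new Error("boom"), res); // a plain Error, not an AnonymousError
      expect(res.statusCode).toBe(500);    // status assumed
      expect(res.jsonBody).toMatchObject({ error: "internal_error" });
    });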
The LFS-pointer probe buffered up to 150 bytes before deciding whether
to forward the blob or swap to the raw URL. For blobs that fit entirely
in the probe, decide() ran from the source's end event and attached
data/end listeners to an already-ended stream, so out.end() was never
called. The response hung until upstream timed out and storage.write
left an incomplete cached copy, which then forced a re-fetch on every
subsequent read.
Pass a sourceEnded flag through decide() and end the output directly
when the source has already finished. Also skip the GitHub blob fetch
when the tree size is already over MAX_FILE_SIZE, surfacing
file_too_big instead of a translated 422.
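A sketch of the corrected control flow, with decide()'s surroundings
assumed from the description above; isLfsPointer and fetchRawBlob are
illustrative names:

    import { PassThrough, Readable } from "stream";

    declare function isLfsPointer(probe: Buffer): boolean; // hypothetical helper
    declare function fetchRawBlob(): Readable;             // hypothetical re-fetch via raw URL

    function decide(
      probe: Buffer,        // the up-to-150-byte buffered prefix
      source: Readable,     // the original blob stream
      out: PassThrough,     // the stream handed to the caller
      sourceEnded: boolean  // new flag: source already emitted "end"
    ): void {
      if (isLfsPointer(probe)) {
        fetchRawBlob().pipe(out); // swap to the web raw URL
        return;
      }
      out.write(probe);
      if (sourceEnded) {
        // The whole blob fit in the probe; data/end listeners on the
        // already-ended source would never fire, so end out directly.
        out.end();
      } else {
        source.pipe(out); // forward the rest as before
      }
    }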
Follow-up to the GitHubStream cache fixes. The same class of
poisoned-cache bug existed in the GitHubDownload path and a few related spots:
- GitHubDownload.download: wipe pre-existing state before extracting
and write a .anon-complete marker only after a successful extract.
On error, rm the partial cache so a retry starts clean. getFileContent
and getFiles now gate on the marker instead of "any file/folder
exists," so a half-extracted tree can never be served as canonical
(a sketch of the marker discipline follows this list).
- GitHubDownload.getFileContent: validate cached file size against the
upstream FileModel size (via the new AnonymizedFile.size()), same
guard as GitHubStream. getFiles filters the marker from the listing.
- FileSystem.listFiles: drop the bogus stats.ino.toString() as sha.
An inode isn't a content hash; anything comparing it to a Git blob
sha would silently disagree. Leave undefined.
- S3.write: remove the fire-and-forget data.on("error") -> this.rm(...).
Multipart Upload doesn't commit partial objects, so there was nothing
to clean up, and the handler raced retries and could delete a
previously-good object on a transient source-stream hiccup. The
size-validated read path recovers from any other undersized objects.
- GitHubStream.resolveLfsPointer: drop the post-decision early-return
in blobStream.on("error"). Currently redundant with the inner
listener, but removes the future-refactor footgun.
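A sketch of the marker discipline in GitHubDownload.download; extractTo
and the path layout are assumptions for illustration:

    import * as fs from "fs/promises";
    import * as path from "path";

    const MARKER = ".anon-complete";

    declare function extractTo(repoPath: string): Promise<void>; // hypothetical

    async function download(repoPath: string): Promise<void> {
      // Wipe any pre-existing (possibly partial) state before extracting.
      await fs.rm(repoPath, { recursive: true, force: true });
      try {
        await extractTo(repoPath);
        // Only a fully successful extract gets the marker.
        await fs.writeFile(path.join(repoPath, MARKER), "");
      } catch (err) {
        // Remove the partial cache so a retry starts clean.
        await fs.rm(repoPath, { recursive: true, force: true });
        throw err;
      }
    }

    // getFileContent/getFiles gate on this instead of "any file exists".
    function isExtractComplete(repoPath: string): Promise<boolean> {
      return fs.access(path.join(repoPath, MARKER)).then(
        () => true,
        () => false
      );
    }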
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A failed/interrupted GitHub fetch could leave a 0-byte or truncated
file in the local cache. Subsequent reads happily streamed the empty
content as the file's body — visible to users as an "Empty file" with
HTTP 200. Reproduced on artifact-70B6/Lethe/configs.py (#694).
- FileSystem.write: stream into a sibling .tmp and rename into place
only on finish. Stream errors discard the tmp and leave any prior
cached file untouched. Drop the utf-8 encoding that was silently
corrupting binary blobs (see the first sketch after this list).
- GitHubStream.getFileContentCache: accept an expected size and treat
cached.size < expected as a poisoned cache (truncated fetch) → rm
and re-fetch. cached.size >= expected is accepted, which keeps
Git LFS-resolved files (whose FileModel.size is the pointer size)
working (see the second sketch after this list).
- AnonymizedFile: expose size() and pass it through to the streamer
alongside sha so the cache check has the upstream size.
Existing poisoned entries self-heal on next access.
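A sketch of the write-then-rename change to FileSystem.write (the exact
method signature is assumed); note no encoding is passed, so binary
blobs pass through byte-for-byte:

    import { createWriteStream } from "fs";
    import { rename, rm } from "fs/promises";
    import { pipeline } from "stream/promises";
    import type { Readable } from "stream";

    async function write(dest: string, data: Readable): Promise<void> {
      const tmp = dest + ".tmp"; // sibling tmp: same filesystem, so rename is atomic
      try {
        await pipeline(data, createWriteStream(tmp));
        await rename(tmp, dest); // only a fully written stream lands at dest
      } catch (err) {
        await rm(tmp, { force: true }); // discard the tmp; any prior cache stays intact
        throw err;
      }
    }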
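A companion sketch of the size gate in getFileContentCache, with the
helper name assumed and `expected` coming from the upstream FileModel
via the new AnonymizedFile.size():

    import { stat, rm } from "fs/promises";

    async function cachedCopyIsUsable(
      cachedPath: string,
      expected: number
    ): Promise<boolean> {
      const cached = await stat(cachedPath);
      if (cached.size < expected) {
        await rm(cachedPath); // truncated fetch: poisoned, force a re-fetch
        return false;
      }
      // >= expected is accepted: for Git LFS-resolved files FileModel.size
      // is the pointer size, so the resolved content is legitimately larger.
      return true;
    }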
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Files tracked by Git LFS used to come out as the pointer text:

    version https://git-lfs.github.com/spec/v1
    oid sha256:...
    size ...
…because GitHub's blob API returns the pointer, not the resolved
content. Detect that prefix on the first ~150 bytes of the blob stream
and switch to a fresh fetch via the web raw URL
(github.com/<owner>/<repo>/raw/<commit>/<path>), which auto-redirects
to media.githubusercontent.com and resolves the LFS object — auth
header carries through. Non-LFS files are forwarded through the
existing pipeline unchanged.
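A sketch of the detection and fallback, following the ~150-byte probe
described above; buildRawUrl is illustrative only:

    const LFS_PREFIX = Buffer.from("version https://git-lfs.github.com/spec/v1");

    function isLfsPointer(probe: Buffer): boolean {
      // Only the prefix is needed; pointer files start with this exact line.
      return probe.subarray(0, LFS_PREFIX.length).equals(LFS_PREFIX);
    }

    function buildRawUrl(owner: string, repo: string, commit: string, filePath: string): string {
      // github.com/<owner>/<repo>/raw/<commit>/<path> auto-redirects to
      // media.githubusercontent.com and serves the resolved LFS object;
      // the auth header carries through the redirect.
      return `https://github.com/${owner}/${repo}/raw/${commit}/${filePath}`;
    }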
Fixes #95.