Commit Graph

615 Commits

Author SHA1 Message Date
tdurieux aae6eae6eb handle rate limit 2026-05-06 17:50:01 +03:00
tdurieux c1e18f82a9 Improve error handling 2026-05-06 17:39:43 +03:00
tdurieux 6bad6c2f09 fix bugs and report better errors 2026-05-06 17:26:47 +03:00
tdurieux 3b27816702 fix incremental 2026-05-06 17:12:58 +03:00
tdurieux 804bbffb7a Improve error handling 2026-05-06 17:03:19 +03:00
tdurieux 48e782946a Improve error handling 2026-05-06 16:56:07 +03:00
tdurieux b2461088e8 fix test 2026-05-06 16:55:50 +03:00
tdurieux cf2f172aca fix(gist): set gist subpaths individually to avoid CastError
Mongoose treats `gist` as a nested path, not a sub-schema, so
set("gist", payload) mis-casts the inner subdoc arrays and fails
validation with 'Cast to [string] failed' at gist.files.0. Set each
subpath individually so the files/comments arrays cast correctly.
2026-05-06 16:52:48 +03:00
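A minimal sketch of the fix's shape (names hypothetical; the real code calls Mongoose's `doc.set` once per subpath):

```typescript
// Hypothetical sketch: instead of doc.set("gist", payload), expand the
// payload into per-subpath assignments so Mongoose casts each array
// (files, comments) against its own schema path.
function subpathEntries(
  prefix: string,
  payload: Record<string, unknown>
): Array<[string, unknown]> {
  return Object.entries(payload).map(([key, value]) => [
    `${prefix}.${key}`,
    value,
  ]);
}

// Usage: for (const [p, v] of subpathEntries("gist", payload)) doc.set(p, v);
```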
tdurieux dcb524c8c1 Improve error handling 2026-05-06 16:45:22 +03:00
tdurieux 3613c895c8 improve logging 2026-05-06 16:31:10 +03:00
tdurieux 873c910dd3 Improve error dashboard 2026-05-06 16:12:37 +03:00
tdurieux 6f418d6332 log AnonymousError context detail in serialized output
Surface the path/repo/url tied to an AnonymousError when it gets
serialized for logging — previously logs only carried name, message,
and httpStatus, which made file_not_found entries impossible to trace
back to a specific file or repo. Extract the existing detail formatting
out of toString() into a public detail() method, harden it against
AnonymizedFile getters that can throw, and have serializeError include
the result as a "detail" field.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 13:46:06 +03:00
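A sketch of the hardening described above (interface and names hypothetical): the `detail()` call is wrapped so an AnonymizedFile getter that throws cannot break error serialization itself.

```typescript
// Hypothetical sketch: serializeError should still succeed even when
// computing the detail string throws.
interface HasDetail {
  detail?: () => string;
}

function safeDetail(err: HasDetail): string | undefined {
  try {
    return err.detail?.();
  } catch {
    return undefined; // a throwing getter must not mask the original error
  }
}
```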
tdurieux 7dc48bccd2 fix(build): isolate parallel npm cache mounts and bump node to 22
- Give the deps and prod-deps stages distinct BuildKit cache IDs with
  sharing=locked so concurrent npm ci runs no longer race on
  /root/.npm and fail with ENOTEMPTY.
- Drop the post-install 'npm cache clean --force' in prod-deps; it
  was wiping the cache mount it had just filled.
- Update base images from node:21 to node:22 (LTS) to silence
  EBADENGINE warnings from packages requiring node >= 22.
2026-05-06 13:40:17 +03:00
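The cache-isolation idea can be sketched as a Dockerfile fragment (stage names assumed, not taken from the repo): distinct `id`s give each stage its own cache mount, and `sharing=locked` serializes any remaining contention instead of letting parallel `npm ci` runs race on `/root/.npm`.

```dockerfile
# Hypothetical sketch: separate BuildKit cache IDs per stage.
FROM node:22 AS deps
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,id=npm-deps,sharing=locked,target=/root/.npm \
    npm ci

FROM node:22 AS prod-deps
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,id=npm-prod,sharing=locked,target=/root/.npm \
    npm ci --omit=dev
# Note: no `npm cache clean --force` here — it would wipe the cache
# mount this stage just filled.
```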
tdurieux 06a098fba7 perf(deploy): faster builds and zero-downtime streamer rollover
- Multi-stage Dockerfile with BuildKit npm cache mounts and a separate
  prod-deps stage so source edits don't reinstall or prune.
- Tighter .dockerignore to shrink build context.
- Healthchecks: add start_period and tighten interval/retries so
  containers report healthy as soon as the process is actually ready
  instead of after a full polling interval.
- Move recoverStuckPreparing() off the startup critical path; the
  recovery sweep now runs in the background after app.listen.
- depends_on uses condition: service_healthy and the obsolete
  compose 'version' key is gone.
- New scripts/build.sh + scripts/deploy.sh: deploy.sh builds, exits
  early if the image is unchanged, runs a blue/green streamer swap
  (scale to 2N, wait healthy in parallel, drop olds), then recreates
  the API with --no-deps to avoid compose's depends_on re-poll.
2026-05-06 13:38:19 +03:00
tdurieux 371693dc3b fix: end stream for files <= 150 bytes in resolveLfsPointer
The LFS-pointer probe buffered up to 150 bytes before deciding whether
to forward the blob or swap to the raw URL. For blobs that fit entirely
in the probe, decide() ran from the source's end event and attached
data/end listeners to an already-ended stream, so out.end() was never
called. The response hung until upstream timed out and storage.write
left an incomplete cached copy, which then forced a re-fetch on every
subsequent read.

Pass a sourceEnded flag through decide() and end the output directly
when the source has already finished. Also skip the GitHub blob fetch
when the tree size is already over MAX_FILE_SIZE, surfacing
file_too_big instead of a translated 422.
2026-05-06 13:10:25 +03:00
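The shape of the fix can be sketched as follows (function name and signature hypothetical): `decide()` learns whether the source already ended during the probe, and if so ends the output directly rather than attaching listeners to a stream that will never emit again.

```typescript
import { PassThrough, Readable } from "node:stream";

// Hypothetical sketch of the sourceEnded flag: when the whole blob fit
// in the probe window, the buffered bytes are the entire file.
function decide(
  buffered: Buffer,
  source: Readable,
  out: PassThrough,
  sourceEnded: boolean
): void {
  out.write(buffered);
  if (sourceEnded) {
    out.end(); // file fit entirely in the probe; nothing left upstream
    return;
  }
  source.pipe(out); // larger file: keep streaming the remainder
}
```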
tdurieux 7dd6d872e9 Improve error handling 2026-05-06 12:44:07 +03:00
tdurieux e1cf8e3a00 Improve error handling 2026-05-06 12:26:12 +03:00
tdurieux c4182b5b2a fix: prevent E11000 on concurrent getRepositoryFromGitHub
Replace check-then-insert with atomic findOneAndUpdate upsert keyed on
externalId, plus a single E11000 retry fallback. Eliminates the duplicate
key race when two requests resolve the same gh_<id> concurrently.
2026-05-06 12:22:42 +03:00
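The retry fallback can be sketched generically (wrapper name hypothetical; the real code keys `findOneAndUpdate` on `externalId` with `upsert: true`): on a single E11000 duplicate-key loss of the race, run the operation once more, which now finds the document the other request inserted.

```typescript
// Hypothetical sketch: one retry on MongoDB's duplicate-key error code.
async function withE11000Retry<T>(op: () => Promise<T>): Promise<T> {
  try {
    return await op();
  } catch (err: any) {
    if (err?.code === 11000) return op(); // another request won the race; re-read
    throw err;
  }
}
```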
tdurieux 75b67448ef fix 2026-05-06 11:24:48 +03:00
tdurieux 1254c56223 fix: dedupe folder contents on re-fetch (#701)
getFiles blindly appended fetched entries to $scope.files, so
re-opening a folder duplicated its children in the tree. Drop any
existing entries at the requested path before appending.
2026-05-06 11:16:12 +03:00
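The dedupe can be sketched as a pure helper (names hypothetical): drop any existing entries under the requested folder before appending the freshly fetched ones.

```typescript
// Hypothetical sketch: re-fetching a folder replaces its subtree
// instead of appending duplicates.
interface TreeEntry {
  path: string;
}

function refreshFolder(
  files: TreeEntry[],
  folder: string,
  fetched: TreeEntry[]
): TreeEntry[] {
  const prefix = folder.endsWith("/") ? folder : folder + "/";
  const kept = files.filter((f) => !f.path.startsWith(prefix));
  return kept.concat(fetched);
}
```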
tdurieux c2d43164d0 error logging improvement, regex fix 2026-05-06 11:16:12 +03:00
dependabot[bot] e34f45522f chore(deps): bump uuid and bullmq (#691)
* chore(deps): bump uuid and bullmq

Removes [uuid](https://github.com/uuidjs/uuid). It's no longer used after updating ancestor dependency [bullmq](https://github.com/taskforcesh/bullmq). These dependencies need to be updated together.


Removes `uuid`

Updates `bullmq` from 2.4.0 to 5.76.5
- [Release notes](https://github.com/taskforcesh/bullmq/releases)
- [Commits](https://github.com/taskforcesh/bullmq/compare/v2.4.0...v5.76.5)

---
updated-dependencies:
- dependency-name: bullmq
  dependency-version: 5.76.5
  dependency-type: direct:production
- dependency-name: uuid
  dependency-version: 
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix(bullmq): adapt isRunning + getJobs typing for v5 API

Worker.isRunning became a method (was a property in v2), and
Queue.getJobs now requires a mutable JobType[] (was string[]).

* clean up

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tdurieux <durieuxthomas@hotmail.com>
Co-authored-by: Thomas Durieux <5577568+tdurieux@users.noreply.github.com>
2026-05-06 08:40:12 +02:00
tdurieux 2528f42b59 add tests 2026-05-06 08:23:22 +03:00
tdurieux 7e0f900c7a fix: resolve eslint unused-var and useless-assignment warnings
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 08:22:45 +03:00
dependabot[bot] e306adcc97 chore(deps): bump ip-address and socks (#697)
Bumps [ip-address](https://github.com/beaugunderson/ip-address) and [socks](https://github.com/JoshGlazebrook/socks). These dependencies needed to be updated together.

Updates `ip-address` from 9.0.5 to 10.2.0
- [Commits](https://github.com/beaugunderson/ip-address/commits)

Updates `socks` from 2.8.4 to 2.8.8
- [Release notes](https://github.com/JoshGlazebrook/socks/releases)
- [Commits](https://github.com/JoshGlazebrook/socks/compare/2.8.4...2.8.8)

---
updated-dependencies:
- dependency-name: ip-address
  dependency-version: 10.2.0
  dependency-type: indirect
- dependency-name: socks
  dependency-version: 2.8.8
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-06 07:21:33 +02:00
dependabot[bot] 78ff8eaa9b chore(deps): bump mongoose from 7.8.4 to 7.8.9 (#698)
Bumps [mongoose](https://github.com/Automattic/mongoose) from 7.8.4 to 7.8.9.
- [Release notes](https://github.com/Automattic/mongoose/releases)
- [Changelog](https://github.com/Automattic/mongoose/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Automattic/mongoose/commits)

---
updated-dependencies:
- dependency-name: mongoose
  dependency-version: 7.8.9
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-06 07:21:17 +02:00
tdurieux 79f555769d improve binary file detection: content sniffing + jsonl support
Files like .jsonl that mime-types doesn't know fell through to
application/octet-stream and rendered as "Unsupported binary file" in
the viewer. Replace istextorbinary with isbinaryfile for content-based
detection, and use mime-types for name-based classification with a
textual application/* allowlist.

The streaming transformer now defers classification when the name is
inconclusive and sniffs the first chunk before emitting "transform",
so route.ts and AnonymizedFile.ts get a content-aware Content-Type.
Whitelists .jsonl and .ndjson to short-circuit dataset files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 07:52:48 +03:00
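The name-based pass can be sketched like this (names hypothetical; the real code pairs the mime-types lookup with isbinaryfile content sniffing). An "unknown" result is what triggers deferring classification until the first chunk can be sniffed.

```typescript
// Hypothetical sketch: extension whitelist first, then a textual
// application/* allowlist on top of the usual text/* check.
const TEXTUAL_APPLICATION = new Set([
  "application/json",
  "application/javascript",
  "application/xml",
]);
const TEXT_EXTENSIONS = new Set([".jsonl", ".ndjson"]);

function classifyByName(
  name: string,
  mime: string | false
): "text" | "binary" | "unknown" {
  const dot = name.lastIndexOf(".");
  if (dot !== -1 && TEXT_EXTENSIONS.has(name.slice(dot))) return "text";
  if (!mime) return "unknown"; // inconclusive name: sniff the content
  if (mime.startsWith("text/") || TEXTUAL_APPLICATION.has(mime)) return "text";
  return "binary";
}
```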
tdurieux 18ce39e019 make repo error messages actionable (#550, #379, #519)
Rewrite repo_access_limited, repo_not_found, repo_empty, and
repo_not_accessible to point users at concrete next steps instead
of stating only that something failed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 07:50:41 +03:00
tdurieux 27d6b56da7 fix multiple bugs 2026-05-05 12:34:03 +03:00
tdurieux 7384638993 dark theme support 2026-05-05 10:40:12 +03:00
tdurieux c903627da0 fix build 2026-05-05 10:35:08 +03:00
tdurieux f8c91ca0af multiple fixes 2026-05-05 10:32:31 +03:00
tdurieux 5b72b630c4 fix: silent-truncation, token-refresh, and content-type bugs across hot paths
Follow-up review pass after the cache fixes turned up several bugs in
the same family — silent failures that look like success to the client,
plus content-correctness issues in the ZIP and per-file delivery paths.

- zipStream: stop calling archive.finalize() on upstream/parser errors.
  That produced a valid-looking ZIP (200 OK, archive opens) silently
  missing entries — same class as #694, but worse because the user has
  no signal anything went wrong. Destroy the response on failure
  instead so the client sees a connection drop.
- zipStream: apply per-repo image/pdf gates inside the entry handler.
  The single-file /file/... endpoint refuses to serve those types
  via AnonymizedFile.isFileSupported when image=false / pdf=false, but
  the ZIP shipped them anyway — privacy-relevant for maintainers who
  toggle image=false to suppress identifying screenshots. Threaded
  contentOptions through both ZIP entry points (direct and streamer).
- GitHubUtils.getToken: validate the OAuth token-refresh response
  before persisting. On a non-2xx response or a body without a string
  token, we used to overwrite the stored token with `undefined`, which
  then propagated as `Authorization: token undefined` to every API
  call — 401 even on public repos, with the config.GITHUB_TOKEN
  fallback unreachable because the field was no longer falsy.
- AnonymizedFile.send (streamer branch): forward Content-Type from the
  upstream streamer response. got.stream(...).pipe(res) carries body
  bytes only, so the parent response had no Content-Type and browsers
  guessed (text rendered as download, etc.). Also resolve on
  res.on("finish") in addition to "close" — keep-alive sockets stay
  open long after the response is delivered, delaying countView().
- Repository.updateIfNeeded: persist a renamed source.repositoryName
  even when the commit hasn't changed. Previously the new value lived
  in memory only and was overwritten on the next reload, so the
  rename detection ran every request.
- Repository.anonymize: stop materialising a dummy {path:"",name:""}
  FileModel for empty repos. That row collided with the special case
  in AnonymizedFile.getFileInfo and surfaced in unfiltered listings.
- streamer/route POST /: reject filePath segments containing ".." or
  empty parts. Defence in depth — the parent server validates against
  FileModel before calling, but the streamer joins filePath straight
  into the storage path, so any future caller forwarding an
  unvalidated path could traverse out of the repo root.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 09:19:05 +03:00
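The streamer's filePath guard from the last bullet can be sketched as a pure check (name hypothetical): reject any segment that is empty, ".", or "..", so a forwarded path can never traverse out of the repo root.

```typescript
// Hypothetical sketch of the defence-in-depth path validation.
function isSafeFilePath(filePath: string): boolean {
  const segments = filePath.split("/");
  return (
    segments.length > 0 &&
    segments.every((s) => s !== "" && s !== "." && s !== "..")
  );
}
```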
tdurieux f413a30313 fix(cache): make Zip-source caches atomic and robust to partial state
Follow-up to the GitHubStream cache fixes. The same poisoned-cache
class existed in the GitHubDownload path and a few related spots:

- GitHubDownload.download: wipe pre-existing state before extracting
  and write a .anon-complete marker only after a successful extract.
  On error, rm the partial cache so a retry starts clean. getFileContent
  and getFiles now gate on the marker instead of "any file/folder
  exists," so a half-extracted tree can never be served as canonical.
- GitHubDownload.getFileContent: validate cached file size against the
  upstream FileModel size (via the new AnonymizedFile.size()), same
  guard as GitHubStream. getFiles filters the marker from the listing.
- FileSystem.listFiles: drop the bogus stats.ino.toString() as sha.
  An inode isn't a content hash; anything comparing it to a Git blob
  sha would silently disagree. Leave undefined.
- S3.write: remove the fire-and-forget data.on("error") -> this.rm(...).
  Multipart Upload doesn't commit partial objects, so there was nothing
  to clean up, and the handler raced retries and could delete a
  previously-good object on a transient source-stream hiccup. The
  size-validated read path recovers from any other undersized objects.
- GitHubStream.resolveLfsPointer: drop the post-decision early-return
  in blobStream.on("error"). Currently redundant with the inner
  listener, but removes the future-refactor footgun.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 08:54:42 +03:00
tdurieux 9adff11e74 fix(cache): atomic file writes and size-validated cache reads
A failed/interrupted GitHub fetch could leave a 0-byte or truncated
file in the local cache. Subsequent reads happily streamed the empty
content as the file's body — visible to users as an "Empty file" with
HTTP 200. Reproduced on artifact-70B6/Lethe/configs.py (#694).

- FileSystem.write: stream into a sibling .tmp and rename into place
  only on finish. Stream errors discard the tmp and leave any prior
  cached file untouched. Drop the utf-8 encoding that was silently
  corrupting binary blobs.
- GitHubStream.getFileContentCache: accept an expected size and treat
  cached.size < expected as a poisoned cache (truncated fetch) → rm
  and re-fetch. cached.size >= expected is accepted, which keeps
  Git LFS-resolved files (whose FileModel.size is the pointer size)
  working.
- AnonymizedFile: expose size() and pass it through to the streamer
  alongside sha so the cache check has the upstream size.

Existing poisoned entries self-heal on next access.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 08:47:41 +03:00
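The poisoned-cache check can be sketched as a pure predicate (name hypothetical): a cached file smaller than the upstream size is a truncated fetch and must be discarded, while `>=` (not `===`) keeps LFS-resolved files working, since their FileModel.size is only the pointer size.

```typescript
// Hypothetical sketch of the size-validated cache read.
function cachedCopyIsValid(
  cachedSize: number,
  expectedSize?: number
): boolean {
  if (expectedSize === undefined) return true; // no upstream size to compare
  return cachedSize >= expectedSize; // >= tolerates LFS-resolved content
}
```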
tdurieux 53959f677c fix(repo): reuse existing record on rename to avoid externalId duplicate key
When a GitHub repo is renamed and looked up by its new name, the lookup
by name misses but a record with the same externalId still exists,
causing E11000 on save. Fall back to a lookup by externalId before
creating a new document.

Fixes #500

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 00:59:38 +03:00
tdurieux 1dde5dc308 docker: multi-stage build with alpine runtime
Split into builder (node:21-slim, full deps + tsc/gulp) and runtime
(node:21-alpine, production deps only). Drops hundreds of MB from
the published image and removes dev tooling from the runtime layer.
2026-05-05 00:52:55 +03:00
tdurieux ab266f2188 fix(i18n): add missing error translations for gist, auth, and co-author flows
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 00:50:40 +03:00
tdurieux 4cff32ec93 ci: build only linux/amd64 to avoid arm64 QEMU npm flakiness 2026-05-05 00:48:14 +03:00
tdurieux 0cfa95af77 ci: add Docker Hub publish workflow
Builds and pushes tdurieux/anonymous_github to Docker Hub on push to
main, v* tags, and manual dispatch. Multi-arch (amd64/arm64) with GHA
layer caching. Closes #478.
2026-05-05 00:44:57 +03:00
tdurieux dee406e2ea update design 2026-05-05 00:36:42 +03:00
tdurieux 49b124e188 fix regex 2026-05-05 00:07:49 +03:00
tdurieux f0bc53f093 feat: gist & co-authors 2026-05-04 13:10:44 +02:00
tdurieux f0f6436370 feat: resolve Git LFS pointers via the raw URL endpoint
Files tracked by Git LFS used to come out as the pointer text:

    version https://git-lfs.github.com/spec/v1
    oid sha256:...
    size ...

…because GitHub's blob API returns the pointer, not the resolved
content. Detect that prefix on the first ~150 bytes of the blob stream
and switch to a fresh fetch via the web raw URL
(github.com/<owner>/<repo>/raw/<commit>/<path>), which auto-redirects
to media.githubusercontent.com and resolves the LFS object — auth
header carries through. Non-LFS files are forwarded through the
existing pipeline unchanged.

Fixes #95.
2026-05-04 12:18:55 +02:00
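The probe itself reduces to a prefix check (function name hypothetical): an LFS pointer file starts with a fixed spec line, so inspecting the first bytes of the blob stream is enough to decide between forwarding and re-fetching via the raw URL.

```typescript
// Hypothetical sketch of the pointer detection on the probe bytes.
const LFS_PREFIX = "version https://git-lfs.github.com/spec/v1";

function looksLikeLfsPointer(firstBytes: Buffer): boolean {
  return (
    firstBytes.length >= LFS_PREFIX.length &&
    firstBytes.subarray(0, LFS_PREFIX.length).toString("utf8") === LFS_PREFIX
  );
}
```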
tdurieux 7ace730960 fix: fall back to default branch when resolving relative image URLs
The Anonymize form's preview built the readme baseUrl as
"https://github.com/<owner>/<repo>/raw/<source.branch>/". When the
form rendered before the branch field had populated (initial load,
or while waiting on getBranches), the URL became ".../raw//" and
the browser collapsed the empty segment, fetching ".../raw/<file>"
instead of ".../raw/<branch>/<file>". Relative <img src="./X">
references then 404'd against a path with no branch — exactly the
"branch missing" pattern in #407.

Fall back to details.defaultBranch (then "main") so the base URL is
always well-formed.

Fixes #407.
2026-05-04 12:01:56 +02:00
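The fallback chain can be sketched as follows (function name hypothetical): never emit an empty ref segment, even while the branch field is still loading.

```typescript
// Hypothetical sketch of the well-formed baseUrl construction.
function readmeBaseUrl(
  owner: string,
  repo: string,
  branch?: string,
  defaultBranch?: string
): string {
  const ref = branch || defaultBranch || "main"; // never ".../raw//"
  return `https://github.com/${owner}/${repo}/raw/${ref}/`;
}
```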
tdurieux 62a2c1cd5c fix: force-fetch repo details so newly enabled features show up
The Anonymize form used the cached RepositoryModel for hasPage,
defaultBranch, etc. — so enabling GitHub Pages (or changing the
default branch) on the source after first cache wouldn't reflect in
the UI, leaving the GitHub Pages checkbox grayed out.

Pass force=1 when loading the form's repo details so the backend
re-queries the GitHub API once. The cost is a single GET /repos/...
call per form load.

Fixes #364.
2026-05-04 11:58:44 +02:00
tdurieux ef78e8ff3c feat: preserve raw bytes when anonymization is a no-op
When the anonymizer doesn't change a slice's text, the streamer used
to push Buffer.from(out, "utf8") — which loses any invalid-UTF-8 bytes
in the input (replaced by U+FFFD via StringDecoder). Files
mistakenly classified as text (binary blobs without a known extension,
text with stray non-UTF-8 bytes, BOMs) came out corrupted even though
nothing in the term list matched.

Track the raw chunk bytes alongside the decoded `pending`. On flush —
where we have every byte buffered — emit the original buffer directly
when the output equals the input, so a pure passthrough is bit-exact.
In the streaming OVERLAP path, do the same when the decode for that
slice round-trips losslessly; fall back to encoded output otherwise
(unchanged from before for that case).

Also add the "missing_content" locale entry for the
/api/anonymize-preview route.
2026-05-04 11:52:03 +02:00
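The flush-path behavior can be sketched like this (function name hypothetical): decode, anonymize, and emit the original buffer bit-exact when nothing changed, so only a real replacement pays the re-encode cost that is lossy for invalid UTF-8.

```typescript
import { StringDecoder } from "node:string_decoder";

// Hypothetical sketch: a no-op anonymization returns the raw bytes
// untouched instead of a re-encoded (possibly U+FFFD-mangled) copy.
function passthroughOrEncode(
  raw: Buffer,
  anonymize: (text: string) => string
): Buffer {
  const text = new StringDecoder("utf8").end(raw); // bad bytes become U+FFFD
  const out = anonymize(text);
  return out === text ? raw : Buffer.from(out, "utf8");
}
```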
tdurieux 4aeecd0fdb fix: recognize LICENSE / COPYING / etc. as text files
istextorbinary returns null for filenames with no extension, and the
isTextFile() guard treated null as "not text" — so terms in LICENSE,
COPYING, AUTHORS, README (extensionless), CHANGELOG, NOTICE, and
similar conventional filenames went through the binary passthrough
in AnonymizeTransformer and were never anonymized.

Add a small whitelist of these names ahead of the istextorbinary call.

Fixes #493.
2026-05-04 11:34:54 +02:00
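The whitelist is a simple basename check ahead of the istextorbinary call (set contents taken from the message above; names otherwise hypothetical):

```typescript
// Hypothetical sketch: extensionless conventional filenames are text.
const CONVENTIONAL_TEXT_NAMES = new Set([
  "LICENSE",
  "COPYING",
  "AUTHORS",
  "README",
  "CHANGELOG",
  "NOTICE",
]);

function isConventionalTextFile(filename: string): boolean {
  const base = filename.split("/").pop() ?? filename;
  return CONVENTIONAL_TEXT_NAMES.has(base);
}
```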
tdurieux f91db91cee wip 2026-05-04 11:30:42 +02:00
tdurieux 96d9f3cee6 fix: show explorer sidebar 'Updated' timestamp on desktop
The .leftCol-foot was hidden by default and only revealed inside the
mobile/tablet media query, so the last-update line never appeared on
wider viewports.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-04 11:28:10 +02:00