Commit Graph

324 Commits

Author SHA1 Message Date
tdurieux 8fc7ac5175 Add user ban/activate feature
Add admin endpoints to ban and activate users, block banned users
from all auth flows (OAuth, token login, bearer auth), and invalidate
existing sessions on next request. Includes frontend translation and
user detail page ban/activate buttons.
2026-05-07 05:41:12 +03:00
tdurieux d9104c2ec2 Update commit on branch refresh and validate commit exists on save
Refresh button now always updates the commit to the latest SHA instead
of preserving the stale one in edit mode. Both create and update routes
verify the commit still exists on GitHub before persisting.
2026-05-06 21:14:53 +03:00
tdurieux bd8656206a fix persistence bugs 2026-05-06 20:00:59 +03:00
tdurieux 67cb2538b1 fix download of old GitHub repos 2026-05-06 19:37:16 +03:00
tdurieux da78708b7b Improve error handling 2026-05-06 18:43:36 +03:00
tdurieux aae6eae6eb handle rate limit 2026-05-06 17:50:01 +03:00
tdurieux 6bad6c2f09 fix bugs and report better errors 2026-05-06 17:26:47 +03:00
tdurieux 3b27816702 fix incremental 2026-05-06 17:12:58 +03:00
tdurieux 804bbffb7a Improve error handling 2026-05-06 17:03:19 +03:00
tdurieux 48e782946a Improve error handling 2026-05-06 16:56:07 +03:00
tdurieux cf2f172aca fix(gist): set gist subpaths individually to avoid CastError
Mongoose treats `gist` as a nested path, not a sub-schema, so
set("gist", payload) mis-casts the inner subdoc arrays and fails
validation with 'Cast to [string] failed' at gist.files.0. Set each
subpath individually so the files/comments arrays cast correctly.
2026-05-06 16:52:48 +03:00
tdurieux dcb524c8c1 Improve error handling 2026-05-06 16:45:22 +03:00
tdurieux 3613c895c8 improve logging 2026-05-06 16:31:10 +03:00
tdurieux 873c910dd3 Improve error dashboard 2026-05-06 16:12:37 +03:00
tdurieux 6f418d6332 log AnonymousError context detail in serialized output
Surface the path/repo/url tied to an AnonymousError when it gets
serialized for logging — previously logs only carried name, message,
and httpStatus, which made file_not_found entries impossible to trace
back to a specific file or repo. Extract the existing detail formatting
out of toString() into a public detail() method, harden it against
AnonymizedFile getters that can throw, and have serializeError include
the result as a "detail" field.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 13:46:06 +03:00
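The guarded detail() pattern described in this commit can be sketched as follows. This is a minimal sketch, not the project's actual AnonymousError class: the real field names and serialized shape may differ, and `value` here is a hypothetical stand-in for the path/repo/url context object.

```typescript
// Sketch: surface an error's context in serialized logs, guarding against
// getters/toString implementations that can themselves throw.
class AnonymousError extends Error {
  constructor(
    message: string,
    public readonly value?: { toString(): string },
  ) {
    super(message);
    this.name = "AnonymousError";
  }

  // Public detail() so serialization can reuse the same formatting as toString().
  detail(): string | undefined {
    try {
      return this.value?.toString();
    } catch {
      return undefined; // a throwing getter must never break logging
    }
  }
}

function serializeError(err: AnonymousError) {
  return { name: err.name, message: err.message, detail: err.detail() };
}
```

With this shape, a `file_not_found` entry carries the offending path in its `detail` field instead of only name/message/httpStatus.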
tdurieux 06a098fba7 perf(deploy): faster builds and zero-downtime streamer rollover
- Multi-stage Dockerfile with BuildKit npm cache mounts and a separate
  prod-deps stage so source edits don't reinstall or prune.
- Tighter .dockerignore to shrink build context.
- Healthchecks: add start_period and tighten interval/retries so
  containers report healthy as soon as the process is actually ready
  instead of after a full polling interval.
- Move recoverStuckPreparing() off the startup critical path; the
  recovery sweep now runs in the background after app.listen.
- depends_on uses condition: service_healthy and the obsolete
  compose 'version' key is gone.
- New scripts/build.sh + scripts/deploy.sh: deploy.sh builds, exits
  early if the image is unchanged, runs a blue/green streamer swap
  (scale to 2N, wait healthy in parallel, drop olds), then recreates
  the API with --no-deps to avoid compose's depends_on re-poll.
2026-05-06 13:38:19 +03:00
tdurieux 371693dc3b fix: end stream for files <= 150 bytes in resolveLfsPointer
The LFS-pointer probe buffered up to 150 bytes before deciding whether
to forward the blob or swap to the raw URL. For blobs that fit entirely
in the probe, decide() ran from the source's end event and attached
data/end listeners to an already-ended stream, so out.end() was never
called. The response hung until upstream timed out and storage.write
left an incomplete cached copy, which then forced a re-fetch on every
subsequent read.

Pass a sourceEnded flag through decide() and end the output directly
when the source has already finished. Also skip the GitHub blob fetch
when the tree size is already over MAX_FILE_SIZE, surfacing
file_too_big instead of a translated 422.
2026-05-06 13:10:25 +03:00
tdurieux 7dd6d872e9 Improve error handling 2026-05-06 12:44:07 +03:00
tdurieux e1cf8e3a00 Improve error handling 2026-05-06 12:26:12 +03:00
tdurieux c4182b5b2a fix: prevent E11000 on concurrent getRepositoryFromGitHub
Replace check-then-insert with atomic findOneAndUpdate upsert keyed on
externalId, plus a single E11000 retry fallback. Eliminates the duplicate
key race when two requests resolve the same gh_<id> concurrently.
2026-05-06 12:22:42 +03:00
tdurieux c2d43164d0 error logging improvement, regex fix 2026-05-06 11:16:12 +03:00
dependabot[bot] e34f45522f chore(deps): bump uuid and bullmq (#691)
* chore(deps): bump uuid and bullmq

Removes [uuid](https://github.com/uuidjs/uuid). It's no longer used after updating ancestor dependency [bullmq](https://github.com/taskforcesh/bullmq). These dependencies need to be updated together.


Removes `uuid`

Updates `bullmq` from 2.4.0 to 5.76.5
- [Release notes](https://github.com/taskforcesh/bullmq/releases)
- [Commits](https://github.com/taskforcesh/bullmq/compare/v2.4.0...v5.76.5)

---
updated-dependencies:
- dependency-name: bullmq
  dependency-version: 5.76.5
  dependency-type: direct:production
- dependency-name: uuid
  dependency-version: 
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix(bullmq): adapt isRunning + getJobs typing for v5 API

Worker.isRunning became a method (was a property in v2), and
Queue.getJobs now requires a mutable JobType[] (was string[]).

* clean up

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tdurieux <durieuxthomas@hotmail.com>
Co-authored-by: Thomas Durieux <5577568+tdurieux@users.noreply.github.com>
2026-05-06 08:40:12 +02:00
tdurieux 7e0f900c7a fix: resolve eslint unused-var and useless-assignment warnings
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 08:22:45 +03:00
tdurieux 79f555769d improve binary file detection: content sniffing + jsonl support
Files like .jsonl that mime-types doesn't know fell through to
application/octet-stream and rendered as "Unsupported binary file" in
the viewer. Replace istextorbinary with isbinaryfile for content-based
detection, and use mime-types for name-based classification with a
textual application/* allowlist.

The streaming transformer now defers classification when the name is
inconclusive and sniffs the first chunk before emitting "transform",
so route.ts and AnonymizedFile.ts get a content-aware Content-Type.
Whitelists .jsonl and .ndjson to short-circuit dataset files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 07:52:48 +03:00
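The content-sniffing fallback in this commit can be illustrated with a deliberately simplified check. This is a stand-in for the isbinaryfile heuristic, not the library's actual logic: a NUL byte in the first chunk is a strong (though not exhaustive) binary signal.

```typescript
// Simplified content-based binary sniff on the first chunk of a stream.
// Real detection (isbinaryfile) uses a richer heuristic; this sketch only
// checks for a NUL byte in a bounded sample.
export function sniffLooksBinary(chunk: Buffer): boolean {
  const sample = chunk.subarray(0, 512); // bound the probe window
  return sample.includes(0); // NUL never appears in valid text encodings used here
}
```

A `.jsonl` first chunk passes this check as text, so deferring classification until the first chunk arrives gives the route a content-aware Content-Type even when the name-based lookup is inconclusive.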
tdurieux 27d6b56da7 fix multiple bugs 2026-05-05 12:34:03 +03:00
tdurieux c903627da0 fix build 2026-05-05 10:35:08 +03:00
tdurieux f8c91ca0af multiple fixes 2026-05-05 10:32:31 +03:00
tdurieux 5b72b630c4 fix: silent-truncation, token-refresh, and content-type bugs across hot paths
Follow-up review pass after the cache fixes turned up several bugs in
the same family — silent failures that look like success to the client,
plus content-correctness issues in the ZIP and per-file delivery paths.

- zipStream: stop calling archive.finalize() on upstream/parser errors.
  That produced a valid-looking ZIP (200 OK, archive opens) silently
  missing entries — same class as #694, but worse because the user has
  no signal anything went wrong. Destroy the response on failure
  instead so the client sees a connection drop.
- zipStream: apply per-repo image/pdf gates inside the entry handler.
  The single-file /file/... endpoint refuses to serve those types
  via AnonymizedFile.isFileSupported when image=false / pdf=false, but
  the ZIP shipped them anyway — privacy-relevant for maintainers who
  toggle image=false to suppress identifying screenshots. Threaded
  contentOptions through both ZIP entry points (direct and streamer).
- GitHubUtils.getToken: validate the OAuth token-refresh response
  before persisting. On a non-2xx response or a body without a string
  token, we used to overwrite the stored token with `undefined`, which
  then propagated as `Authorization: token undefined` to every API
  call — 401 even on public repos, with the config.GITHUB_TOKEN
  fallback unreachable because the field was no longer falsy.
- AnonymizedFile.send (streamer branch): forward Content-Type from the
  upstream streamer response. got.stream(...).pipe(res) carries body
  bytes only, so the parent response had no Content-Type and browsers
  guessed (text rendered as download, etc.). Also resolve on
  res.on("finish") in addition to "close" — keep-alive sockets stay
  open long after the response is delivered, delaying countView().
- Repository.updateIfNeeded: persist a renamed source.repositoryName
  even when the commit hasn't changed. Previously the new value lived
  in memory only and was overwritten on the next reload, so the
  rename detection ran every request.
- Repository.anonymize: stop materialising a dummy {path:"",name:""}
  FileModel for empty repos. That row collided with the special case
  in AnonymizedFile.getFileInfo and surfaced in unfiltered listings.
- streamer/route POST /: reject filePath segments containing ".." or
  empty parts. Defence in depth — the parent server validates against
  FileModel before calling, but the streamer joins filePath straight
  into the storage path, so any future caller forwarding an
  unvalidated path could traverse out of the repo root.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 09:19:05 +03:00
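The filePath guard from the last bullet can be sketched as a small pure check (hypothetical helper name; the streamer's actual validation may differ). Per the commit, it rejects `..` segments and empty parts before the path is joined into the storage root.

```typescript
// Defence-in-depth: refuse filePath values that could traverse out of the
// repo root once joined into a storage path. Rejects ".." segments and
// empty parts (which arise from leading/trailing/double slashes).
export function isSafeFilePath(filePath: string): boolean {
  const segments = filePath.split("/");
  return segments.every((s) => s !== "" && s !== "..");
}
```

The parent server already validates against FileModel; this check only protects against a future caller forwarding an unvalidated path.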
tdurieux f413a30313 fix(cache): make Zip-source caches atomic and robust to partial state
Follow-up to the GitHubStream cache fixes. The same poisoned-cache
class existed in the GitHubDownload path and a few related spots:

- GitHubDownload.download: wipe pre-existing state before extracting
  and write a .anon-complete marker only after a successful extract.
  On error, rm the partial cache so a retry starts clean. getFileContent
  and getFiles now gate on the marker instead of "any file/folder
  exists," so a half-extracted tree can never be served as canonical.
- GitHubDownload.getFileContent: validate cached file size against the
  upstream FileModel size (via the new AnonymizedFile.size()), same
  guard as GitHubStream. getFiles filters the marker from the listing.
- FileSystem.listFiles: drop the bogus stats.ino.toString() as sha.
  An inode isn't a content hash; anything comparing it to a Git blob
  sha would silently disagree. Leave undefined.
- S3.write: remove the fire-and-forget data.on("error") -> this.rm(...).
  Multipart Upload doesn't commit partial objects, so there was nothing
  to clean up, and the handler raced retries and could delete a
  previously-good object on a transient source-stream hiccup. The
  size-validated read path recovers from any other undersized objects.
- GitHubStream.resolveLfsPointer: drop the post-decision early-return
  in blobStream.on("error"). Currently redundant with the inner
  listener, but removes the future-refactor footgun.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 08:54:42 +03:00
tdurieux 9adff11e74 fix(cache): atomic file writes and size-validated cache reads
A failed/interrupted GitHub fetch could leave a 0-byte or truncated
file in the local cache. Subsequent reads happily streamed the empty
content as the file's body — visible to users as an "Empty file" with
HTTP 200. Reproduced on artifact-70B6/Lethe/configs.py (#694).

- FileSystem.write: stream into a sibling .tmp and rename into place
  only on finish. Stream errors discard the tmp and leave any prior
  cached file untouched. Drop the utf-8 encoding that was silently
  corrupting binary blobs.
- GitHubStream.getFileContentCache: accept an expected size and treat
  cached.size < expected as a poisoned cache (truncated fetch) → rm
  and re-fetch. cached.size >= expected is accepted, which keeps
  Git LFS-resolved files (whose FileModel.size is the pointer size)
  working.
- AnonymizedFile: expose size() and pass it through to the streamer
  alongside sha so the cache check has the upstream size.

Existing poisoned entries self-heal on next access.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 08:47:41 +03:00
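The tmp-then-rename pattern from the FileSystem.write bullet can be sketched like this. A minimal sketch assuming Node's stream/promises pipeline, not the project's actual FileSystem class; note there is no utf-8 encoding anywhere, so binary blobs pass through byte-exact.

```typescript
import { createWriteStream, promises as fsp } from "fs";
import { Readable } from "stream";
import { pipeline } from "stream/promises";

// Stream into a sibling ".tmp" file and rename into place only after the
// stream finishes. A failed/interrupted fetch discards the tmp file and
// leaves any previously cached file untouched, so a truncated body can
// never become the canonical cache entry.
export async function atomicWrite(path: string, source: Readable): Promise<void> {
  const tmp = path + ".tmp";
  try {
    await pipeline(source, createWriteStream(tmp));
    await fsp.rename(tmp, path); // rename is atomic on the same filesystem
  } catch (err) {
    await fsp.rm(tmp, { force: true }); // drop partial data
    throw err;
  }
}
```

Combined with the size-validated read path, existing poisoned entries self-heal: an undersized cached file is removed and re-fetched on next access.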
tdurieux 53959f677c fix(repo): reuse existing record on rename to avoid externalId duplicate key
When a GitHub repo is renamed and looked up by its new name, the lookup
by name misses but a record with the same externalId still exists,
causing E11000 on save. Fall back to a lookup by externalId before
creating a new document.

Fixes #500

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 00:59:38 +03:00
tdurieux dee406e2ea update design 2026-05-05 00:36:42 +03:00
tdurieux 49b124e188 fix regex 2026-05-05 00:07:49 +03:00
tdurieux f0bc53f093 feat: gist & co-authors 2026-05-04 13:10:44 +02:00
tdurieux f0f6436370 feat: resolve Git LFS pointers via the raw URL endpoint
Files tracked by Git LFS used to come out as the pointer text:

    version https://git-lfs.github.com/spec/v1
    oid sha256:...
    size ...

…because GitHub's blob API returns the pointer, not the resolved
content. Detect that prefix on the first ~150 bytes of the blob stream
and switch to a fresh fetch via the web raw URL
(github.com/<owner>/<repo>/raw/<commit>/<path>), which auto-redirects
to media.githubusercontent.com and resolves the LFS object — auth
header carries through. Non-LFS files are forwarded through the
existing pipeline unchanged.

Fixes #95.
2026-05-04 12:18:55 +02:00
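The pointer detection described above reduces to a prefix check on the probe bytes. A sketch under the commit's own assumption that LFS pointers are tiny ASCII files, so the ~150-byte probe always contains the full first line:

```typescript
// Git LFS pointer files begin with this exact spec line; seeing it in the
// first bytes of the blob stream is the signal to abandon the blob API
// response and re-fetch via the web raw URL instead.
const LFS_PREFIX = "version https://git-lfs.github.com/spec/v1";

export function looksLikeLfsPointer(firstBytes: Buffer): boolean {
  return firstBytes.toString("utf8").startsWith(LFS_PREFIX);
}
```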
tdurieux ef78e8ff3c feat: preserve raw bytes when anonymization is a no-op
When the anonymizer doesn't change a slice's text, the streamer used
to push Buffer.from(out, "utf8") — which loses any invalid-UTF-8 bytes
in the input (replaced by U+FFFD via StringDecoder). Files
mistakenly classified as text (binary blobs without a known extension,
text with stray non-UTF-8 bytes, BOMs) came out corrupted even though
nothing in the term list matched.

Track the raw chunk bytes alongside the decoded `pending`. On flush —
where we have every byte buffered — emit the original buffer directly
when the output equals the input, so a pure passthrough is bit-exact.
In the streaming OVERLAP path, do the same when the decode for that
slice round-trips losslessly; fall back to encoded output otherwise
(unchanged from before for that case).

Also add the "missing_content" locale entry for the
/api/anonymize-preview route.
2026-05-04 11:52:03 +02:00
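The flush-path idea above (emit the original buffer when the output equals the input) can be sketched as follows. `anonymize` here is a hypothetical stand-in for the real term scrubber, and this collapses the streaming OVERLAP subtleties into the simple whole-buffer case:

```typescript
// If anonymization leaves the decoded text unchanged, emit the original
// bytes rather than re-encoding. Decoding is lossy (invalid UTF-8 becomes
// U+FFFD), so re-encoding a no-op would corrupt binary-ish content; the
// raw buffer makes a pure passthrough bit-exact.
export function emitChunk(raw: Buffer, anonymize: (s: string) => string): Buffer {
  const decoded = raw.toString("utf8");
  const out = anonymize(decoded);
  if (out === decoded) return raw; // no-op: bit-exact passthrough
  return Buffer.from(out, "utf8"); // a real replacement must re-encode
}
```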
tdurieux 4aeecd0fdb fix: recognize LICENSE / COPYING / etc. as text files
istextorbinary returns null for filenames with no extension, and the
isTextFile() guard treated null as "not text" — so terms in LICENSE,
COPYING, AUTHORS, README (extensionless), CHANGELOG, NOTICE, and
similar conventional filenames went through the binary passthrough
in AnonymizeTransformer and were never anonymized.

Add a small whitelist of these names ahead of the istextorbinary call.

Fixes #493.
2026-05-04 11:34:54 +02:00
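The whitelist ahead of the istextorbinary call can be sketched like this. The names come from the commit; the exact list (and any case handling) in the codebase may differ:

```typescript
// Conventional extensionless filenames that must be classified as text
// before any extension- or content-based detection runs, so their contents
// go through the anonymizer instead of the binary passthrough.
const TEXT_FILENAMES = new Set([
  "LICENSE", "COPYING", "AUTHORS", "README", "CHANGELOG", "NOTICE",
]);

export function isKnownTextFilename(name: string): boolean {
  return TEXT_FILENAMES.has(name.toUpperCase());
}
```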
tdurieux f91db91cee wip 2026-05-04 11:30:42 +02:00
tdurieux 652299cda6 fix: follow GitHub repo rename via stored repository id
When a user renamed the original GitHub repository, anonymous_github
kept calling oct.repos.get({owner, repo}) with the cached old name and
got 404, marking the link broken even though the repository still
existed at a new path.

Recover the numeric GitHub id from the dbModel's externalId
("gh_<id>"). On a 404 from the name-based lookup, fall back to
GET /repositories/<id>, which returns the renamed repo. The caller
already updates source.repositoryName from r.full_name afterwards.

Fixes #409.
2026-05-04 11:15:32 +02:00
tdurieux 117406f2ce refactor: route anonymize preview through the backend
The form's live README/PR preview was running its own copy of
ContentAnonimizer in the browser. The two implementations had been
drifting — recent fixes for word boundaries (#175/#249), accent
matching (#280), custom replacements (#285), and the diacritic-stripped
variants only landed on the server. Reviewers saw one anonymization;
authors composing the form saw another.

Add POST /api/anonymize-preview that takes a snippet (or a batch) plus
the user's options and runs them through the same ContentAnonimizer
the file route uses. Replace the client-side anonymizeReadme() body
with a debounced call to that endpoint. The PR view's
anonymizePrContent() runs as a synchronous template expression, so it
now reads from a {original -> anonymized} cache that's refreshed in
the background whenever the PR details, terms, or options change.

Single-flight + debounce keep the form responsive; an in-flight
request is dropped on the next change.
2026-05-04 11:05:50 +02:00
tdurieux 3f095f0734 always use githubStream 2026-05-04 10:47:58 +02:00
tdurieux 4bc83db416 feat: per-term anonymization output via term=>replacement
A term entered as "Anonymous=>ABC" now scrubs "Anonymous" to "ABC"
instead of "XXXX-N". Lets users keep anonymized identifiers valid in
source code (no hyphen) and align tokens between paper text and repo.
Indexing for default-mask terms is unchanged: a list of
"Alpha=>AAA", "Beta" still produces XXXX-2 for Beta.

Fixes #285.
2026-05-04 09:31:31 +02:00
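The term parsing described above can be sketched as a pure function (hypothetical helper; the real parser lives inside the anonymizer). Indexing counts every term in list order, which is why "Alpha=>AAA", "Beta" still yields XXXX-2 for Beta:

```typescript
// Parse a term entry: "term=>replacement" uses the custom replacement,
// anything else gets the default positional mask XXXX-N. N is the entry's
// 1-based position in the full list, so custom entries don't renumber
// default-mask entries that follow them.
export function parseTerm(
  entry: string,
  index: number,
): { term: string; replacement: string } {
  const sep = entry.indexOf("=>");
  if (sep > 0) {
    return { term: entry.slice(0, sep), replacement: entry.slice(sep + 2) };
  }
  return { term: entry, replacement: `XXXX-${index + 1}` };
}
```

Custom replacements without a hyphen keep anonymized identifiers valid in source code.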
tdurieux a30ab7fb96 fix: don't declare Accept-Ranges: none for binary files
The server set Accept-Ranges: none on every file response. For text we
anonymize on the fly so byte ranges aren't meaningful, but binary
entries pass through unchanged — and the explicit "none" header makes
some browsers refuse to play <video>/<audio> elements that would
otherwise fall back to a full download. Newly uploaded MP4s under the
inline-preview threshold rendered as a blank progress bar (#538).

Only set Accept-Ranges: none for text entries; let binary entries omit
it so the standard fallback kicks in.

Fixes #538.
2026-05-03 21:23:59 +02:00
tdurieux 88fe8570fd fix: include file path in cache ETag
Without the path, two different files in the same repo (same sha, same
anonymization options) shared an ETag. If a browser ever sent the cached
ETag for one file while requesting another, the server would have
returned 304 against the wrong cache entry. Fold the path into the
ETag so each file has its own fingerprint.

Follow-up to b3c1030 (#439).
2026-05-03 21:19:39 +02:00
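Together with the previous commit, the ETag now fingerprints path, upstream sha, and the saved anonymization options. A sketch of that scheme, assuming a SHA-256 digest with NUL separators (the real key derivation may differ):

```typescript
import { createHash } from "crypto";

// Fold path + upstream sha + anonymization options into one fingerprint:
// two files sharing a blob sha get distinct ETags, and editing the term
// list (or image/link options) invalidates the tag immediately.
export function buildEtag(path: string, sha: string, options: object): string {
  const h = createHash("sha256");
  h.update(path).update("\0").update(sha).update("\0").update(JSON.stringify(options));
  return `"${h.digest("hex")}"`; // ETags are quoted strings on the wire
}
```

Served with `Cache-Control: no-cache, must-revalidate`, the browser revalidates every request and gets a 304 only while the fingerprint is unchanged.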
tdurieux b3c1030e5c fix: revalidate cached files when anonymization options change
Files were being served with Cache-Control: max-age=18144000 (210 days)
keyed only on the upstream ?v=<sha>. Editing the term list left the
same URL serving stale anonymized bytes — visible to users in regular
tabs but not in incognito. The previous fix-by-incognito recipe in #439
is exactly this.

Switch to ETag-based revalidation that fingerprints both the upstream
sha and the saved anonymization options, with Cache-Control:
no-cache, must-revalidate. Browsers now revalidate on every request and
get a 304 when nothing has changed, or fresh content as soon as terms,
image/link/etc. options are updated.

Fixes #439.
2026-05-03 20:31:32 +02:00
tdurieux 2eb19904db fix: anonymize accented and unaccented variants of the same term
When a user added "Davó" to the term list, "Davo" elsewhere in the
content was left untouched (and vice versa). Each term now also runs a
diacritic-insensitive pass: ASCII Latin letters expand to a class
covering common accented siblings, with Unicode-aware lookaround
boundaries so the trailing boundary still fires next to "ó" etc.

Pure helpers moved into src/core/term-matching so the test file can
import them instead of duplicating the logic.

Fixes #280.
2026-05-03 20:18:49 +02:00
tdurieux d8b129c670 fix: anonymize entries when downloading the full repo as a zip
The streaming zip pipeline was constructing AnonymizeTransformer first and
then assigning opt.filePath afterwards. AnonymizeTransformer determines
isText in its constructor from opt.filePath, so every entry was classified
as binary and passed through unchanged — the downloaded zip leaked the
original (un-anonymized) terms even though the web view scrubbed them.

Pass filePath via the constructor so isText is computed correctly.

Fixes #342, #349.
2026-05-03 19:47:10 +02:00
tdurieux 9313c42fcf fix: match terms whose edges are non-word characters
Wrapping every user term as `\b${term}\b` silently dropped matches when
the term started or ended with a non-word char (e.g. `@tdurieux`,
`@author .*`), because JS `\b` only fires at a word/non-word transition.

Replace with `withWordBoundaries()`, which only emits `\b` on the side
where the term has a word-char edge.

Fixes #175, #249.
2026-05-03 18:35:00 +02:00
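The `withWordBoundaries()` fix can be sketched as follows — a sketch of the idea, not the project's exact implementation: emit `\b` only on the side where the term's edge is a word character, since JS `\b` only fires at a word/non-word transition.

```typescript
// Wrap a user term in word boundaries only where they can actually match.
// `\b@tdurieux\b` never matches "ping @tdurieux" because space->"@" is not
// a word/non-word transition; omitting the leading \b for a non-word edge
// restores the match while still preventing substring hits like
// "cat" inside "concatenate".
export function withWordBoundaries(term: string): RegExp {
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const lead = /^\w/.test(term) ? "\\b" : "";
  const trail = /\w$/.test(term) ? "\\b" : "";
  return new RegExp(lead + escaped + trail, "g");
}
```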
tdurieux a5f66d6844 multiple fixes 2026-05-03 15:30:54 +02:00
tdurieux 88f826aab4 update design 2026-04-24 14:55:18 +02:00