Commit Graph

49 Commits

Author SHA1 Message Date
tdurieux 53959f677c fix(repo): reuse existing record on rename to avoid externalId duplicate key
When a GitHub repo is renamed and looked up by its new name, the lookup
by name misses but a record with the same externalId still exists,
causing E11000 on save. Fall back to a lookup by externalId before
creating a new document.

Fixes #500

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 00:59:38 +03:00
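
The lookup order described in this commit can be sketched as follows. This is a minimal in-memory illustration, not the project's code: `RepoStore`, `RepoRecord`, and `upsert` are hypothetical names, and the real fix operates on a database model rather than an array.

```typescript
// Hypothetical in-memory stand-in for the repository collection.
interface RepoRecord {
  externalId: string; // unique key, e.g. "gh_12345"
  name: string; // "owner/repo"
}

class RepoStore {
  private records: RepoRecord[] = [];

  findByName(name: string): RepoRecord | undefined {
    return this.records.find((r) => r.name === name);
  }

  findByExternalId(externalId: string): RepoRecord | undefined {
    return this.records.find((r) => r.externalId === externalId);
  }

  // Look up by name first, then by externalId, and only then create.
  upsert(externalId: string, name: string): RepoRecord {
    const existing =
      this.findByName(name) ?? this.findByExternalId(externalId);
    if (existing) {
      existing.name = name; // pick up the new name after a rename
      return existing;
    }
    const created: RepoRecord = { externalId, name };
    this.records.push(created);
    return created;
  }

  count(): number {
    return this.records.length;
  }
}
```

Calling `upsert("gh_1", "alice/old")` and then `upsert("gh_1", "alice/new")` leaves a single record whose name is the new one, instead of attempting a second insert with the same externalId.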
tdurieux dee406e2ea update design 2026-05-05 00:36:42 +03:00
tdurieux 49b124e188 fix regex 2026-05-05 00:07:49 +03:00
tdurieux f0bc53f093 feat: gist & co-authors 2026-05-04 13:10:44 +02:00
tdurieux f0f6436370 feat: resolve Git LFS pointers via the raw URL endpoint
Files tracked by Git LFS used to come out as the pointer text:

    version https://git-lfs.github.com/spec/v1
    oid sha256:...
    size ...

…because GitHub's blob API returns the pointer, not the resolved
content. Detect that prefix on the first ~150 bytes of the blob stream
and switch to a fresh fetch via the web raw URL
(github.com/<owner>/<repo>/raw/<commit>/<path>), which auto-redirects
to media.githubusercontent.com and resolves the LFS object — auth
header carries through. Non-LFS files are forwarded through the
existing pipeline unchanged.

Fixes #95.
2026-05-04 12:18:55 +02:00
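
A rough sketch of the pointer detection described above, assuming the head of the blob is available as a `Uint8Array`. The function names `looksLikeLfsPointer` and `rawUrl` are hypothetical; only the pointer prefix and the raw URL shape come from the commit message.

```typescript
// First line of every Git LFS pointer file.
const LFS_PREFIX = "version https://git-lfs.github.com/spec/v1";

// Returns true when the first bytes of a blob look like a Git LFS pointer.
// Only the first 150 bytes are inspected.
function looksLikeLfsPointer(head: Uint8Array): boolean {
  const text = new TextDecoder("utf-8").decode(head.subarray(0, 150));
  return text.startsWith(LFS_PREFIX);
}

// Hypothetical helper: builds the web raw URL used for the second fetch.
// Path segments are not escaped here; a real implementation would need to.
function rawUrl(
  owner: string,
  repo: string,
  commit: string,
  path: string
): string {
  return `https://github.com/${owner}/${repo}/raw/${commit}/${path}`;
}
```

Non-pointer content returns `false` and would continue through the existing pipeline untouched.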
tdurieux ef78e8ff3c feat: preserve raw bytes when anonymization is a no-op
When the anonymizer doesn't change a slice's text, the streamer used
to push Buffer.from(out, "utf8") — which loses any invalid-UTF-8 bytes
in the input (replaced by U+FFFD via StringDecoder). Files
mistakenly classified as text (binary blobs without a known extension,
text with stray non-UTF-8 bytes, BOMs) came out corrupted even though
nothing in the term list matched.

Track the raw chunk bytes alongside the decoded `pending`. On flush —
where we have every byte buffered — emit the original buffer directly
when the output equals the input, so a pure passthrough is bit-exact.
In the streaming OVERLAP path, do the same when the decode for that
slice round-trips losslessly; fall back to encoded output otherwise
(unchanged from before for that case).

Also add the "missing_content" locale entry for the
/api/anonymize-preview route.
2026-05-04 11:52:03 +02:00
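
The passthrough idea can be illustrated with the standard `TextDecoder`/`TextEncoder` pair. This is a sketch under assumptions: `emitOnFlush` and `roundTripsLosslessly` are hypothetical names, and the real streamer works with Node buffers and a `StringDecoder` rather than whole-buffer decoding.

```typescript
// True when decoding `raw` as UTF-8 and re-encoding yields the same bytes.
// Invalid sequences decode to U+FFFD and a leading BOM is dropped by the
// decoder, so both cases fail this check.
function roundTripsLosslessly(raw: Uint8Array): boolean {
  const decoded = new TextDecoder("utf-8").decode(raw);
  const encoded = new TextEncoder().encode(decoded);
  if (encoded.length !== raw.length) return false;
  return encoded.every((byte, i) => byte === raw[i]);
}

// Flush path: every byte is buffered, so when the anonymizer changed
// nothing the original bytes are emitted as-is (bit-exact passthrough).
function emitOnFlush(
  raw: Uint8Array,
  anonymize: (text: string) => string
): Uint8Array {
  const input = new TextDecoder("utf-8").decode(raw);
  const output = anonymize(input);
  if (output === input) return raw;
  return new TextEncoder().encode(output);
}
```

With an identity anonymizer, `emitOnFlush` returns the very same buffer even when it contains stray non-UTF-8 bytes, while `roundTripsLosslessly` is the kind of check the streaming path could use before deciding to forward raw bytes.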
tdurieux 4aeecd0fdb fix: recognize LICENSE / COPYING / etc. as text files
istextorbinary returns null for filenames with no extension, and the
isTextFile() guard treated null as "not text" — so terms in LICENSE,
COPYING, AUTHORS, README (extensionless), CHANGELOG, NOTICE, and
similar conventional filenames went through the binary passthrough
in AnonymizeTransformer and were never anonymized.

Add a small whitelist of these names ahead of the istextorbinary call.

Fixes #493.
2026-05-04 11:34:54 +02:00
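
A minimal sketch of the whitelist check, assuming the detector is passed in as a callback that may return `null` (as the commit says istextorbinary does for extensionless names). The set contents mirror the names listed in the message; `KNOWN_TEXT_NAMES` and the case-insensitive comparison are assumptions of this sketch.

```typescript
// Conventional extensionless filenames that should be treated as text.
const KNOWN_TEXT_NAMES = new Set([
  "LICENSE",
  "COPYING",
  "AUTHORS",
  "README",
  "CHANGELOG",
  "NOTICE",
]);

// `detect` stands in for the extension-based detector; null means "unknown".
function isTextFile(
  filePath: string,
  detect: (name: string) => boolean | null
): boolean {
  const base = filePath.split("/").pop() ?? filePath;
  // Check the whitelist before asking the detector.
  if (KNOWN_TEXT_NAMES.has(base.toUpperCase())) return true;
  return detect(base) === true;
}
```

Even with a detector that always answers `null`, `isTextFile("docs/LICENSE", detect)` is `true`, so the file reaches the anonymizing path instead of the binary passthrough.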
tdurieux 652299cda6 fix: follow GitHub repo rename via stored repository id
When a user renamed the original GitHub repository, anonymous_github
kept calling oct.repos.get({owner, repo}) with the cached old name and
got 404, marking the link broken even though the repository still
existed at a new path.

Recover the numeric GitHub id from the dbModel's externalId
("gh_<id>"). On a 404 from the name-based lookup, fall back to
GET /repositories/<id>, which returns the renamed repo. The caller
already updates source.repositoryName from r.full_name afterwards.

Fixes #409.
2026-05-04 11:15:32 +02:00
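
The fallback can be sketched like this. The two fetchers are injected as callbacks so the example stays self-contained; `githubIdFromExternalId` and `getRepoFollowingRename` are hypothetical names, and the assumption that a failed lookup exposes a numeric `status` field is labeled in the code.

```typescript
// Extracts the numeric GitHub id from an externalId of the form "gh_<id>".
function githubIdFromExternalId(externalId: string): number | null {
  const match = /^gh_(\d+)$/.exec(externalId);
  return match ? Number(match[1]) : null;
}

// Tries the name-based lookup first; on a 404, retries by numeric id.
// Assumption: failed lookups throw an object with a numeric `status`.
async function getRepoFollowingRename<T>(
  byName: () => Promise<T>,
  byId: (id: number) => Promise<T>,
  externalId: string
): Promise<T> {
  try {
    return await byName();
  } catch (err) {
    const id = githubIdFromExternalId(externalId);
    const status = (err as { status?: number }).status;
    if (status === 404 && id !== null) {
      return byId(id); // corresponds to GET /repositories/<id>
    }
    throw err;
  }
}
```

Errors other than 404, or records without a parsable id, are rethrown unchanged.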
tdurieux 3f095f0734 always use githubStream 2026-05-04 10:47:58 +02:00
tdurieux 4bc83db416 feat: per-term anonymization output via term=>replacement
A term entered as "Anonymous=>ABC" now scrubs "Anonymous" to "ABC"
instead of "XXXX-N". Lets users keep anonymized identifiers valid in
source code (no hyphen) and align tokens between paper text and repo.
Indexing for default-mask terms is unchanged: a list of
"Alpha=>AAA", "Beta" still produces XXXX-2 for Beta.

Fixes #285.
2026-05-04 09:31:31 +02:00
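
One way to read the indexing rule: the default mask number is the term's 1-based position in the full list, whether or not earlier entries carry their own replacement. A small sketch under that assumption (`parseTerms` and `TermRule` are hypothetical names):

```typescript
interface TermRule {
  term: string;
  replacement: string;
}

// Splits "term=>replacement" entries; plain entries get the default mask
// "XXXX-N", where N is the 1-based position in the list.
function parseTerms(lines: string[]): TermRule[] {
  return lines.map((line, i) => {
    const sep = line.indexOf("=>");
    if (sep === -1) {
      return { term: line, replacement: `XXXX-${i + 1}` };
    }
    return { term: line.slice(0, sep), replacement: line.slice(sep + 2) };
  });
}
```

For the list `["Alpha=>AAA", "Beta"]` this yields `Alpha` mapped to `AAA` and `Beta` mapped to `XXXX-2`, matching the example in the commit message.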
tdurieux a30ab7fb96 fix: don't declare Accept-Ranges: none for binary files
The server set Accept-Ranges: none on every file response. For text we
anonymize on the fly so byte ranges aren't meaningful, but binary
entries pass through unchanged — and the explicit "none" header makes
some browsers refuse to play <video>/<audio> elements that would
otherwise fall back to a full download. Newly uploaded MP4s under the
inline-preview threshold rendered as a blank progress bar (#538).

Only set Accept-Ranges: none for text entries; let binary entries omit
it so the standard fallback kicks in.

Fixes #538.
2026-05-03 21:23:59 +02:00
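
In code, the change amounts to making the header conditional on the entry type. A tiny sketch with a hypothetical helper name, `rangeHeaders`:

```typescript
// Text entries are anonymized on the fly, so byte ranges are refused.
// Binary entries omit the header and keep the default behavior.
function rangeHeaders(isText: boolean): Record<string, string> {
  return isText ? { "Accept-Ranges": "none" } : {};
}
```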
tdurieux 2eb19904db fix: anonymize accented and unaccented variants of the same term
When a user added "Davó" to the term list, "Davo" elsewhere in the
content was left untouched (and vice versa). Each term now also runs a
diacritic-insensitive pass: ASCII Latin letters expand to a class
covering common accented siblings, with Unicode-aware lookaround
boundaries so the trailing boundary still fires next to "ó" etc.

Pure helpers moved into src/core/term-matching so the test file can
import them instead of duplicating the logic.

Fixes #280.
2026-05-03 20:18:49 +02:00
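
A sketch of what a diacritic-insensitive pattern could look like. The accent table below is a small illustrative subset, and `diacriticInsensitive` is a hypothetical name; the helpers in src/core/term-matching may differ. The sketch assumes the text being searched is in composed (NFC) form.

```typescript
// Illustrative subset: base letter -> itself plus common accented siblings.
const ACCENTED: Record<string, string> = {
  a: "aàáâãäå",
  c: "cç",
  e: "eèéêë",
  i: "iìíîï",
  n: "nñ",
  o: "oòóôõö",
  u: "uùúûü",
  y: "yýÿ",
};

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a regex that matches the term with or without accents, using
// Unicode-aware lookarounds instead of \b so that a boundary is still
// recognized next to letters such as "ó".
function diacriticInsensitive(term: string): RegExp {
  // Strip combining marks so "Davó" and "Davo" yield the same pattern.
  const base = term.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
  const body = Array.from(base)
    .map((ch) => {
      const lower = ch.toLowerCase();
      const siblings = ACCENTED[lower];
      if (!siblings) return escapeRegExp(ch);
      // Preserve the case of the original letter.
      return `[${ch === lower ? siblings : siblings.toUpperCase()}]`;
    })
    .join("");
  return new RegExp(`(?<![\\p{L}\\p{N}_])${body}(?![\\p{L}\\p{N}_])`, "gu");
}
```

Used with `String.prototype.replace`, the pattern built from "Davó" replaces both "Davo" and "Davó" but leaves a longer word such as "Davos" alone.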
tdurieux d8b129c670 fix: anonymize entries when downloading the full repo as a zip
The streaming zip pipeline was constructing AnonymizeTransformer first and
then assigning opt.filePath afterwards. AnonymizeTransformer determines
isText in its constructor from opt.filePath, so every entry was classified
as binary and passed through unchanged — the downloaded zip leaked the
original (un-anonymized) terms even though the web view scrubbed them.

Pass filePath via the constructor so isText is computed correctly.

Fixes #342, #349.
2026-05-03 19:47:10 +02:00
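
The ordering problem is easy to reproduce in isolation. The class below is a simplified stand-in, not the real AnonymizeTransformer: `SketchTransformer` and its extension-based check are assumptions made for illustration.

```typescript
class SketchTransformer {
  readonly isText: boolean;

  constructor(readonly opt: { filePath?: string }) {
    // Classification happens once, here. Assigning opt.filePath after
    // construction has no effect on isText.
    this.isText = /\.(txt|md|ts|js|json)$/i.test(opt.filePath ?? "");
  }
}

// Buggy order: construct first, assign the path afterwards.
function buildLate(filePath: string): SketchTransformer {
  const t = new SketchTransformer({});
  t.opt.filePath = filePath;
  return t;
}

// Fixed order: pass the path through the constructor.
function buildEarly(filePath: string): SketchTransformer {
  return new SketchTransformer({ filePath });
}
```

`buildLate("README.md").isText` is `false` while `buildEarly("README.md").isText` is `true`, which mirrors why every zip entry was treated as binary before the fix.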
tdurieux 9313c42fcf fix: match terms whose edges are non-word characters
Wrapping every user term as `\b${term}\b` silently dropped matches when
the term started or ended with a non-word char (e.g. `@tdurieux`,
`@author .*`), because JS `\b` only fires at a word/non-word transition.

Replace with `withWordBoundaries()`, which only emits `\b` on the side
where the term has a word-char edge.

Fixes #175, #249.
2026-05-03 18:35:00 +02:00
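
A possible shape for such a helper is sketched below. Only the name `withWordBoundaries` comes from the commit message; the body is a guess that inspects the first and last character of the term, and it does not try to handle terms that end in a regex escape such as `\w`.

```typescript
// Adds \b only on the side(s) where the term starts or ends with a
// word character ([A-Za-z0-9_]); other edges are left unanchored.
function withWordBoundaries(term: string): string {
  const lead = /^\w/.test(term) ? "\\b" : "";
  const trail = /\w$/.test(term) ? "\\b" : "";
  return `${lead}${term}${trail}`;
}
```

For `@tdurieux` this produces `@tdurieux\b`, which matches in "by @tdurieux here", whereas the previous `\b@tdurieux\b` cannot match there because a space followed by `@` is not a word/non-word transition.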
tdurieux a5f66d6844 multiple fixes 2026-05-03 15:30:54 +02:00
Thomas Durieux 188066e91d Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) 2026-04-15 09:41:00 +02:00
Thomas Durieux 8198a4b44a Standardize error responses with consistent format and human-readable messages (#667) 2026-04-15 09:27:08 +02:00
Thomas Durieux f4209110c7 Fix all 93 ESLint issues (3 errors, 90 warnings) (#666) 2026-04-15 09:04:22 +02:00
Thomas Durieux 655ae92c4c Remove OpenTelemetry tracing infrastructure (#662) 2026-04-15 04:39:08 +02:00
Thomas Durieux f3641c8ce3 Set up CI with ESLint linter and Mocha test runner (#661) 2026-04-15 04:34:03 +02:00
Thomas Durieux b2d77faa6c try to fix repo access 2025-04-01 22:27:41 +02:00
tdurieux f81c63d2af fix: improve perf of getToken 2024-06-19 10:16:38 +02:00
tdurieux 532c094388 fix: improve token management 2024-06-18 12:00:53 +02:00
tdurieux dcf483ea03 feat: improve download anonymized repository 2024-05-06 11:52:32 +02:00
tdurieux ca04339529 feat: list files in folder in webview 2024-05-02 11:49:00 +01:00
tdurieux 17abc47d08 fix: fix webview on root repo 2024-04-28 08:08:39 +01:00
tdurieux 17cb1f294f chore: remove all references to originalFiles 2024-04-27 18:04:51 +01:00
tdurieux 3d3a03fd04 add index on path and repoId for files 2024-04-27 17:49:13 +01:00
tdurieux 378942a28e fix: fix file list collection 2024-04-27 17:05:39 +01:00
tdurieux 2a145730b7 Improve log and GH token validation 2024-04-27 16:19:33 +01:00
tdurieux 6476899764 fix: fix webview 2024-04-26 14:01:46 +01:00
tdurieux a86e050f8b fix: handle empty repository 2024-04-26 13:48:32 +01:00
tdurieux 8712746e93 feat: check if file list exists when checking if repo is ready 2024-04-26 13:10:09 +01:00
tdurieux b0fa5e6689 fix: hot fix, replace repoID by repoId 2024-04-26 12:40:56 +01:00
tdurieux a9fefcc970 chore: remove console.log 2024-04-26 10:51:43 +01:00
tdurieux 710f7328e7 feat: flatten file tree for better performance 2024-04-26 10:32:09 +01:00
tdurieux f65d167532 fix: use correct hostname for the streamer 2024-04-11 21:38:24 +01:00
tdurieux 6b9574add3 fix: improve repository rename 2024-04-11 17:13:01 +01:00
tdurieux 61c6a79949 feat: check repo status before update 2024-04-11 15:25:45 +01:00
tdurieux 389030adc9 refactor: rely more on the db instead of querying GH 2024-04-06 15:15:08 +01:00
tdurieux 593dbed822 fix: attempt at avoiding double save 2024-04-06 09:33:10 +01:00
tdurieux 80101f83aa fix: fix dns lookup for dnsrr 2024-04-05 14:48:40 +01:00
tdurieux f5d45394bf feat: add sha to file path to avoid caching 2024-04-04 15:56:38 +01:00
tdurieux 3cbf78beb8 fix: fix execution order in removeCache 2024-04-04 08:44:59 +01:00
tdurieux 6e8d006220 fix: fix filePath when anonymized 2024-04-03 18:57:52 +01:00
tdurieux 1d4bab7866 fix: fix webview & improve download progress 2024-04-03 18:25:33 +01:00
tdurieux 83c55fdfbf fix: typo 2024-04-03 13:27:05 +01:00
tdurieux db67f53b2c fix: fix GitHubDownload 2024-04-03 13:24:34 +01:00
tdurieux 4d12641c7e feat: introduce streamers that handle the stream and anonymization from github 2024-04-03 11:13:01 +01:00