When a GitHub repo is renamed and looked up by its new name, the lookup
by name misses but a record with the same externalId still exists,
causing E11000 on save. Fall back to a lookup by externalId before
creating a new document.
Fixes#500
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Files tracked by Git LFS used to come out as the pointer text:
version https://git-lfs.github.com/spec/v1
oid sha256:...
size ...
…because GitHub's blob API returns the pointer, not the resolved
content. Detect that prefix on the first ~150 bytes of the blob stream
and switch to a fresh fetch via the web raw URL
(github.com/<owner>/<repo>/raw/<commit>/<path>), which auto-redirects
to media.githubusercontent.com and resolves the LFS object — auth
header carries through. Non-LFS files are forwarded through the
existing pipeline unchanged.
Fixes#95.
When the anonymizer doesn't change a slice's text, the streamer used
to push Buffer.from(out, "utf8") — which loses any invalid-UTF-8 bytes
in the input (replaced by U+FFFD via StringDecoder). Files
mistakenly classified as text (binary blobs without a known extension,
text with stray non-UTF-8 bytes, BOMs) came out corrupted even though
nothing in the term list matched.
Track the raw chunk bytes alongside the decoded `pending`. On flush —
where we have every byte buffered — emit the original buffer directly
when the output equals the input, so a pure passthrough is bit-exact.
In the streaming OVERLAP path, do the same when the decode for that
slice round-trips losslessly; fall back to encoded output otherwise
(unchanged from before for that case).
Also add the "missing_content" locale entry for the
/api/anonymize-preview route.
istextorbinary returns null for filenames with no extension, and the
isTextFile() guard treated null as "not text" — so terms in LICENSE,
COPYING, AUTHORS, README (extensionless), CHANGELOG, NOTICE, and
similar conventional filenames went through the binary passthrough
in AnonymizeTransformer and were never anonymized.
Add a small whitelist of these names ahead of the istextorbinary call.
Fixes#493.
When a user renamed the original GitHub repository, anonymous_github
kept calling oct.repos.get({owner, repo}) with the cached old name and
got 404, marking the link broken even though the repository still
existed at a new path.
Recover the numeric GitHub id from the dbModel's externalId
("gh_<id>"). On a 404 from the name-based lookup, fall back to
GET /repositories/<id>, which returns the renamed repo. The caller
already updates source.repositoryName from r.full_name afterwards.
Fixes#409.
A term entered as "Anonymous=>ABC" now scrubs "Anonymous" to "ABC"
instead of "XXXX-N". Lets users keep anonymized identifiers valid in
source code (no hyphen) and align tokens between paper text and repo.
Indexing for default-mask terms is unchanged: a list of
"Alpha=>AAA", "Beta" still produces XXXX-2 for Beta.
Fixes#285.
The server set Accept-Ranges: none on every file response. For text we
anonymize on the fly so byte ranges aren't meaningful, but binary
entries pass through unchanged — and the explicit "none" header makes
some browsers refuse to play <video>/<audio> elements that would
otherwise fall back to a full download. Newly uploaded MP4s under the
inline-preview threshold rendered as a blank progress bar (#538).
Only set Accept-Ranges: none for text entries; let binary entries omit
it so the standard fallback kicks in.
Fixes#538.
When a user added "Davó" to the term list, "Davo" elsewhere in the
content was left untouched (and vice versa). Each term now also runs a
diacritic-insensitive pass: ASCII Latin letters expand to a class
covering common accented siblings, with Unicode-aware lookaround
boundaries so the trailing boundary still fires next to "ó" etc.
Pure helpers moved into src/core/term-matching so the test file can
import them instead of duplicating the logic.
Fixes#280.
The streaming zip pipeline was constructing AnonymizeTransformer first and
then assigning opt.filePath afterwards. AnonymizeTransformer determines
isText in its constructor from opt.filePath, so every entry was classified
as binary and passed through unchanged — the downloaded zip leaked the
original (un-anonymized) terms even though the web view scrubbed them.
Pass filePath via the constructor so isText is computed correctly.
Fixes#342, #349.
Wrapping every user term as `\b${term}\b` silently dropped matches when
the term started or ended with a non-word char (e.g. `@tdurieux`,
`@author .*`), because JS `\b` only fires at a word/non-word transition.
Replace with `withWordBoundaries()`, which only emits `\b` on the side
where the term has a word-char edge.
Fixes#175, #249.