CalvinBackup/anonymous_github - anonymous_github - MS-GitHub-Backup (Gitea)

Mirror of https://github.com/tdurieux/anonymous_github.git, synced 2026-05-15 14:38:03 +02:00

Author	SHA1	Message	Date
tdurieux	79f555769d	improve binary file detection: content sniffing + jsonl support Files like .jsonl that mime-types doesn't know fell through to application/octet-stream and rendered as "Unsupported binary file" in the viewer. Replace istextorbinary with isbinaryfile for content-based detection, and use mime-types for name-based classification with a textual application/* allowlist. The streaming transformer now defers classification when the name is inconclusive and sniffs the first chunk before emitting "transform", so route.ts and AnonymizedFile.ts get a content-aware Content-Type. Whitelists .jsonl and .ndjson to short-circuit dataset files. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-06 07:52:48 +03:00
tdurieux	4bc83db416	feat: per-term anonymization output via term=>replacement A term entered as "Anonymous=>ABC" now scrubs "Anonymous" to "ABC" instead of "XXXX-N". Lets users keep anonymized identifiers valid in source code (no hyphen) and align tokens between paper text and repo. Indexing for default-mask terms is unchanged: a list of "Alpha=>AAA", "Beta" still produces XXXX-2 for Beta. Fixes #285.	2026-05-04 09:31:31 +02:00
tdurieux	2eb19904db	fix: anonymize accented and unaccented variants of the same term When a user added "Davó" to the term list, "Davo" elsewhere in the content was left untouched (and vice versa). Each term now also runs a diacritic-insensitive pass: ASCII Latin letters expand to a class covering common accented siblings, with Unicode-aware lookaround boundaries so the trailing boundary still fires next to "ó" etc. Pure helpers moved into src/core/term-matching so the test file can import them instead of duplicating the logic. Fixes #280.	2026-05-03 20:18:49 +02:00
tdurieux	d8b129c670	fix: anonymize entries when downloading the full repo as a zip The streaming zip pipeline was constructing AnonymizeTransformer first and then assigning opt.filePath afterwards. AnonymizeTransformer determines isText in its constructor from opt.filePath, so every entry was classified as binary and passed through unchanged — the downloaded zip leaked the original (un-anonymized) terms even though the web view scrubbed them. Pass filePath via the constructor so isText is computed correctly. Fixes #342, #349.	2026-05-03 19:47:10 +02:00
tdurieux	9313c42fcf	fix: match terms whose edges are non-word characters Wrapping every user term as `\b${term}\b` silently dropped matches when the term started or ended with a non-word char (e.g. `@tdurieux`, `@author .*`), because JS `\b` only fires at a word/non-word transition. Replace with `withWordBoundaries()`, which only emits `\b` on the side where the term has a word-char edge. Fixes #175, #249.	2026-05-03 18:35:00 +02:00
tdurieux	a5f66d6844	multiple fixes	2026-05-03 15:30:54 +02:00
Thomas Durieux	188066e91d	Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669)	2026-04-15 09:41:00 +02:00
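
The term-matching behaviour described in commits 9313c42fcf and 4bc83db416 can be illustrated with a minimal sketch. This is not the repository's code: only the name `withWordBoundaries`, the `term=>replacement` syntax, and the `XXXX-N` default mask come from the commit messages. `parseTerm`, `anonymize`, and the exact handling of whitespace and regex flags are assumptions made for illustration. Terms are treated as regular-expression source, as the `@author .*` example in the commit message suggests.

```typescript
// Hypothetical sketch of the behaviour described in the commit messages.
// Only withWordBoundaries, "term=>replacement" and "XXXX-N" are named in the
// source; every other name and detail here is an assumption.

/** Emit \b only on the side where the term has a word-character edge. */
function withWordBoundaries(term: string): string {
  const isWordChar = (ch: string): boolean => /^\w$/.test(ch);
  const lead = isWordChar(term.charAt(0)) ? "\\b" : "";
  const trail = isWordChar(term.charAt(term.length - 1)) ? "\\b" : "";
  return `${lead}${term}${trail}`;
}

/** Split "Anonymous=>ABC" into a term and an optional custom replacement. */
function parseTerm(entry: string): { term: string; replacement?: string } {
  const idx = entry.indexOf("=>");
  if (idx === -1) {
    return { term: entry.trim() };
  }
  return {
    term: entry.slice(0, idx).trim(),
    replacement: entry.slice(idx + 2).trim(),
  };
}

/** Replace each term with its custom replacement, or with XXXX-<position>. */
function anonymize(content: string, entries: string[]): string {
  let out = content;
  entries.forEach((entry, i) => {
    const { term, replacement } = parseTerm(entry);
    if (term === "") {
      return; // skip empty entries
    }
    // The default mask keeps the 1-based list position, so "Beta" in
    // ["Alpha=>AAA", "Beta"] still maps to XXXX-2.
    const mask = replacement ?? `XXXX-${i + 1}`;
    const pattern = new RegExp(withWordBoundaries(term), "g");
    // A function replacer avoids "$" in the mask being read as a
    // special replacement pattern.
    out = out.replace(pattern, () => mask);
  });
  return out;
}

const sample = "Alpha and Beta by @tdurieux";
const scrubbed = anonymize(sample, ["Alpha=>AAA", "Beta", "@tdurieux"]);
console.log(scrubbed);
```

With the naive `\b${term}\b` wrapping, `@tdurieux` would not match in this sample, because a space followed by `@` is not a word/non-word transition. Dropping the leading `\b` for terms that start with a non-word character is what lets the match succeed.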