anonymous_github

mirror of https://github.com/tdurieux/anonymous_github.git synced 2026-06-30 02:55:30 +02:00

Author	SHA1	Message	Date
tdurieux	f0f6436370	feat: resolve Git LFS pointers via the raw URL endpoint Files tracked by Git LFS used to come out as the pointer text: version https://git-lfs.github.com/spec/v1 oid sha256:... size ... …because GitHub's blob API returns the pointer, not the resolved content. Detect that prefix on the first ~150 bytes of the blob stream and switch to a fresh fetch via the web raw URL (github.com/<owner>/<repo>/raw/<commit>/<path>), which auto-redirects to media.githubusercontent.com and resolves the LFS object — auth header carries through. Non-LFS files are forwarded through the existing pipeline unchanged. Fixes #95.	2026-05-04 12:18:55 +02:00
tdurieux	ef78e8ff3c	feat: preserve raw bytes when anonymization is a no-op When the anonymizer doesn't change a slice's text, the streamer used to push Buffer.from(out, "utf8") — which loses any invalid-UTF-8 bytes in the input (replaced by U+FFFD via StringDecoder). Files mistakenly classified as text (binary blobs without a known extension, text with stray non-UTF-8 bytes, BOMs) came out corrupted even though nothing in the term list matched. Track the raw chunk bytes alongside the decoded `pending`. On flush — where we have every byte buffered — emit the original buffer directly when the output equals the input, so a pure passthrough is bit-exact. In the streaming OVERLAP path, do the same when the decode for that slice round-trips losslessly; fall back to encoded output otherwise (unchanged from before for that case). Also add the "missing_content" locale entry for the /api/anonymize-preview route.	2026-05-04 11:52:03 +02:00
tdurieux	4aeecd0fdb	fix: recognize LICENSE / COPYING / etc. as text files istextorbinary returns null for filenames with no extension, and the isTextFile() guard treated null as "not text" — so terms in LICENSE, COPYING, AUTHORS, README (extensionless), CHANGELOG, NOTICE, and similar conventional filenames went through the binary passthrough in AnonymizeTransformer and were never anonymized. Add a small whitelist of these names ahead of the istextorbinary call. Fixes #493.	2026-05-04 11:34:54 +02:00
tdurieux	f91db91cee	wip	2026-05-04 11:30:42 +02:00
tdurieux	652299cda6	fix: follow GitHub repo rename via stored repository id When a user renamed the original GitHub repository, anonymous_github kept calling oct.repos.get({owner, repo}) with the cached old name and got 404, marking the link broken even though the repository still existed at a new path. Recover the numeric GitHub id from the dbModel's externalId ("gh_<id>"). On a 404 from the name-based lookup, fall back to GET /repositories/<id>, which returns the renamed repo. The caller already updates source.repositoryName from r.full_name afterwards. Fixes #409.	2026-05-04 11:15:32 +02:00
tdurieux	117406f2ce	refactor: route anonymize preview through the backend The form's live README/PR preview was running its own copy of ContentAnonimizer in the browser. The two implementations had been drifting — recent fixes for word boundaries (#175/#249), accent matching (#280), custom replacements (#285), and the diacritic-stripped variants only landed on the server. Reviewers saw one anonymization; authors composing the form saw another. Add POST /api/anonymize-preview that takes a snippet (or a batch) plus the user's options and runs them through the same ContentAnonimizer the file route uses. Replace the client-side anonymizeReadme() body with a debounced call to that endpoint. The PR view's anonymizePrContent() runs as a synchronous template expression, so it now reads from a {original -> anonymized} cache that's refreshed in the background whenever the PR details, terms, or options change. Single-flight + debounce keep the form responsive; an in-flight request is dropped on the next change.	2026-05-04 11:05:50 +02:00
tdurieux	3f095f0734	always use githubStream	2026-05-04 10:47:58 +02:00
tdurieux	4bc83db416	feat: per-term anonymization output via term=>replacement A term entered as "Anonymous=>ABC" now scrubs "Anonymous" to "ABC" instead of "XXXX-N". Lets users keep anonymized identifiers valid in source code (no hyphen) and align tokens between paper text and repo. Indexing for default-mask terms is unchanged: a list of "Alpha=>AAA", "Beta" still produces XXXX-2 for Beta. Fixes #285.	2026-05-04 09:31:31 +02:00
tdurieux	a30ab7fb96	fix: don't declare Accept-Ranges: none for binary files The server set Accept-Ranges: none on every file response. For text we anonymize on the fly so byte ranges aren't meaningful, but binary entries pass through unchanged — and the explicit "none" header makes some browsers refuse to play <video>/<audio> elements that would otherwise fall back to a full download. Newly uploaded MP4s under the inline-preview threshold rendered as a blank progress bar (#538). Only set Accept-Ranges: none for text entries; let binary entries omit it so the standard fallback kicks in. Fixes #538.	2026-05-03 21:23:59 +02:00
tdurieux	88fe8570fd	fix: include file path in cache ETag Without the path, two different files in the same repo (same sha, same anonymization options) shared an ETag. If a browser ever sent the cached ETag for one file while requesting another, the server would have returned 304 against the wrong cache entry. Fold the path into the ETag so each file has its own fingerprint. Follow-up to `b3c1030` (#439).	2026-05-03 21:19:39 +02:00
tdurieux	b3c1030e5c	fix: revalidate cached files when anonymization options change Files were being served with Cache-Control: max-age=18144000 (210 days) keyed only on the upstream ?v=<sha>. Editing the term list left the same URL serving stale anonymized bytes — visible to users in regular tabs but not in incognito. The previous fix-by-incognito recipe in #439 is exactly this. Switch to ETag-based revalidation that fingerprints both the upstream sha and the saved anonymization options, with Cache-Control: no-cache, must-revalidate. Browsers now revalidate on every request and get a 304 when nothing has changed, or fresh content as soon as terms, image/link/etc. options are updated. Fixes #439.	2026-05-03 20:31:32 +02:00
tdurieux	2eb19904db	fix: anonymize accented and unaccented variants of the same term When a user added "Davó" to the term list, "Davo" elsewhere in the content was left untouched (and vice versa). Each term now also runs a diacritic-insensitive pass: ASCII Latin letters expand to a class covering common accented siblings, with Unicode-aware lookaround boundaries so the trailing boundary still fires next to "ó" etc. Pure helpers moved into src/core/term-matching so the test file can import them instead of duplicating the logic. Fixes #280.	2026-05-03 20:18:49 +02:00
tdurieux	d8b129c670	fix: anonymize entries when downloading the full repo as a zip The streaming zip pipeline was constructing AnonymizeTransformer first and then assigning opt.filePath afterwards. AnonymizeTransformer determines isText in its constructor from opt.filePath, so every entry was classified as binary and passed through unchanged — the downloaded zip leaked the original (un-anonymized) terms even though the web view scrubbed them. Pass filePath via the constructor so isText is computed correctly. Fixes #342, #349.	2026-05-03 19:47:10 +02:00
tdurieux	9313c42fcf	fix: match terms whose edges are non-word characters Wrapping every user term as `\b${term}\b` silently dropped matches when the term started or ended with a non-word char (e.g. `@tdurieux`, `@author .*`), because JS `\b` only fires at a word/non-word transition. Replace with `withWordBoundaries()`, which only emits `\b` on the side where the term has a word-char edge. Fixes #175, #249.	2026-05-03 18:35:00 +02:00
tdurieux	a5f66d6844	multiple fixes	2026-05-03 15:30:54 +02:00
tdurieux	88f826aab4	update design	2026-04-24 14:55:18 +02:00
Thomas Durieux	188066e91d	Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669)	2026-04-15 09:41:00 +02:00
Thomas Durieux	8198a4b44a	Standardize error responses with consistent format and human-readable messages (#667)	2026-04-15 09:27:08 +02:00
Thomas Durieux	f4209110c7	Fix all 93 ESLint issues (3 errors, 90 warnings) (#666)	2026-04-15 09:04:22 +02:00
Thomas Durieux	95324fd09f	Replace isomorphic-dompurify with sanitize-html for Node 21 compat (#663)	2026-04-15 04:57:21 +02:00
Thomas Durieux	655ae92c4c	Remove OpenTelemetry tracing infrastructure (#662)	2026-04-15 04:39:08 +02:00
Thomas Durieux	c6d6806d7a	fix: prevent admin status reset on login (#659)	2026-04-15 04:34:36 +02:00
Thomas Durieux	f3641c8ce3	Set up CI with ESLint linter and Mocha test runner (#661)	2026-04-15 04:34:03 +02:00
Thomas Durieux	812f8b6314	Sanitize markdown HTML output with DOMPurify to prevent XSS (#658)	2026-04-15 04:22:38 +02:00
Thomas Durieux	b2d77faa6c	try to fix repo access	2025-04-01 22:27:41 +02:00
tdurieux	f93eb8787e	fix: protect archive.finalize	2024-07-22 16:31:52 +02:00
tdurieux	d8dd408a65	fix: avoid cache of list of files	2024-07-22 16:20:18 +02:00
tdurieux	f81c63d2af	fix: improve perf of getToken	2024-06-19 10:16:38 +02:00
tdurieux	532c094388	fix: improve token management	2024-06-18 12:00:53 +02:00
tdurieux	dcf483ea03	feat: improve download anonymized repository	2024-05-06 11:52:32 +02:00
tdurieux	93606a5c39	fix: catch error when requesting a folder	2024-05-03 10:49:25 +02:00
tdurieux	ca04339529	feat: list files in folder in webview	2024-05-02 11:49:00 +01:00
tdurieux	ed11e9db36	fix: undefined in path.join	2024-05-02 11:09:20 +01:00
tdurieux	3536f78a99	fix(#283): Fix webview for web in folder	2024-04-30 11:28:32 +01:00
tdurieux	3a00a27153	feat: improve support for binary & audio files	2024-04-28 10:01:40 +01:00
tdurieux	72c8f80bce	fix: fix file path in webview	2024-04-28 09:38:49 +01:00
tdurieux	17abc47d08	fix: fix webview on root repo	2024-04-28 08:08:39 +01:00
tdurieux	17cb1f294f	chore: remove all reference to originalFiles	2024-04-27 18:04:51 +01:00
tdurieux	3d3a03fd04	add index on path and repoId for files	2024-04-27 17:49:13 +01:00
tdurieux	378942a28e	fix: fix file list collection	2024-04-27 17:05:39 +01:00
tdurieux	2a145730b7	Improve log and GH token validation	2024-04-27 16:19:33 +01:00
tdurieux	6476899764	fix: fix webview	2024-04-26 14:01:46 +01:00
tdurieux	a86e050f8b	fix: handle empty repository	2024-04-26 13:48:32 +01:00
tdurieux	8712746e93	feat: check if file list exist when checking if repo is ready	2024-04-26 13:10:09 +01:00
tdurieux	b0fa5e6689	fix: hot fix, replace repoID by repoId	2024-04-26 12:40:56 +01:00
tdurieux	a9fefcc970	chore: remove console.log	2024-04-26 10:51:43 +01:00
tdurieux	710f7328e7	feat: flatten file tree for better performance	2024-04-26 10:32:09 +01:00
tdurieux	daf3276f7f	fix: fix queue admin	2024-04-12 09:56:39 +01:00
tdurieux	f65d167532	fix: use correct hostname for the streamer	2024-04-11 21:38:24 +01:00
tdurieux	03835e86ab	fix: handle error in queue admin	2024-04-11 17:48:41 +01:00
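
Commit 9313c42fcf describes replacing a blanket `\b${term}\b` wrapper with `withWordBoundaries()`, which emits `\b` only on a side where the term has a word-character edge. The following is a minimal sketch of that idea, not the repository's code: the function body, the `WORD_CHAR` constant, and the sample sentence are assumptions made for illustration, and the term is treated as regular-expression source, as in the commit's own examples.

```typescript
// Sketch only: the repository's actual withWordBoundaries() may differ.
const WORD_CHAR = /\w/;

// Add \b only on the sides where the term starts or ends with a word character.
function withWordBoundaries(term: string): string {
  if (term.length === 0) return term;
  const lead = WORD_CHAR.test(term.charAt(0)) ? "\\b" : "";
  const trail = WORD_CHAR.test(term.charAt(term.length - 1)) ? "\\b" : "";
  return lead + term + trail;
}

const sample = "Maintained by @tdurieux.";

// The old wrapping: \b cannot fire between a space and "@", so nothing matches.
const naive = new RegExp("\\b@tdurieux\\b");
// The selective wrapping: no leading \b, trailing \b after the final "x".
const fixed = new RegExp(withWordBoundaries("@tdurieux"));

console.log(naive.test(sample)); // false
console.log(fixed.test(sample)); // true
```

This sketch inspects the raw first and last characters only, so a term ending in an escape such as `\d` would be treated as having a word-character edge; a fuller version would need to account for that.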

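Commit 4bc83db416 introduces `term=>replacement` entries and notes that default-mask numbering is unchanged, so the list "Alpha=>AAA", "Beta" still yields XXXX-2 for Beta. One way to read that rule is sketched below. The `TermRule` type, the `parseTerms` name, and the choice to split on the first `=>` are hypothetical and are not taken from the repository.

```typescript
// Hypothetical shape for a parsed term; not the repository's data model.
interface TermRule {
  pattern: string;
  replacement: string;
}

// Split each entry on the first "=>". Entries without it fall back to the
// default mask, numbered by their 1-based position in the full list.
function parseTerms(entries: string[], mask: string = "XXXX"): TermRule[] {
  return entries.map((entry, index) => {
    const sep = entry.indexOf("=>");
    if (sep === -1) {
      return { pattern: entry, replacement: `${mask}-${index + 1}` };
    }
    return {
      pattern: entry.slice(0, sep),
      replacement: entry.slice(sep + 2),
    };
  });
}

const rules = parseTerms(["Alpha=>AAA", "Beta"]);
console.log(rules[0].replacement); // "AAA"
console.log(rules[1].replacement); // "XXXX-2"
```

Numbering by list position, rather than by count of default-mask terms, is what keeps Beta at XXXX-2 in the commit's example.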

290 Commits
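
Commit f0f6436370 detects Git LFS pointers by checking the start of the blob for the `version https://git-lfs.github.com/spec/v1` prefix, then refetches through the web raw URL. The sketch below illustrates only the detection and URL-building steps, under stated assumptions: the function names are hypothetical, the caller is assumed to pass in the first bytes of the stream (the commit mentions roughly 150), and percent-encoding each path segment is a guess rather than something the commit states.

```typescript
// The pointer prefix quoted in the commit message.
const LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1";

// Compare the leading bytes against the ASCII prefix. `head` is assumed to
// hold the first bytes of the blob stream.
function looksLikeLfsPointer(head: Uint8Array): boolean {
  if (head.length < LFS_POINTER_PREFIX.length) return false;
  for (let i = 0; i < LFS_POINTER_PREFIX.length; i++) {
    if (head[i] !== LFS_POINTER_PREFIX.charCodeAt(i)) return false;
  }
  return true;
}

// Build github.com/<owner>/<repo>/raw/<commit>/<path>, the URL shape named in
// the commit. Encoding each segment is an assumption of this sketch.
function lfsRawUrl(owner: string, repo: string, commit: string, path: string): string {
  const encodedPath = path
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  return `https://github.com/${owner}/${repo}/raw/${commit}/${encodedPath}`;
}

// Helper for the example: turn an ASCII string into bytes.
function asciiBytes(text: string): Uint8Array {
  return Uint8Array.from(Array.from(text, (ch) => ch.charCodeAt(0)));
}

const pointer = asciiBytes(LFS_POINTER_PREFIX + "\noid sha256:...\nsize ...\n");
console.log(looksLikeLfsPointer(pointer)); // true
console.log(looksLikeLfsPointer(asciiBytes("plain file content"))); // false
```

Comparing bytes directly avoids decoding the blob as text, so binary content that happens to be non-UTF-8 is never misread during the check.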