getFiles blindly appended fetched entries to $scope.files, so
re-opening a folder duplicated its children in the tree. Drop any
existing entries at the requested path before appending.
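A minimal sketch of the de-duplication, assuming tree entries carry a `path` field naming their parent directory (the interface and helper name here are hypothetical, not the actual scope code):

```typescript
interface TreeEntry {
  path: string; // directory the entry lives in
  name: string;
}

// Drop any previously fetched entries at dirPath before appending the
// fresh ones, so re-opening a folder cannot duplicate its children.
function mergeEntries(
  existing: TreeEntry[],
  dirPath: string,
  fetched: TreeEntry[]
): TreeEntry[] {
  return existing.filter((e) => e.path !== dirPath).concat(fetched);
}
```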
* chore(deps): bump uuid and bullmq
Removes [uuid](https://github.com/uuidjs/uuid). It's no longer used after updating ancestor dependency [bullmq](https://github.com/taskforcesh/bullmq). These dependencies need to be updated together.
Removes `uuid`
Updates `bullmq` from 2.4.0 to 5.76.5
- [Release notes](https://github.com/taskforcesh/bullmq/releases)
- [Commits](https://github.com/taskforcesh/bullmq/compare/v2.4.0...v5.76.5)
---
updated-dependencies:
- dependency-name: bullmq
dependency-version: 5.76.5
dependency-type: direct:production
- dependency-name: uuid
dependency-version:
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
* fix(bullmq): adapt isRunning + getJobs typing for v5 API
Worker.isRunning became a method (was a property in v2), and
Queue.getJobs now requires a mutable JobType[] (was string[]).
* clean up
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tdurieux <durieuxthomas@hotmail.com>
Co-authored-by: Thomas Durieux <5577568+tdurieux@users.noreply.github.com>
Files with extensions unknown to mime-types (such as .jsonl) fell through
to application/octet-stream and rendered as "Unsupported binary file" in
the viewer. Replace istextorbinary with isbinaryfile for content-based
detection, and use mime-types for name-based classification with a
textual application/* allowlist.
The streaming transformer now defers classification when the name is
inconclusive and sniffs the first chunk before emitting "transform",
so route.ts and AnonymizedFile.ts get a content-aware Content-Type.
Whitelists .jsonl and .ndjson to short-circuit dataset files.
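The name-based side of the classifier could look roughly like this; the extension table is an illustrative subset and the function name is hypothetical, with "defer" standing in for the deferred, sniff-the-first-chunk path:

```typescript
// Illustrative subset of an extension -> MIME table.
const EXT_MIME: Record<string, string> = {
  ".json": "application/json",
  ".txt": "text/plain",
  ".png": "image/png",
};
// Textual application/* types that should not be treated as binary.
const TEXTUAL_APPLICATION = new Set(["application/json", "application/xml"]);
// Dataset files short-circuited straight to text.
const DATASET_EXTS = new Set([".jsonl", ".ndjson"]);

type Classification = "text" | "binary" | "defer";

function classifyByName(filename: string): Classification {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot).toLowerCase();
  if (DATASET_EXTS.has(ext)) return "text"; // whitelist short-circuit
  const mime = EXT_MIME[ext];
  if (!mime) return "defer"; // inconclusive: sniff the first chunk instead
  if (mime.startsWith("text/") || TEXTUAL_APPLICATION.has(mime)) return "text";
  return "binary";
}
```

Returning "defer" instead of defaulting to binary is what lets the streaming transformer make a content-aware decision later.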
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rewrite repo_access_limited, repo_not_found, repo_empty, and
repo_not_accessible to point users at concrete next steps instead
of stating only that something failed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up review pass after the cache fixes turned up several bugs in
the same family — silent failures that look like success to the client,
plus content-correctness issues in the ZIP and per-file delivery paths.
- zipStream: stop calling archive.finalize() on upstream/parser errors.
That produced a valid-looking ZIP (200 OK, archive opens) silently
missing entries — same class as #694, but worse because the user has
no signal anything went wrong. Destroy the response on failure
instead so the client sees a connection drop.
- zipStream: apply per-repo image/pdf gates inside the entry handler.
The single-file /file/... endpoint refuses to serve those types
via AnonymizedFile.isFileSupported when image=false / pdf=false, but
the ZIP shipped them anyway — privacy-relevant for maintainers who
toggle image=false to suppress identifying screenshots. Threaded
contentOptions through both ZIP entry points (direct and streamer).
- GitHubUtils.getToken: validate the OAuth token-refresh response
before persisting. On a non-2xx response or a body without a string
token, we used to overwrite the stored token with `undefined`, which
then propagated as `Authorization: token undefined` to every API
call — 401 even on public repos, with the config.GITHUB_TOKEN
fallback unreachable because the field was no longer falsy.
- AnonymizedFile.send (streamer branch): forward Content-Type from the
upstream streamer response. got.stream(...).pipe(res) carries body
bytes only, so the parent response had no Content-Type and browsers
guessed (text rendered as download, etc.). Also resolve on
res.on("finish") in addition to "close" — keep-alive sockets stay
open long after the response is delivered, delaying countView().
- Repository.updateIfNeeded: persist a renamed source.repositoryName
even when the commit hasn't changed. Previously the new value lived
in memory only and was overwritten on the next reload, so the
rename detection ran every request.
- Repository.anonymize: stop materialising a dummy {path:"",name:""}
FileModel for empty repos. That row collided with the special case
in AnonymizedFile.getFileInfo and surfaced in unfiltered listings.
- streamer/route POST /: reject filePath segments containing ".." or
empty parts. Defence in depth — the parent server validates against
FileModel before calling, but the streamer joins filePath straight
into the storage path, so any future caller forwarding an
unvalidated path could traverse out of the repo root.
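The segment check on filePath can be sketched as a pure predicate (name hypothetical; this version also rejects "." segments, which the commit doesn't mention but which fall out of the same defence):

```typescript
// Reject traversal and degenerate segments before the streamer joins
// filePath into the storage path. Defence in depth: the parent server
// validates against FileModel, but future callers might not.
function isSafeFilePath(filePath: string): boolean {
  if (filePath === "") return false;
  const parts = filePath.split("/");
  return parts.every((p) => p !== "" && p !== "." && p !== "..");
}
```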
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up to the GitHubStream cache fixes. The same poisoned-cache
class existed in the GitHubDownload path and a few related spots:
- GitHubDownload.download: wipe pre-existing state before extracting
and write a .anon-complete marker only after a successful extract.
On error, rm the partial cache so a retry starts clean. getFileContent
and getFiles now gate on the marker instead of "any file/folder
exists," so a half-extracted tree can never be served as canonical.
- GitHubDownload.getFileContent: validate cached file size against the
upstream FileModel size (via the new AnonymizedFile.size()), same
guard as GitHubStream. getFiles filters the marker from the listing.
- FileSystem.listFiles: drop the bogus stats.ino.toString() as sha.
An inode isn't a content hash; anything comparing it to a Git blob
sha would silently disagree. Leave undefined.
- S3.write: remove the fire-and-forget data.on("error") -> this.rm(...).
Multipart Upload doesn't commit partial objects, so there was nothing
to clean up, and the handler raced retries and could delete a
previously-good object on a transient source-stream hiccup. The
size-validated read path recovers from any other undersized objects.
- GitHubStream.resolveLfsPointer: drop the post-decision early-return
in blobStream.on("error"). Currently redundant with the inner
listener, but removes the future-refactor footgun.
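The marker-gated cache lifecycle from the first bullet reduces to three small operations; the marker filename matches the commit, the helper names are hypothetical:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

const MARKER = ".anon-complete";

// Written only after a successful extract, so the marker's presence
// proves the whole tree is on disk.
function markComplete(cacheDir: string): void {
  fs.writeFileSync(path.join(cacheDir, MARKER), "");
}

// Readers gate on the marker instead of "any file exists", so a
// half-extracted tree can never be served as canonical.
function isCacheComplete(cacheDir: string): boolean {
  return fs.existsSync(path.join(cacheDir, MARKER));
}

// On extract error, remove the partial cache so a retry starts clean.
function discardPartial(cacheDir: string): void {
  fs.rmSync(cacheDir, { recursive: true, force: true });
}
```

Listings then filter MARKER out so it never appears as a repo file.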
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A failed/interrupted GitHub fetch could leave a 0-byte or truncated
file in the local cache. Subsequent reads happily streamed the empty
content as the file's body — visible to users as an "Empty file" with
HTTP 200. Reproduced on artifact-70B6/Lethe/configs.py (#694).
- FileSystem.write: stream into a sibling .tmp and rename into place
only on finish. Stream errors discard the tmp and leave any prior
cached file untouched. Drop the utf-8 encoding that was silently
corrupting binary blobs.
- GitHubStream.getFileContentCache: accept an expected size and treat
cached.size < expected as a poisoned cache (truncated fetch) → rm
and re-fetch. cached.size >= expected is accepted, which keeps
Git LFS-resolved files (whose FileModel.size is the pointer size)
working.
- AnonymizedFile: expose size() and pass it through to the streamer
alongside sha so the cache check has the upstream size.
Existing poisoned entries self-heal on next access.
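The two mechanisms can be sketched as below; function names are hypothetical, and the real FileSystem.write streams rather than buffering, but the tmp-then-rename and size-comparison logic is the same shape:

```typescript
import * as fs from "node:fs";

// Write into a sibling .tmp and rename into place only on success, so a
// failed write discards the tmp and never clobbers a previously good
// cached file. No encoding is applied, keeping binary blobs intact.
function atomicWrite(dest: string, data: Buffer): void {
  const tmp = dest + ".tmp";
  try {
    fs.writeFileSync(tmp, data);
    fs.renameSync(tmp, dest);
  } catch (err) {
    fs.rmSync(tmp, { force: true }); // drop the partial tmp
    throw err;
  }
}

// A cached file smaller than the upstream size is a truncated (poisoned)
// fetch; >= is accepted, which keeps Git LFS-resolved files (whose
// FileModel.size is the pointer size) working.
function isPoisoned(cachedSize: number, expectedSize?: number): boolean {
  return expectedSize !== undefined && cachedSize < expectedSize;
}
```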
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When a GitHub repo is renamed and looked up by its new name, the lookup
by name misses but a record with the same externalId still exists,
causing E11000 on save. Fall back to a lookup by externalId before
creating a new document.
Fixes #500
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Split into builder (node:21-slim, full deps + tsc/gulp) and runtime
(node:21-alpine, production deps only). Drops several hundred MB from
the published image and removes dev tooling from the runtime layer.
Builds and pushes tdurieux/anonymous_github to Docker Hub on push to
main, v* tags, and manual dispatch. Multi-arch (amd64/arm64) with GHA
layer caching. Closes #478.
Files tracked by Git LFS used to come out as the pointer text:
version https://git-lfs.github.com/spec/v1
oid sha256:...
size ...
…because GitHub's blob API returns the pointer, not the resolved
content. Detect that prefix on the first ~150 bytes of the blob stream
and switch to a fresh fetch via the web raw URL
(github.com/<owner>/<repo>/raw/<commit>/<path>), which auto-redirects
to media.githubusercontent.com and resolves the LFS object — auth
header carries through. Non-LFS files are forwarded through the
existing pipeline unchanged.
Fixes #95.
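A sketch of the detection and the redirect URL (helper names hypothetical; the prefix string is the one the spec line quoted above starts with):

```typescript
const LFS_PREFIX = "version https://git-lfs.github.com/spec/v1";

// LFS pointers are tiny text files opening with the spec version line,
// so sniffing the start of the blob stream is enough to decide.
function isLfsPointer(firstBytes: Buffer): boolean {
  return firstBytes.subarray(0, LFS_PREFIX.length).toString("utf8") === LFS_PREFIX;
}

// The web raw URL auto-redirects to media.githubusercontent.com and
// resolves the LFS object; the auth header carries through.
function rawUrl(owner: string, repo: string, commit: string, filePath: string): string {
  return `https://github.com/${owner}/${repo}/raw/${commit}/${filePath}`;
}
```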
The Anonymize form's preview built the readme baseUrl as
"https://github.com/<owner>/<repo>/raw/<source.branch>/". When the
form rendered before the branch field had populated (initial load,
or while waiting on getBranches), the URL became ".../raw//" and
the browser collapsed the empty segment, fetching ".../raw/<file>"
instead of ".../raw/<branch>/<file>". Relative <img src="./X">
references then 404'd against a path with no branch — exactly the
"branch missing" pattern in #407.
Fall back to details.defaultBranch (then "main") so the base URL is
always well-formed.
Fixes #407.
The Anonymize form used the cached RepositoryModel for hasPage,
defaultBranch, etc. — so enabling GitHub Pages (or changing the
default branch) on the source after first cache wouldn't reflect in
the UI, leaving the GitHub Pages checkbox grayed out.
Pass force=1 when loading the form's repo details so the backend
re-queries the GitHub API once. The cost is a single GET /repos/...
call per form load.
Fixes #364.
When the anonymizer doesn't change a slice's text, the streamer used
to push Buffer.from(out, "utf8") — which loses any invalid-UTF-8 bytes
in the input (replaced by U+FFFD via StringDecoder). Files
mistakenly classified as text (binary blobs without a known extension,
text with stray non-UTF-8 bytes, BOMs) came out corrupted even though
nothing in the term list matched.
Track the raw chunk bytes alongside the decoded `pending`. On flush —
where we have every byte buffered — emit the original buffer directly
when the output equals the input, so a pure passthrough is bit-exact.
In the streaming OVERLAP path, do the same when the decode for that
slice round-trips losslessly; fall back to encoded output otherwise
(unchanged from before for that case).
Also add the "missing_content" locale entry for the
/api/anonymize-preview route.
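The bit-exact passthrough rule boils down to one decision per slice (helper name hypothetical; the real transformer tracks raw bytes alongside the decoded `pending` string):

```typescript
// When anonymization left the decoded slice unchanged, emit the raw bytes:
// re-encoding the decoded string would have replaced any invalid-UTF-8
// input bytes with U+FFFD. Changed output must be encoded, since the
// replacement text only exists as a string.
function emitSlice(raw: Buffer, decoded: string, anonymized: string): Buffer {
  return anonymized === decoded ? raw : Buffer.from(anonymized, "utf8");
}
```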
istextorbinary returns null for filenames with no extension, and the
isTextFile() guard treated null as "not text" — so terms in LICENSE,
COPYING, AUTHORS, README (extensionless), CHANGELOG, NOTICE, and
similar conventional filenames went through the binary passthrough
in AnonymizeTransformer and were never anonymized.
Add a small whitelist of these names ahead of the istextorbinary call.
Fixes #493.
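The whitelist check amounts to a name-set lookup ahead of the extension-based detector; the set below is an illustrative superset of the names the commit lists, and the function name is hypothetical:

```typescript
// Conventional extensionless text filenames that istextorbinary cannot
// classify (it returns null without an extension).
const TEXT_NAMES = new Set([
  "license", "copying", "authors", "readme",
  "changelog", "notice", "contributors",
]);

function isKnownTextName(filename: string): boolean {
  const base = filename.split("/").pop() ?? filename;
  // Names with an extension still go through the normal detector.
  return !base.includes(".") && TEXT_NAMES.has(base.toLowerCase());
}
```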
The .leftCol-foot was hidden by default and only revealed inside the
mobile/tablet media query, so the last-update line never appeared on
wider viewports.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two regressions stacked from the recent tree work:
1. expandAllFolders (#496) was marking every folder open, including
folders whose children weren't fetched yet. The directive then
rendered an empty <ul> after each <a>, and the openFolder handler's
"no sibling means we need to load" check silently treated the empty
<ul> as already-loaded — so clicking the folder toggled the class
but the children never appeared.
Skip folders with empty children when pre-expanding, and harden the
click handler so an empty <ul> still triggers a fetch.
2. The $routeUpdate handler (#510 follow-up) became async and called
$scope.$apply(updateContent) at the end. Inside an already-running
digest cycle this no-ops or throws, leaving file navigation stuck.
Run updateContent() synchronously like before, and kick off any
missing parent-directory fetches in the background — getContent()
already falls back to sha "0" when the metadata isn't loaded yet.
When a user renamed the original GitHub repository, anonymous_github
kept calling oct.repos.get({owner, repo}) with the cached old name and
got 404, marking the link broken even though the repository still
existed at a new path.
Recover the numeric GitHub id from the dbModel's externalId
("gh_<id>"). On a 404 from the name-based lookup, fall back to
GET /repositories/<id>, which returns the renamed repo. The caller
already updates source.repositoryName from r.full_name afterwards.
Fixes #409.
Clicking a markdown link into a subdirectory's README threw
"Cannot read properties of undefined (reading 'sha')" and left the
viewer on Loading…. The route handler called updateContent() without
loading the new directory's file listing, so getSelectedFile() returned
undefined and getContent() then dereferenced fileInfo.sha.
Two fixes:
- getContent() falls back to sha "0" when fileInfo is undefined.
- The $routeUpdate handler walks the new path and loads any directory
listings that aren't yet in $scope.files before rendering, so the
selected file actually has its sha by the time we fetch.
Fixes #510.
The form's live README/PR preview was running its own copy of
ContentAnonimizer in the browser. The two implementations had been
drifting — recent fixes for word boundaries (#175/#249), accent
matching (#280), custom replacements (#285), and the diacritic-stripped
variants only landed on the server. Reviewers saw one anonymization;
authors composing the form saw another.
Add POST /api/anonymize-preview that takes a snippet (or a batch) plus
the user's options and runs them through the same ContentAnonimizer
the file route uses. Replace the client-side anonymizeReadme() body
with a debounced call to that endpoint. The PR view's
anonymizePrContent() runs as a synchronous template expression, so it
now reads from a {original -> anonymized} cache that's refreshed in
the background whenever the PR details, terms, or options change.
Single-flight + debounce keep the form responsive; an in-flight
request is dropped on the next change.
Entering an IP address (e.g. 192.168.1.1) or any term with regex
metacharacters made the form invalid because the "regex characters
detected" hint was wired up via $setValidity('terms', 'regex', false).
The text in the UI labels it as a warning, but the form treated it as
an error and refused to save.
Track the warning as a plain $scope flag and show it via ng-show on
that flag, so the form stays valid (#430).
A term entered as "Anonymous=>ABC" now scrubs "Anonymous" to "ABC"
instead of "XXXX-N". Lets users keep anonymized identifiers valid in
source code (no hyphen) and align tokens between paper text and repo.
Indexing for default-mask terms is unchanged: a list of
"Alpha=>AAA", "Beta" still produces XXXX-2 for Beta.
Fixes #285.
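The parsing and positional indexing can be sketched like this (type and function names are hypothetical):

```typescript
interface ParsedTerm {
  term: string;
  replacement?: string; // undefined -> default XXXX-N mask
}

// "Anonymous=>ABC" scrubs "Anonymous" to "ABC"; a bare term keeps the
// default mask. Only the first "=>" splits.
function parseTerm(raw: string): ParsedTerm {
  const i = raw.indexOf("=>");
  if (i === -1) return { term: raw };
  return { term: raw.slice(0, i), replacement: raw.slice(i + 2) };
}

// Default-mask indexing is positional over the full list, so
// ["Alpha=>AAA", "Beta"] still yields XXXX-2 for Beta.
function maskFor(terms: string[], index: number): string {
  const parsed = parseTerm(terms[index]);
  return parsed.replacement ?? `XXXX-${index + 1}`;
}
```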
The file tree opened collapsed, requiring the reviewer to click each
folder before they could see what was inside. Walk the tree on first
render and mark every folder open in $scope.opens. Folders the user
has explicitly toggled (a previous entry already exists in
$scope.opens) are left as-is, so collapsing still works.
Fixes #496.
The server set Accept-Ranges: none on every file response. For text we
anonymize on the fly so byte ranges aren't meaningful, but binary
entries pass through unchanged — and the explicit "none" header makes
some browsers refuse to play <video>/<audio> elements that would
otherwise fall back to a full download. Newly uploaded MP4s under the
inline-preview threshold rendered as a blank progress bar (#538).
Only set Accept-Ranges: none for text entries; let binary entries omit
it so the standard fallback kicks in.
Fixes #538.
Without the path, two different files in the same repo (same sha, same
anonymization options) shared an ETag. If a browser ever sent the cached
ETag for one file while requesting another, the server would have
returned 304 against the wrong cache entry. Fold the path into the
ETag so each file has its own fingerprint.
Follow-up to b3c1030 (#439).
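The fingerprint could be computed along these lines; the field separators, hash choice, and truncation here are illustrative, not the actual implementation:

```typescript
import { createHash } from "node:crypto";

// Fold the file path into the ETag alongside the content sha and the
// saved anonymization options, so two files with the same sha and
// options still get distinct fingerprints.
function fileEtag(sha: string, filePath: string, options: object): string {
  const h = createHash("sha256");
  h.update(sha).update("\0").update(filePath).update("\0").update(JSON.stringify(options));
  return `"${h.digest("hex").slice(0, 32)}"`;
}
```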
Replace the outdated user-images.githubusercontent.com screenshot in
the README with a locally hosted image, and regenerate the in-app
screenshots to reflect the current UI.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Files were being served with Cache-Control: max-age=18144000 (210 days)
keyed only on the upstream ?v=<sha>. Editing the term list left the
same URL serving stale anonymized bytes — visible to users in regular
tabs but not in incognito. The previous fix-by-incognito recipe in #439
is exactly this.
Switch to ETag-based revalidation that fingerprints both the upstream
sha and the saved anonymization options, with Cache-Control:
no-cache, must-revalidate. Browsers now revalidate on every request and
get a 304 when nothing has changed, or fresh content as soon as terms,
image/link/etc. options are updated.
Fixes #439.
The viewer already supported jumping to a line via #L42 in the URL but
never produced one — users had to type it manually. Wire guttermousedown
on the ACE editor to replaceState a #L<n> hash, with shift-click for a
range. Also reapply the highlight on hashchange so pasting a URL into
the address bar works without reload.
Fixes #392.
When a user added "Davó" to the term list, "Davo" elsewhere in the
content was left untouched (and vice versa). Each term now also runs a
diacritic-insensitive pass: ASCII Latin letters expand to a class
covering common accented siblings, with Unicode-aware lookaround
boundaries so the trailing boundary still fires next to "ó" etc.
Pure helpers moved into src/core/term-matching so the test file can
import them instead of duplicating the logic.
Fixes #280.
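The class expansion can be sketched as follows; the sibling table is an illustrative subset, the function name is hypothetical, and the lookaround boundaries the commit mentions are omitted for brevity:

```typescript
// Illustrative accent-sibling table (the real mapping covers more letters).
const SIBLINGS: Record<string, string> = {
  a: "aàáâãäå", c: "cç", e: "eèéêë", i: "iìíîï",
  n: "nñ", o: "oòóôõö", u: "uùúûü", y: "yýÿ",
};

// Strip the term's own accents, then expand each ASCII letter to its
// accent class, so "Davó" matches "Davo" and vice versa.
function diacriticInsensitive(term: string): RegExp {
  const base = term.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
  const body = [...base]
    .map((ch) => {
      const cls = SIBLINGS[ch.toLowerCase()];
      if (cls) return `[${cls}]`;
      return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // escape metachars
    })
    .join("");
  return new RegExp(body, "giu");
}
```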
urlRel2abs() prepended an extra "." when it saw "./X", turning the
relative path into "../X" and silently moving up a directory. As a
result, raw HTML <img src="./imgs/run.png"> inside a README rendered
under /r/<repo>/<file> resolved to /r/<repo>/imgs/... instead of
/r/<repo>/<dir>/imgs/..., so the image 404'd. Markdown image syntax
went through marked-base-url and was unaffected.
Strip the leading "./" instead so the relative path concatenates
cleanly with baseUrl.
Fixes #346.
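The corrected resolution logic amounts to this (function name hypothetical; the real urlRel2abs also handles more URL shapes):

```typescript
// Strip a leading "./" so the relative path concatenates cleanly with
// baseUrl; the old code prepended "." and produced "../X", moving up a
// directory. Already-absolute URLs pass through untouched.
function rel2abs(baseUrl: string, rel: string): string {
  if (/^(https?:)?\/\//.test(rel) || rel.startsWith("/")) return rel;
  return baseUrl + rel.replace(/^\.\//, "");
}
```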
The streaming zip pipeline was constructing AnonymizeTransformer first and
then assigning opt.filePath afterwards. AnonymizeTransformer determines
isText in its constructor from opt.filePath, so every entry was classified
as binary and passed through unchanged — the downloaded zip leaked the
original (un-anonymized) terms even though the web view scrubbed them.
Pass filePath via the constructor so isText is computed correctly.
Fixes #342, #349.
marked v12 dropped the headerIds option, so headings rendered with no
id attributes and links like [Releases](#releases-and-contributing)
silently failed to scroll. Add a heading renderer that emits a
GitHub-style slug id, with a numeric suffix for duplicates within a
document.
Fixes #390.
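The slugger can be sketched as a closure over per-document state (names hypothetical; GitHub's actual slug algorithm handles more Unicode cases than this):

```typescript
// GitHub-style slug: lowercase, drop punctuation, hyphenate whitespace,
// and suffix duplicates with -1, -2, ... within a document.
function makeSlugger(): (text: string) => string {
  const seen = new Map<string, number>();
  return (text: string): string => {
    const slug = text
      .toLowerCase()
      .trim()
      .replace(/[^\w\- ]+/g, "")
      .replace(/\s+/g, "-");
    const n = seen.get(slug) ?? 0;
    seen.set(slug, n + 1);
    return n === 0 ? slug : `${slug}-${n}`;
  };
}
```

A fresh slugger per document keeps the duplicate counters from leaking across renders.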