anonymous_github

mirror of https://github.com/tdurieux/anonymous_github.git synced 2026-06-29 18:50:00 +02:00

Author	SHA1	Message	Date
Thomas Durieux	e4ffd74068	Security hardening + gist UI fixes (#731) * security: harden against XSS, ReDoS, path traversal, and injection Defensive fixes across the server, storage, and viewer: - XSS (CWE-79): sanitise rendered notebooks with DOMPurify, escape file names interpolated into AngularJS expressions (escapeNgString), set Mermaid securityLevel to 'strict', and stop urlRel2abs from returning javascript:/vbscript:/data:text/html URLs. - Path traversal / zip-slip (CWE-22/23/24): validate URL-derived path components before they reach the storage layer (file/webview routes + StorageBase.assertSafePath) and sanitise zip entry names on extract for both the filesystem and S3 backends. - ReDoS (CWE-1333): escape anonymization terms with catastrophic backtracking shapes to literals instead of compiling them as regexes. - Secret hardening (CWE-798): require SESSION_SECRET / OAuth creds / DB password in production, random dev SESSION_SECRET fallback. - Rate-limit spoofing (CWE-290): derive request.ip via trust-proxy hop count instead of the client-settable cf-connecting-ip header. - NoSQL injection (CWE-943): allow only plain field paths as admin sort keys. - Reject malformed streamer requests missing required string fields. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(ui): make gists reachable/visible and clarify the ZIP button - Gist & PR routes now accept a trailing slash (/gist/:id/:path?), so the dashboard links (which end in "/") resolve to the gist/PR page instead of falling through to the 404 route (#725). - Gist viewer picks the default tab after content loads, defaulting to "files" when files exist; previously the ng-init ran before the async load and a files-only gist rendered blank under the hidden comments tab. - Explorer toolbar: relabel ZIP to "Full repo ZIP" with a tooltip, and add tooltips to Raw/Download clarifying they apply to the current file (#721). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> fix: report SAML-enforced orgs clearly instead of "token expired" When a repo's organization enforces SAML SSO, GitHub returns a 403 whose message differs from the OAuth-App-restriction case. That 403 fell through to the generic handler and surfaced as "token_expired", pushing users to re-login when the real fix is authorizing their token for the org. Detect the "SAML enforcement" message and raise a dedicated, actionable error instead (#379, #550). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * security: catch nested quantified groups in ReDoS guard and backslash path traversal - hasCatastrophicBacktracking now scans across nested parens ([\s\S]?) so shapes like ((a+))+ are detected; comment reframed as a heuristic backstop rather than a proof. - file route path-traversal check now rejects backslash separators and a leading backslash, covering Windows-style "..\" payloads (CWE-22/25). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> chore(dev): track dev-proxy script, ignore .DS_Store and .claude/ scripts/dev-proxy.js is referenced by the "dev:ui" npm script but was never committed, breaking the command on a fresh clone. Add it and ignore local-only macOS/Claude Code files. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:50:55 +02:00
tdurieux	39fadd6cf0	fix: auth issue & profile save issue	2026-05-18 16:04:09 +03:00
tdurieux	e73ad48115	fix: file download large repo	2026-05-12 21:00:47 +03:00
tdurieux	898f18919e	fix rate limit	2026-05-12 20:51:08 +03:00
tdurieux	427e26062e	fix	2026-05-12 20:34:32 +03:00
tdurieux	03e18fd572	repo change + daily stat improvements	2026-05-11 12:10:17 +03:00
tdurieux	3eeed23609	handle memory issues	2026-05-07 21:01:07 +03:00
tdurieux	369fd8edb2	redis cache	2026-05-07 15:55:28 +03:00
tdurieux	b37a814f3a	improve queue	2026-05-07 14:58:36 +03:00
tdurieux	f817a29a4b	loading improvements	2026-05-07 08:30:31 +03:00
tdurieux	2de08c3df3	Add missing error handlers on stream pipelines - AnonymizedFile.anonymizedContent(): propagate content errors to the anonymizer so callers see the failure instead of hanging. - AnonymizedFile.send() local path: add error handler on the anonymizer transform between content and response pipes. - S3.send(): handle errors on the S3 body stream to avoid unhandled emits crashing the process. - S3.archive() / FileSystem.archive(): propagate read-stream errors to the file transformer so archiver sees the failure. - Add frontend translations for new error codes.	2026-05-07 07:47:29 +03:00
tdurieux	7a163f2d35	Fix streamer crash and misclassified transient GitHub errors Add missing error handler on the anonymizer transform stream in the streamer route — without it, an upstream error tears down the pipe and the anonymizer emits an unhandled error that crashes the process (surfacing as ECONNRESET to the main server). Classify transient network errors (ReadError, ECONNRESET, ETIMEDOUT) as upstream_error/502 instead of file_not_found/404 so they are distinguishable in logs and don't cache-poison downstream. Update handleError tests to match the existing sanitization behavior that returns internal_error for non-AnonymousError instances.	2026-05-07 07:44:15 +03:00
tdurieux	4ab8e0d1cd	Handle GitHub 409 "repository is empty" error in getCommitInfo When a GitHub repo has no commits, the API returns 409 which was unhandled, causing raw HttpError warnings. Now throws repo_empty AnonymousError consistent with the existing convention.	2026-05-07 07:42:05 +03:00
tdurieux	fbbc694747	improve styling	2026-05-07 07:34:30 +03:00
tdurieux	e59527bc78	Remove all user repositories when banning Use removeQueue instead of cacheQueue so each repo transitions to REMOVING status and is fully deleted, not just cache-cleared.	2026-05-07 06:04:55 +03:00
tdurieux	9292c19392	Handle GitHub 422 errors as commit_not_found and sanitize error responses	2026-05-07 05:54:51 +03:00
tdurieux	9403f15ac3	Align error logging with admin dashboard field conventions All warn/error log calls now use field names the dashboard's decorate() function recognizes: `code` for the error code pill, `httpStatus` for the status badge and severity bucket, `url` for the sidebar link, and `repoId` for the repository link. Key changes: - Streamer errors surface code, httpStatus, url, and nested err in Raw tab - Nested `{ err: serializeError(e) }` replaced with spread pattern so error fields (name, message, status) appear at the top level - Raw Error objects in catch blocks now go through serializeError() - Rate limit, token, and PR 404 warnings include code + httpStatus - Dashboard stack walker traverses both `cause` and `err` chains - Dashboard Raw tab renders repoId, filePath, upstream*, err, and cause - trimRawArg recursively trims stacks in nested err/cause chains - clampPayload strips heavy nested fields before falling back to truncated placeholder, preserving flat diagnostic fields	2026-05-07 05:54:18 +03:00
tdurieux	b8cfe293ea	Fix BullMQ "Custom Id cannot be integers" error by prefixing jobId	2026-05-07 05:53:26 +03:00
tdurieux	8fc7ac5175	Add user ban/activate feature Add admin endpoints to ban and activate users, block banned users from all auth flows (OAuth, token login, bearer auth), and invalidate existing sessions on next request. Includes frontend translation and user detail page ban/activate buttons.	2026-05-07 05:41:12 +03:00
tdurieux	d9104c2ec2	Update commit on branch refresh and validate commit exists on save Refresh button now always updates the commit to the latest SHA instead of preserving the stale one in edit mode. Both create and update routes verify the commit still exists on GitHub before persisting.	2026-05-06 21:14:53 +03:00
tdurieux	bd8656206a	fix persistence bugs	2026-05-06 20:00:59 +03:00
tdurieux	67cb2538b1	fix old github download repos	2026-05-06 19:37:16 +03:00
tdurieux	da78708b7b	Improve error handling	2026-05-06 18:43:36 +03:00
tdurieux	aae6eae6eb	handle rate limit	2026-05-06 17:50:01 +03:00
tdurieux	6bad6c2f09	fix bugs and report better errors	2026-05-06 17:26:47 +03:00
tdurieux	3b27816702	fix incremental	2026-05-06 17:12:58 +03:00
tdurieux	804bbffb7a	Improve error handling	2026-05-06 17:03:19 +03:00
tdurieux	48e782946a	Improve error handling	2026-05-06 16:56:07 +03:00
tdurieux	cf2f172aca	fix(gist): set gist subpaths individually to avoid CastError Mongoose treats `gist` as a nested path, not a sub-schema, so set("gist", payload) mis-casts the inner subdoc arrays and fails validation with 'Cast to [string] failed' at gist.files.0. Set each subpath individually so the files/comments arrays cast correctly.	2026-05-06 16:52:48 +03:00
tdurieux	dcb524c8c1	Improve error handling	2026-05-06 16:45:22 +03:00
tdurieux	3613c895c8	improve logging	2026-05-06 16:31:10 +03:00
tdurieux	873c910dd3	Improve error dashboard	2026-05-06 16:12:37 +03:00
tdurieux	6f418d6332	log AnonymousError context detail in serialized output Surface the path/repo/url tied to an AnonymousError when it gets serialized for logging — previously logs only carried name, message, and httpStatus, which made file_not_found entries impossible to trace back to a specific file or repo. Extract the existing detail formatting out of toString() into a public detail() method, harden it against AnonymizedFile getters that can throw, and have serializeError include the result as a "detail" field. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-06 13:46:06 +03:00
tdurieux	06a098fba7	perf(deploy): faster builds and zero-downtime streamer rollover - Multi-stage Dockerfile with BuildKit npm cache mounts and a separate prod-deps stage so source edits don't reinstall or prune. - Tighter .dockerignore to shrink build context. - Healthchecks: add start_period and tighten interval/retries so containers report healthy as soon as the process is actually ready instead of after a full polling interval. - Move recoverStuckPreparing() off the startup critical path; the recovery sweep now runs in the background after app.listen. - depends_on uses condition: service_healthy and the obsolete compose 'version' key is gone. - New scripts/build.sh + scripts/deploy.sh: deploy.sh builds, exits early if the image is unchanged, runs a blue/green streamer swap (scale to 2N, wait healthy in parallel, drop olds), then recreates the API with --no-deps to avoid compose's depends_on re-poll.	2026-05-06 13:38:19 +03:00
tdurieux	371693dc3b	fix: end stream for files <= 150 bytes in resolveLfsPointer The LFS-pointer probe buffered up to 150 bytes before deciding whether to forward the blob or swap to the raw URL. For blobs that fit entirely in the probe, decide() ran from the source's end event and attached data/end listeners to an already-ended stream, so out.end() was never called. The response hung until upstream timed out and storage.write left an incomplete cached copy, which then forced a re-fetch on every subsequent read. Pass a sourceEnded flag through decide() and end the output directly when the source has already finished. Also skip the GitHub blob fetch when the tree size is already over MAX_FILE_SIZE, surfacing file_too_big instead of a translated 422.	2026-05-06 13:10:25 +03:00
tdurieux	7dd6d872e9	Improve error handling	2026-05-06 12:44:07 +03:00
tdurieux	e1cf8e3a00	Improve error handling	2026-05-06 12:26:12 +03:00
tdurieux	c4182b5b2a	fix: prevent E11000 on concurrent getRepositoryFromGitHub Replace check-then-insert with atomic findOneAndUpdate upsert keyed on externalId, plus a single E11000 retry fallback. Eliminates the duplicate key race when two requests resolve the same gh_<id> concurrently.	2026-05-06 12:22:42 +03:00
tdurieux	c2d43164d0	error logging improvement, regex fix	2026-05-06 11:16:12 +03:00
dependabot[bot]	e34f45522f	chore(deps): bump uuid and bullmq (#691) * chore(deps): bump uuid and bullmq Removes [uuid](https://github.com/uuidjs/uuid). It's no longer used after updating ancestor dependency [bullmq](https://github.com/taskforcesh/bullmq). These dependencies need to be updated together. Removes `uuid` Updates `bullmq` from 2.4.0 to 5.76.5 - [Release notes](https://github.com/taskforcesh/bullmq/releases) - [Commits](https://github.com/taskforcesh/bullmq/compare/v2.4.0...v5.76.5) --- updated-dependencies: - dependency-name: bullmq dependency-version: 5.76.5 dependency-type: direct:production - dependency-name: uuid dependency-version: dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> * fix(bullmq): adapt isRunning + getJobs typing for v5 API Worker.isRunning became a method (was a property in v2), and Queue.getJobs now requires a mutable JobType[] (was string[]). * clean up --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: tdurieux <durieuxthomas@hotmail.com> Co-authored-by: Thomas Durieux <5577568+tdurieux@users.noreply.github.com>	2026-05-06 08:40:12 +02:00
tdurieux	7e0f900c7a	fix: resolve eslint unused-var and useless-assignment warnings Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-06 08:22:45 +03:00
tdurieux	79f555769d	improve binary file detection: content sniffing + jsonl support Files like .jsonl that mime-types doesn't know fell through to application/octet-stream and rendered as "Unsupported binary file" in the viewer. Replace istextorbinary with isbinaryfile for content-based detection, and use mime-types for name-based classification with a textual application/* allowlist. The streaming transformer now defers classification when the name is inconclusive and sniffs the first chunk before emitting "transform", so route.ts and AnonymizedFile.ts get a content-aware Content-Type. Whitelists .jsonl and .ndjson to short-circuit dataset files. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-06 07:52:48 +03:00
tdurieux	27d6b56da7	fix multiple bugs	2026-05-05 12:34:03 +03:00
tdurieux	c903627da0	fix build	2026-05-05 10:35:08 +03:00
tdurieux	f8c91ca0af	multiple fixes	2026-05-05 10:32:31 +03:00
tdurieux	5b72b630c4	fix: silent-truncation, token-refresh, and content-type bugs across hot paths Follow-up review pass after the cache fixes turned up several bugs in the same family — silent failures that look like success to the client, plus content-correctness issues in the ZIP and per-file delivery paths. - zipStream: stop calling archive.finalize() on upstream/parser errors. That produced a valid-looking ZIP (200 OK, archive opens) silently missing entries — same class as #694, but worse because the user has no signal anything went wrong. Destroy the response on failure instead so the client sees a connection drop. - zipStream: apply per-repo image/pdf gates inside the entry handler. The single-file /file/... endpoint refuses to serve those types via AnonymizedFile.isFileSupported when image=false / pdf=false, but the ZIP shipped them anyway — privacy-relevant for maintainers who toggle image=false to suppress identifying screenshots. Threaded contentOptions through both ZIP entry points (direct and streamer). - GitHubUtils.getToken: validate the OAuth token-refresh response before persisting. On a non-2xx response or a body without a string token, we used to overwrite the stored token with `undefined`, which then propagated as `Authorization: token undefined` to every API call — 401 even on public repos, with the config.GITHUB_TOKEN fallback unreachable because the field was no longer falsy. - AnonymizedFile.send (streamer branch): forward Content-Type from the upstream streamer response. got.stream(...).pipe(res) carries body bytes only, so the parent response had no Content-Type and browsers guessed (text rendered as download, etc.). Also resolve on res.on("finish") in addition to "close" — keep-alive sockets stay open long after the response is delivered, delaying countView(). - Repository.updateIfNeeded: persist a renamed source.repositoryName even when the commit hasn't changed. Previously the new value lived in memory only and was overwritten on the next reload, so the rename detection ran every request. - Repository.anonymize: stop materialising a dummy {path:"",name:""} FileModel for empty repos. That row collided with the special case in AnonymizedFile.getFileInfo and surfaced in unfiltered listings. - streamer/route POST /: reject filePath segments containing ".." or empty parts. Defence in depth — the parent server validates against FileModel before calling, but the streamer joins filePath straight into the storage path, so any future caller forwarding an unvalidated path could traverse out of the repo root. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-05 09:19:05 +03:00
tdurieux	f413a30313	fix(cache): make Zip-source caches atomic and robust to partial state Follow-up to the GitHubStream cache fixes. The same poisoned-cache class existed in the GitHubDownload path and a few related spots: - GitHubDownload.download: wipe pre-existing state before extracting and write a .anon-complete marker only after a successful extract. On error, rm the partial cache so a retry starts clean. getFileContent and getFiles now gate on the marker instead of "any file/folder exists," so a half-extracted tree can never be served as canonical. - GitHubDownload.getFileContent: validate cached file size against the upstream FileModel size (via the new AnonymizedFile.size()), same guard as GitHubStream. getFiles filters the marker from the listing. - FileSystem.listFiles: drop the bogus stats.ino.toString() as sha. An inode isn't a content hash; anything comparing it to a Git blob sha would silently disagree. Leave undefined. - S3.write: remove the fire-and-forget data.on("error") -> this.rm(...). Multipart Upload doesn't commit partial objects, so there was nothing to clean up, and the handler raced retries and could delete a previously-good object on a transient source-stream hiccup. The size-validated read path recovers from any other undersized objects. - GitHubStream.resolveLfsPointer: drop the post-decision early-return in blobStream.on("error"). Currently redundant with the inner listener, but removes the future-refactor footgun. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-05 08:54:42 +03:00
tdurieux	9adff11e74	fix(cache): atomic file writes and size-validated cache reads A failed/interrupted GitHub fetch could leave a 0-byte or truncated file in the local cache. Subsequent reads happily streamed the empty content as the file's body — visible to users as an "Empty file" with HTTP 200. Reproduced on artifact-70B6/Lethe/configs.py (#694). - FileSystem.write: stream into a sibling .tmp and rename into place only on finish. Stream errors discard the tmp and leave any prior cached file untouched. Drop the utf-8 encoding that was silently corrupting binary blobs. - GitHubStream.getFileContentCache: accept an expected size and treat cached.size < expected as a poisoned cache (truncated fetch) → rm and re-fetch. cached.size >= expected is accepted, which keeps Git LFS-resolved files (whose FileModel.size is the pointer size) working. - AnonymizedFile: expose size() and pass it through to the streamer alongside sha so the cache check has the upstream size. Existing poisoned entries self-heal on next access. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-05 08:47:41 +03:00
tdurieux	53959f677c	fix(repo): reuse existing record on rename to avoid externalId duplicate key When a GitHub repo is renamed and looked up by its new name, the lookup by name misses but a record with the same externalId still exists, causing E11000 on save. Fall back to a lookup by externalId before creating a new document. Fixes #500 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-05 00:59:38 +03:00
tdurieux	dee406e2ea	update design	2026-05-05 00:36:42 +03:00
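
Two of the commits listed above (5b72b630c4 and e4ffd74068) describe rejecting file paths whose segments contain "..", empty parts, or backslash separators before the path reaches the storage layer. The following is a minimal sketch of that kind of check, not the repository's code: the helper name isSafeRelativePath and the exact rule set are assumptions made for illustration.

```typescript
// Hypothetical helper (name and rules are assumptions, not taken from the repository).
// Returns true only when every "/"-separated segment is non-empty and not "..",
// and the path contains no backslash.
function isSafeRelativePath(filePath: string): boolean {
  // Reject backslashes outright so Windows-style "..\" payloads are refused.
  if (filePath.includes("\\")) {
    return false;
  }
  const segments = filePath.split("/");
  // An empty segment appears for a leading "/", a trailing "/", or a doubled "//".
  return segments.every((segment) => segment !== "" && segment !== "..");
}
```

A caller would typically run this on the URL-derived path and return a 4xx error on false, before joining the path onto a storage root.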


342 Commits
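
Commit e4ffd74068 mentions a ReDoS guard that treats anonymization terms with catastrophic-backtracking shapes, such as ((a+))+, as literals instead of compiling them as regexes. The sketch below shows one way such a heuristic could look. It is an assumption-laden illustration, not the project's hasCatastrophicBacktracking: the function names are hypothetical, character classes are not parsed, and it is a backstop rather than a proof of safety.

```typescript
// Hypothetical heuristic: true when a quantified group contains another quantifier,
// at any nesting depth, e.g. (a+)+ or ((a+))+. Character classes are NOT parsed.
function hasNestedQuantifier(pattern: string): boolean {
  // One flag per open group: has a quantifier been seen inside it?
  const stack: boolean[] = [];
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "\\") {
      i++; // skip the escaped character
      continue;
    }
    if (ch === "(") {
      stack.push(false);
      continue;
    }
    if (ch === ")") {
      const inner = stack.pop() ?? false;
      const next = pattern[i + 1];
      const quantified = next === "+" || next === "*" || next === "{";
      if (inner && quantified) {
        return true;
      }
      // Propagate the inner quantifier to the enclosing group, if there is one.
      if (inner && stack.length > 0) {
        stack[stack.length - 1] = true;
      }
      continue;
    }
    if ((ch === "+" || ch === "*" || ch === "{") && stack.length > 0) {
      stack[stack.length - 1] = true;
    }
  }
  return false;
}

// Escape every regex metacharacter so the term matches literally.
function escapeRegExp(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Compile the term as a regex only when it looks safe and is valid;
// otherwise fall back to a literal match.
function toSafeRegExp(term: string): RegExp {
  if (hasNestedQuantifier(term)) {
    return new RegExp(escapeRegExp(term));
  }
  try {
    return new RegExp(term);
  } catch {
    return new RegExp(escapeRegExp(term));
  }
}
```

With this sketch, a term like (a+)+ only matches the literal text "(a+)+", while an ordinary term such as a+b still behaves as a regex.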