fix(cache): atomic file writes and size-validated cache reads

A failed/interrupted GitHub fetch could leave a 0-byte or truncated
file in the local cache. Subsequent reads happily streamed the empty
content as the file's body — visible to users as an "Empty file" with
HTTP 200. Reproduced on artifact-70B6/Lethe/configs.py (#694).

- FileSystem.write: stream into a sibling .tmp and rename into place
  only on finish. Stream errors discard the tmp and leave any prior
  cached file untouched. Drop the utf-8 encoding that was silently
  corrupting binary blobs.
- GitHubStream.getFileContentCache: accept an expected size and treat
  cached.size < expected as a poisoned cache (truncated fetch) → rm
  and re-fetch. cached.size >= expected is accepted, which keeps
  Git LFS-resolved files (whose FileModel.size is the pointer size)
  working.
- AnonymizedFile: expose size() and pass it through to the streamer
  alongside sha so the cache check has the upstream size.
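
The write-to-temp-then-rename step from the first bullet can be sketched roughly as follows. This is a minimal illustration, not the project's FileSystem.write: the helper name `writeAtomically` and the `<dest>.tmp` naming are assumptions made for the example.

```typescript
import { createWriteStream, promises as fsp } from "fs";
import { Readable } from "stream";
import { pipeline } from "stream/promises";

// Hypothetical helper (not the project's FileSystem.write).
export async function writeAtomically(
  dest: string,
  source: Readable
): Promise<void> {
  // Sibling temp file: same directory, so the rename stays on one filesystem.
  const tmp = `${dest}.tmp`;
  try {
    // No encoding is set, so bytes are written as-is (safe for binary blobs).
    await pipeline(source, createWriteStream(tmp));
    // Only a completely written file is moved into place.
    await fsp.rename(tmp, dest);
  } catch (err) {
    // Discard the partial temp file; any previous file at dest is untouched.
    await fsp.rm(tmp, { force: true });
    throw err;
  }
}
```

Because the rename happens only after the stream finishes, a reader of `dest` sees either the previous complete file or the new complete file, never a partial one.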

Existing poisoned entries self-heal on next access.
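
The self-healing read can be illustrated with a small sketch. The names `readValidated` and `fetchFresh` are hypothetical stand-ins (the real logic lives in GitHubStream.getFileContentCache), and the plain `writeFile` call is a simplification of the atomic write described above.

```typescript
import { promises as fsp } from "fs";

// Hypothetical helper; `fetchFresh` stands in for the upstream GitHub fetch.
export async function readValidated(
  cachePath: string,
  expectedSize: number | undefined,
  fetchFresh: () => Promise<Buffer>
): Promise<Buffer> {
  try {
    const stat = await fsp.stat(cachePath);
    // Smaller than upstream means truncated. Larger is accepted, which
    // covers LFS-resolved files whose recorded size is the pointer size.
    const poisoned = expectedSize !== undefined && stat.size < expectedSize;
    if (!poisoned) return await fsp.readFile(cachePath);
    // Poisoned entry: drop it and fall through to a fresh fetch.
    await fsp.rm(cachePath, { force: true });
  } catch (err) {
    // A missing cache file is a normal miss; anything else is a real error.
    if ((err as NodeJS.ErrnoException).code !== "ENOENT") throw err;
  }
  const fresh = await fetchFresh();
  // Simplification: a real implementation would write atomically.
  await fsp.writeFile(cachePath, fresh);
  return fresh;
}
```

In this sketch, a call without an expected size skips validation and serves whatever is cached.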

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tdurieux
2026-05-05 08:47:41 +03:00
parent 53959f677c
commit 9adff11e74
4 changed files with 76 additions and 11 deletions
+3 -1
@@ -42,6 +42,8 @@ router.post("/", async (req: express.Request, res: express.Response) => {
 const repoFullName = req.body.repoFullName.split("/");
 const repoId = req.body.repoId;
 const fileSha = req.body.sha;
+const fileSize: number | undefined =
+  typeof req.body.size === "number" ? req.body.size : undefined;
 const commit = req.body.commit;
 const filePath = req.body.filePath;
 const anonymizerOptions = req.body.anonymizerOptions;
@@ -58,7 +60,7 @@ router.post("/", async (req: express.Request, res: express.Response) => {
 const content = await source.getFileContentCache(
   filePath,
   repoId,
-  () => fileSha
+  () => ({ sha: fileSha, size: fileSize })
 );
 const mime = lookup(filePath);
 if (mime && !filePath.endsWith(".ts")) {