A failed or interrupted GitHub fetch could leave a 0-byte or truncated
file in the local cache. Subsequent reads streamed whatever was cached
as the file's body, which users saw as an "Empty file" served with
HTTP 200. Reproduced on artifact-70B6/Lethe/configs.py (#694).
- FileSystem.write: stream into a sibling .tmp and rename into place
only on finish. Stream errors discard the tmp and leave any prior
cached file untouched. Drop the utf-8 encoding that was silently
corrupting binary blobs.
- GitHubStream.getFileContentCache: accept an expected size and treat
cached.size < expected as a poisoned cache (truncated fetch): delete
the entry and re-fetch. cached.size >= expected is accepted, which
keeps Git LFS-resolved files (whose FileModel.size is the pointer
size) working.
- AnonymizedFile: expose size() and pass it through to the streamer
alongside sha so the cache check has the upstream size.
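
A minimal sketch of the write-to-tmp-then-rename pattern described for
FileSystem.write, assuming Node's built-in fs and stream modules. The
function name writeAtomically and the tmp suffix are hypothetical and
chosen for illustration; the real method may differ in both.

```typescript
import { createWriteStream } from "node:fs";
import { mkdir, rename, rm } from "node:fs/promises";
import { dirname } from "node:path";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

// Hypothetical helper, not the project's actual FileSystem.write.
export async function writeAtomically(
  dest: string,
  source: Readable
): Promise<void> {
  await mkdir(dirname(dest), { recursive: true });
  // Sibling tmp file: same directory, so the final rename stays on one
  // filesystem. The pid/timestamp suffix is an assumption of this sketch.
  const tmp = `${dest}.${process.pid}.${Date.now()}.tmp`;
  try {
    // No encoding option is passed, so bytes are written unchanged and
    // binary blobs are not re-encoded as utf-8.
    await pipeline(source, createWriteStream(tmp));
    // Only a fully written file ever appears under the final name.
    await rename(tmp, dest);
  } catch (err) {
    // Discard the partial tmp; any previously cached `dest` is untouched.
    await rm(tmp, { force: true });
    throw err;
  }
}
```

Readers of the cache path then see either the previous complete file or
the new complete file, never a partially written one.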
Existing poisoned entries self-heal on next access.
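
The size check behind that self-healing could look roughly like this.
It is a sketch only: isCacheUsable is a hypothetical helper, and
accepting the entry when no expected size is known is an assumption
rather than documented behaviour of getFileContentCache.

```typescript
import { rmSync, statSync } from "node:fs";

// Hypothetical helper: returns true when the cached file may be served,
// false when the caller should fetch from upstream again.
export function isCacheUsable(
  cachePath: string,
  expectedSize?: number
): boolean {
  let cachedSize: number;
  try {
    cachedSize = statSync(cachePath).size;
  } catch {
    return false; // nothing cached (or unreadable): fetch
  }
  // Smaller than upstream means a truncated fetch: remove and re-fetch.
  if (expectedSize !== undefined && cachedSize < expectedSize) {
    rmSync(cachePath, { force: true });
    return false;
  }
  // Equal or larger is accepted. Larger happens for LFS-resolved files,
  // where the expected size is the size of the small pointer file.
  return true;
}
```

Using < rather than !== is what lets an LFS-resolved file, which is
larger than its pointer, pass the check.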
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Files tracked by Git LFS used to come out as the pointer text:
    version https://git-lfs.github.com/spec/v1
    oid sha256:...
    size ...
…because GitHub's blob API returns the pointer, not the resolved
content. Detect that prefix on the first ~150 bytes of the blob stream
and switch to a fresh fetch via the web raw URL
(github.com/<owner>/<repo>/raw/<commit>/<path>), which auto-redirects
to media.githubusercontent.com and resolves the LFS object; the auth
header carries through. Non-LFS files are forwarded through the
existing pipeline unchanged.
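
The detection step and the raw URL described above can be sketched as
follows. Both function names are hypothetical, the per-segment path
encoding is an assumption of this sketch, and the actual stream
handling (buffering the head, replaying it for non-LFS files) is left
out.

```typescript
// First line of every Git LFS pointer file, as quoted above.
const LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1";

// Hypothetical helper: inspects only the head of the blob stream.
export function looksLikeLfsPointer(head: Uint8Array): boolean {
  // latin1 maps every byte to one character, so binary data never
  // fails to decode; only the leading ASCII prefix matters here.
  const text = Buffer.from(head.subarray(0, 150)).toString("latin1");
  return text.startsWith(LFS_POINTER_PREFIX);
}

// Hypothetical helper: builds the web raw URL for a file at a commit.
export function rawUrl(
  owner: string,
  repo: string,
  commit: string,
  path: string
): string {
  // Encode each path segment but keep the slashes between them.
  const encodedPath = path.split("/").map(encodeURIComponent).join("/");
  return `https://github.com/${owner}/${repo}/raw/${commit}/${encodedPath}`;
}
```

A caller would buffer the first chunk of the blob, run
looksLikeLfsPointer on it, and either re-fetch from rawUrl(...) or
forward the buffered chunk followed by the rest of the stream.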
Fixes #95.