Mirror of https://github.com/tdurieux/anonymous_github.git (synced 2026-05-15 14:38:03 +02:00)
feat: preserve raw bytes when anonymization is a no-op
When the anonymizer doesn't change a slice's text, the streamer used to push Buffer.from(out, "utf8"), which loses any invalid-UTF-8 bytes in the input (replaced by U+FFFD via StringDecoder). Files mistakenly classified as text (binary blobs without a known extension, text with stray non-UTF-8 bytes, BOMs) came out corrupted even though nothing in the term list matched.

Track the raw chunk bytes alongside the decoded `pending`. On flush, where we have every byte buffered, emit the original buffer directly when the output equals the input, so a pure passthrough is bit-exact. In the streaming OVERLAP path, do the same when the decode for that slice round-trips losslessly; fall back to encoded output otherwise (unchanged from before for that case).

Also add the "missing_content" locale entry for the /api/anonymize-preview route.
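The passthrough rule described above can be illustrated with a minimal sketch. It is not the code from this commit: `anonymize`, `flushSlice`, and `roundTrips` are hypothetical names, and the term replacement is a deliberately naive stand-in for the real anonymizer. It only shows the two checks the message describes: emit the raw bytes when the output equals the decoded input, and test whether a decode round-trips losslessly.

```typescript
import { StringDecoder } from "node:string_decoder";

// Hypothetical stand-in for the project's anonymizer: replaces each term with a mask.
function anonymize(text: string, terms: string[]): string {
  let out = text;
  for (const term of terms) {
    if (term.length === 0) continue; // skip empty terms, which would match everywhere
    out = out.split(term).join("XXXX");
  }
  return out;
}

// Flush-path sketch: every byte of the slice is buffered in `raw`.
function flushSlice(raw: Buffer, terms: string[]): Buffer {
  const decoder = new StringDecoder("utf8");
  const decoded = decoder.write(raw) + decoder.end();
  const out = anonymize(decoded, terms);
  // Anonymization was a no-op: return the original bytes so the passthrough is bit-exact.
  if (out === decoded) return raw;
  // Text changed: fall back to re-encoding the anonymized string.
  return Buffer.from(out, "utf8");
}

// Streaming-path sketch: true when re-encoding the decoded text reproduces the raw bytes.
function roundTrips(raw: Buffer, decoded: string): boolean {
  return Buffer.from(decoded, "utf8").equals(raw);
}
```

With an input such as the bytes `68 69 ff fe` and a term list that does not match, `flushSlice` returns the original buffer, whereas re-encoding the decoded string would have replaced the two invalid bytes with encoded U+FFFD characters.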
@@ -38,6 +38,7 @@
 "options_not_provided": "Anonymization options are mandatory.",
 "terms_not_specified": "Anonymization terms must be specified.",
 "invalid_terms_format": "Terms are in an invalid format.",
+"missing_content": "No content was provided to the anonymization preview.",
 "unable_to_anonymize": "An error happened during the anonymization process. Please try later or report the issue.",
 "non_supported_mode": "The selected anonymization mode is invalid, only download and stream are supported.",
 "invalid_path": "The provided path is invalid or missing.",