feat: preserve raw bytes when anonymization is a no-op

When the anonymizer doesn't change a slice's text, the streamer used
to push Buffer.from(out, "utf8"), which loses any invalid UTF-8 bytes
in the input (StringDecoder replaces them with U+FFFD). Files
mistakenly classified as text (binary blobs without a known extension,
text with stray non-UTF-8 bytes, BOMs) came out corrupted even though
nothing in the term list matched.

Track the raw chunk bytes alongside the decoded `pending`. On flush,
where we have every byte buffered, emit the original buffer directly
when the output equals the input, so a pure passthrough is bit-exact.
In the streaming OVERLAP path, do the same when the decode for that
slice round-trips losslessly; otherwise fall back to the encoded
output, as before.
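
A minimal sketch of the decision described above, not the code in this
commit. The helper names `roundTripsLosslessly` and `pickOutputBytes`
are hypothetical, and the sketch uses the standard TextDecoder and
TextEncoder on a Uint8Array rather than StringDecoder and Buffer, which
is an assumption made to keep the example self-contained.

```typescript
// Illustrative sketch only: helper names are hypothetical, not the project's code.

/** True when decoding `raw` as UTF-8 and re-encoding it reproduces the same bytes. */
function roundTripsLosslessly(raw: Uint8Array): boolean {
  // ignoreBOM: true keeps a leading BOM in the decoded string instead of dropping it.
  const decoded = new TextDecoder("utf-8", { ignoreBOM: true }).decode(raw);
  const reencoded = new TextEncoder().encode(decoded);
  if (reencoded.length !== raw.length) return false;
  for (let i = 0; i < raw.length; i++) {
    if (reencoded[i] !== raw[i]) return false;
  }
  return true;
}

/**
 * Choose the bytes to emit for one slice.
 * - raw:     the original bytes of the slice
 * - input:   the decoded text handed to the anonymizer
 * - output:  the text the anonymizer returned
 * - isFlush: true on the final flush, where every byte is buffered
 */
function pickOutputBytes(
  raw: Uint8Array,
  input: string,
  output: string,
  isFlush: boolean
): Uint8Array {
  const unchanged = output === input;
  // No-op on flush: pass the original bytes through untouched.
  // No-op while streaming: only do so if the decode was lossless.
  if (unchanged && (isFlush || roundTripsLosslessly(raw))) {
    return raw;
  }
  // The text changed, or the streaming decode was lossy: encode the output.
  return new TextEncoder().encode(output);
}
```

With the input bytes 0x61 0xFF 0x62, an unchanged slice comes back
byte-for-byte on flush, while the streaming case falls back to the
encoded form because 0xFF does not survive the decode.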

Also add the "missing_content" locale entry for the
/api/anonymize-preview route.
tdurieux
2026-05-04 11:52:03 +02:00
parent 4aeecd0fdb
commit ef78e8ff3c
3 changed files with 146 additions and 2 deletions
+1
@@ -38,6 +38,7 @@
"options_not_provided": "Anonymization options are mandatory.",
"terms_not_specified": "Anonymization terms must be specified.",
"invalid_terms_format": "Terms are in an invalid format.",
"missing_content": "No content was provided to the anonymization preview.",
"unable_to_anonymize": "An error happened during the anonymization process. Please try later or report the issue.",
"non_supported_mode": "The selected anonymization mode is invalid, only download and stream are supported.",
"invalid_path": "The provided path is invalid or missing.",