feat: preserve raw bytes when anonymization is a no-op

When the anonymizer doesn't change a slice's text, the streamer used to push Buffer.from(out, "utf8"), which loses any invalid UTF-8 bytes in the input (StringDecoder has already replaced them with U+FFFD). Files mistakenly classified as text (binary blobs without a known extension, text with stray non-UTF-8 bytes, BOMs) came out corrupted even though nothing in the term list matched.

Track the raw chunk bytes alongside the decoded `pending`. On flush, where every byte is buffered, emit the original buffer directly when the output equals the input, so a pure passthrough is bit-exact. In the streaming OVERLAP path, do the same when the decode for that slice round-trips losslessly; otherwise fall back to the encoded output, as before.

Also add the "missing_content" locale entry for the /api/anonymize-preview route.
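
For illustration only, the passthrough decision described above might look roughly like the sketch below. The helper names (`roundTripsLosslessly`, `flushOutput`, `sliceOutput`) are hypothetical and do not come from the changed files; the sketch only assumes Node's `Buffer` API and that each slice's raw bytes are kept next to its decoded text.

```javascript
"use strict";

// Hypothetical helper: true when decoding `raw` as UTF-8 and re-encoding
// the result yields exactly the same bytes (no U+FFFD substitution happened).
function roundTripsLosslessly(raw) {
  const text = raw.toString("utf8");
  return Buffer.from(text, "utf8").equals(raw);
}

// Hypothetical flush-path decision: every byte is buffered, so if the
// anonymizer left the text untouched we can emit the original bytes as-is.
function flushOutput(rawAll, decodedAll, out) {
  return out === decodedAll ? rawAll : Buffer.from(out, "utf8");
}

// Hypothetical streaming-path decision: reuse the raw slice only when the
// text is unchanged AND the slice decodes losslessly; otherwise encode `out`.
function sliceOutput(rawSlice, decodedSlice, out) {
  if (out === decodedSlice && roundTripsLosslessly(rawSlice)) {
    return rawSlice;
  }
  return Buffer.from(out, "utf8");
}

// Example input: "hi" followed by 0xFF, which is never valid in UTF-8.
const raw = Buffer.from([0x68, 0x69, 0xff]);
const decoded = raw.toString("utf8"); // the 0xFF byte decodes to U+FFFD
const passthrough = flushOutput(raw, decoded, decoded);
const reencoded = Buffer.from(decoded, "utf8");
```

In this sketch `passthrough` is the original three-byte buffer, while `reencoded` is longer because U+FFFD encodes as three bytes; that difference is the corruption the commit avoids on a pure passthrough.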
2026-06-30 02:55:30 +02:00 · 2026-05-04 11:52:03 +02:00
parent 4aeecd0fdb
commit ef78e8ff3c
3 changed files with 146 additions and 2 deletions
@@ -38,6 +38,7 @@
    "options_not_provided": "Anonymization options are mandatory.",
    "terms_not_specified": "Anonymization terms must be specified.",
    "invalid_terms_format": "Terms are in an invalid format.",
+    "missing_content": "No content was provided to the anonymization preview.",
    "unable_to_anonymize": "An error happened during the anonymization process. Please try later or report the issue.",
    "non_supported_mode": "The selected anonymization mode is invalid, only download and stream are supported.",
    "invalid_path": "The provided path is invalid or missing.",