document how borg deals with non-unicode bytes in JSON output

2025-02-12 17:35:44 +00:00 · 2022-12-29 22:16:33 +01:00 · 2022-12-29 22:16:33 +01:00 · 8765e62bcd
commit 8765e62bcd
parent e63cfcd708
1 changed files with 36 additions and 2 deletions
--- a/docs/internals/frontends.rst
+++ b/docs/internals/frontends.rst
@ -29,6 +29,42 @@ On POSIX systems, you can usually set environment vars to choose a UTF-8 locale:
    export LC_CTYPE=en_US.UTF-8


+Dealing with non-unicode byte sequences and JSON limitations
+------------------------------------------------------------
+
+Paths on POSIX systems can have arbitrary bytes in them (except 0x00 which is used as string terminator in C).
+
+Nowadays, UTF-8 encoded paths (which decode to valid unicode) are the usual thing, but a lot of systems
+still have paths from the past, when other, non-unicode codings were used. Especially old Samba shares often
+have wild mixtures of misc. encodings, sometimes even very broken stuff.
+
+borg deals with such non-unicode paths ("with funny/broken characters") by decoding such byte sequences using
+UTF-8 coding and "surrogateescape" error handling mode, which maps invalid bytes to special unicode code points
+(surrogate escapes). When encoding such a unicode string back to a byte sequence, the original byte sequence
+will be reproduced exactly.
+
+JSON should only contain valid unicode text without any surrogate escapes, so we can't just directly have a
+surrogate-escaped path in JSON ("path" is only one example, this also affects other text-like content).
+
+Borg deals with this situation like this (since borg 2.0):
+
+For a valid unicode path (no surrogate escapes), the JSON will only have "path": path.
+
+For a non-unicode path (with surrogate escapes), the JSON will have 2 entries:
+
+- "path": path_approximation (pure valid unicode, all invalid bytes will show up as "?")
+- "path_b64": path_bytes_base64_encoded (if you decode the base64, you get the original path byte string)
+
+JSON users need to pick whatever suits their needs best. The suggested procedure (shown for "path") is:
+
+- check if there is a "path_b64" key.
+- if it is there, you will know that the original bytes path did not cleanly UTF-8-decode into unicode (has
+  some invalid bytes) and that the string given by the "path" key is only an approximation, but not the precise
+  path. if you need precision, you must base64-decode the value of "path_b64" and deal with the arbitrary byte
+  string you'll get. if an approximation is fine, use the value of the "path" key.
+- if it is not there, the value of the "path" key is all you need (the original bytes path is its UTF-8 encoding).
+
+
 Logging
 -------

@ -40,8 +76,6 @@ where each line is a JSON object. The *type* key of the object determines its ot
    parsing error will be printed in plain text, because logging set-up happens after all arguments are
    parsed.

-Since JSON can only encode text, any string representing a file system path may miss non-text parts.
-
 The following types are in use. Progress information is governed by the usual rules for progress information,
 it is not produced unless ``--progress`` is specified.