From 8765e62bcd6be1675b64ad3ae4a721137145bf77 Mon Sep 17 00:00:00 2001
From: Thomas Waldmann <tw@waldmann-edv.de>
Date: Thu, 29 Dec 2022 22:16:33 +0100
Subject: [PATCH] document how borg deals with non-unicode bytes in JSON output

---
 docs/internals/frontends.rst | 38 ++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/docs/internals/frontends.rst b/docs/internals/frontends.rst
index 42c6c67aa..7f2af1e5b 100644
--- a/docs/internals/frontends.rst
+++ b/docs/internals/frontends.rst
@@ -29,6 +29,42 @@ On POSIX systems, you can usually set environment vars to choose a UTF-8 locale:
     export LC_CTYPE=en_US.UTF-8
 
 
+Dealing with non-unicode byte sequences and JSON limitations
+------------------------------------------------------------
+
+Paths on POSIX systems can have arbitrary bytes in them (except 0x00, which is used as the string terminator in C).
+
+Nowadays, UTF-8 encoded paths (which decode to valid unicode) are the norm, but many systems
+still have paths from the past, when other, non-unicode encodings were used. Old Samba shares in particular
+often contain wild mixtures of encodings, sometimes even badly broken byte sequences.
+
+borg deals with such non-unicode paths ("with funny/broken characters") by decoding the byte sequences as
+UTF-8 with the "surrogateescape" error handler, which maps each invalid byte to a special unicode code point
+(a surrogate escape). When such a unicode string is encoded back to bytes the same way, the original byte
+sequence is reproduced exactly.
+
+JSON should only contain valid unicode text without any surrogate escapes, so a surrogate-escaped path
+cannot be put into JSON directly ("path" is only one example; this also affects other text-like content).
+
+Borg deals with this situation like this (since borg 2.0):
+
+For a valid unicode path (no surrogate escapes), the JSON will only have "path": path.
+
+For a non-unicode path (with surrogate escapes), the JSON will have 2 entries:
+
+- "path": path_approximation (pure valid unicode, all invalid bytes will show up as "?")
+- "path_b64": path_bytes_base64_encoded (if you decode the base64, you get the original path byte string)
+
+JSON users need to pick whatever suits their needs best. The suggested procedure (shown for "path") is:
+
+- check if there is a "path_b64" key.
+- if it is there, the original bytes path did not cleanly UTF-8-decode into unicode (it has some invalid
+  bytes) and the string given by the "path" key is only an approximation, not the precise path.
+  If you need precision, you must base64-decode the value of "path_b64" and deal with the arbitrary byte
+  string you get. If an approximation is fine, use the value of the "path" key.
+- if it is not there, the value of the "path" key is all you need (the original bytes path is its UTF-8 encoding).
+
+
 Logging
 -------
 
@@ -40,8 +76,6 @@ where each line is a JSON object. The *type* key of the object determines its ot
     parsing error will be printed in plain text, because logging set-up happens after all arguments are
     parsed.
 
-Since JSON can only encode text, any string representing a file system path may miss non-text parts.
-
 The following types are in use. Progress information is governed by the usual rules for progress information,
 it is not produced unless ``--progress`` is specified.