From 45ee62e5eaf4219f571dfb6ebe14d9fb01309278 Mon Sep 17 00:00:00 2001
From: Marian Beermann <public@enkore.de>
Date: Sat, 3 Jun 2017 00:43:39 +0200
Subject: [PATCH 1/2] docs: file integrity

---
 docs/internals/data-structures.rst | 167 +++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
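
Note (illustration only, not part of the diff): a minimal sketch of how a reader
of the cache ``config`` could decide whether the ``[integrity]`` section described
in this patch is usable. The names ``KNOWN_ALGORITHMS`` and ``load_cache_integrity``
are hypothetical and are not Borg's API; returning ``None`` for "no usable integrity
data" is likewise an assumption of this sketch.

```python
import configparser
import json

# Hypothetical names for illustration; none of this is Borg's actual API.
KNOWN_ALGORITHMS = {"XXH64"}


def load_cache_integrity(config_text, name):
    """Return the part -> digest mapping for cache file `name`, or None.

    None means "no usable integrity data": the section or key is missing,
    the manifest ID does not match, or the algorithm is unknown.
    """
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    if not config.has_section("integrity"):
        return None
    # A differing manifest ID suggests that a version unaware of checksums
    # modified the cache after the integrity data was written.
    current = config.get("cache", "manifest", fallback=None)
    recorded = config.get("integrity", "manifest", fallback=None)
    if recorded != current:
        return None
    raw = config.get("integrity", name, fallback=None)
    if raw is None:
        return None
    data = json.loads(raw)
    # Integrity data with an unknown algorithm is not inspected further.
    if data.get("algorithm") not in KNOWN_ALGORITHMS:
        return None
    return data["digests"]


EXAMPLE_CONFIG = "\n".join([
    "[cache]",
    "manifest = 10e...21c",
    "[integrity]",
    "manifest = 10e...21c",
    'chunks = {"algorithm": "XXH64", "digests": '
    '{"HashHeader": "eab6802590ba39e3", "final": "e2a7f132fc2e8b24"}}',
])

digests = load_cache_integrity(EXAMPLE_CONFIG, "chunks")
```

With the example config, ``digests`` holds the ``HashHeader`` and ``final`` entries;
changing only the ``[cache]`` manifest value makes the function return ``None``.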

diff --git a/docs/internals/data-structures.rst b/docs/internals/data-structures.rst
index 2a688f8b3..4dd296c44 100644
--- a/docs/internals/data-structures.rst
+++ b/docs/internals/data-structures.rst
@@ -715,3 +715,170 @@ In case you run into troubles with the locks, you can use the ``borg break-lock`
 command after you first have made sure that no |project_name| process is
 running on any machine that accesses this resource. Be very careful, the cache
 or repository might get damaged if multiple processes use it at the same time.
+
+Checksumming data structures
+----------------------------
+
+As detailed in the previous sections, Borg generates and stores various files
+containing important metadata, such as the repository index, repository hints,
+chunks caches and files cache.
+
+Data corruption in these files can damage the archive data in a repository,
+e.g. due to wrong reference counts in the chunks cache. Only some parts of Borg
+were designed to handle corrupted data structures, so a corrupted files cache
+may cause crashes or write incorrect archives.
+
+Therefore, Borg calculates checksums when writing these files and tests checksums
+when reading them. Checksums are generally 64-bit XXH64 checksums.
+XXH64 has been chosen for its high speed on all platforms, which avoids performance
+degradation in CPU-limited parts (e.g. cache synchronization). Unlike CRC32,
+it does neither require hardware support (crc32c or CLMUL) nor vectorized code
+nor large, cache-unfriendly lookup tables to achieve good performance.
+This simplifies deployment of it considerably (cf. src/borg/algorithms/crc32...).
+
+Further, XXH64 is a non-linear hash function and thus has a "more or less" good
+chance to detect larger burst errors, unlike linear CRCs where the probability
+of detection decreases with error size.
+
+The 64-bit checksum length is considered sufficient for the file sizes typically
+checksummed (individual files up to a few GB, usually less).
+
+The canonical xxHash representation is used, i.e. big-endian.
+Checksums are generally stored as hexadecimal ASCII strings.
+
+Lower layer — file_integrity
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To accommodate the different transaction models used for the cache and repository,
+there is a lower layer (borg.crypto.file_integrity.IntegrityCheckedFile) which
+wraps a file-like object and performs streaming calculation and comparison of checksums.
+Checksum errors are signalled by raising an exception (borg.crypto.file_integrity.FileIntegrityError)
+at the earliest possible moment.
+
+.. rubric:: Calculating checksums
+
+The various indices used by Borg have separate header and main data parts.
+IntegrityCheckedFile allows to checksum them independently, which avoids
+even reading the data when the header is corrupted. When a part is signalled,
+the length of the pathname is mixed into the checksum state first (encoded
+as an ASCII string via `%10d` printf format), then the name of the part
+is mixed in as an UTF-8 string. Lastly, the current position (length)
+in the file is mixed in as well.
+
+The checksum state is not reset at part boundaries.
+
+A final checksum is always calculated from the entire state.
+
+.. rubric:: Serializing checksums
+
+All checksums are compiled into a simple JSON structure called *integrity data*:
+
+.. code-block:: json
+
+    {
+        "algorithm": "XXH64",
+        "digests": {
+            "HashHeader": "eab6802590ba39e3",
+            "final": "e2a7f132fc2e8b24"
+        }
+    }
+
+The *algorithm* key names the algorithm used. When reading, integrity data containing
+an unknown algorithm is not inspected further.
+
+The *digests* key contains a mapping of part names to their digests.
+
+Integrity data is generally stored by the upper layers, introduced below. An exception
+is the DetachedIntegrityCheckedFile, which automatically writes and reads it from
+a ".integrity" file next to the data file. It is used for archive chunks in chunks.archive.d.
+
+Upper layer
+~~~~~~~~~~~
+
+Storage of integrity data depends on the component using it, since they have
+different transaction mechanisms, and integrity data needs to be
+transacted with the data it is supposed to protect.
+
+.. rubric:: Main cache files: chunks and files cache
+
+The integrity data of the ``chunks`` and ``files`` caches is stored in the
+cache ``config``, since all three are transacted together.
+
+The ``[integrity]`` section is used:
+
+.. code-block:: ini
+
+    [cache]
+    version = 1
+    repository = 3c4...e59
+    manifest = 10e...21c
+    timestamp = 2017-06-01T21:31:39.699514
+    key_type = 2
+    previous_location = /path/to/repo
+
+    [integrity]
+    manifest = 10e...21c
+    chunks = {"algorithm": "XXH64", "digests": {"HashHeader": "eab...39e3", "final": "e2a...b24"}}
+
+The manifest ID is duplicated in the integrity section due to the way all Borg
+versions handle the config file. Instead of creating a "new" config file from
+an internal representation containing only the data understood by Borg,
+the config file is read in entirety (using the Python ConfigParser) and modified.
+This preserves all sections and values not understood by the Borg version
+modifying it.
+
+Thus, if an older version uses a cache with integrity data, it would preserve
+the integrity section and its contents. If an integrity-aware Borg version
+would read this cache, it would incorrectly report checksum errors, since
+the older version did not update the checksums.
+
+However, by duplicating the manifest ID in the integrity section, it is
+easy to tell whether the checksums concern the current state of the cache.
+
+Integrity errors are fatal in these files, terminating the program,
+and are not automatically corrected at this time.
+
+.. rubric:: chunks.archive.d
+
+Indices in chunks.archive.d are not transacted and use DetachedIntegrityCheckedFile, which
+writes the integrity data to a separate ".integrity" file.
+
+Integrity errors result in deleting the affected index and rebuilding it.
+This logs a warning and increases the exit code to WARNING (1).
+
+.. rubric:: Repository index and hints
+
+The repository associates index and hints files with a transaction by including the
+transaction ID in the file names. Integrity data is stored in a third file
+("integrity.<TRANSACTION_ID>"). Like the hints file, it is msgpacked:
+
+.. code-block:: python
+
+    {
+        b'version': 2,
+        b'hints': b'{"algorithm": "XXH64", "digests": {"final": "411208db2aa13f1a"}}',
+        b'index': b'{"algorithm": "XXH64", "digests": {"HashHeader": "846b7315f91b8e48", "final": "cb3e26cadc173e40"}}'
+    }
+
+The *version* key started at 2, the same version used for the hints. Since Borg has
+many versioned file formats, this keeps the number of different versions in use
+a bit lower.
+
+The other keys map an auxiliary file, like *index* or *hints*, to its integrity data.
+Note that the JSON is stored as-is, and not as part of the msgpack structure.
+
+Integrity errors result in deleting the affected file(s) (index/hints) and rebuilding the index,
+which is the same action taken when corruption is noticed in other ways (e.g. HashIndex can
+detect most corrupted headers, but not data corruption). A warning is logged as well.
+The exit code is not influenced, since remote repositories cannot perform that action.
+Raising the exit code would be possible for local repositories, but is not implemented.
+
+Unlike the cache design, this mechanism can have false positives whenever an older version
+*rewrites* the auxiliary files for a transaction created by a newer version,
+since that might result in a different index (due to hash-table resizing) or hints file
+(hash ordering, or the older version 1 format), while not invalidating the integrity file.
+
+For example, using 1.1 on a repository, noticing corruption or similar issues and then running
+``borg-1.0 check --repair``, which rewrites the index and hints, results in this situation.
+Borg 1.1 would erroneously report checksum errors in the hints and/or index files and trigger
+an automatic rebuild of these files.

From b8e40fdce609c012b3197f6b6644a1498019282f Mon Sep 17 00:00:00 2001
From: Marian Beermann <public@enkore.de>
Date: Sat, 3 Jun 2017 13:04:05 +0200
Subject: [PATCH 2/2] editing

---
 docs/internals/data-structures.rst | 44 ++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 14 deletions(-)
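
Note (illustration only, not part of the diff): a minimal sketch of the part
checksumming as reworded in this patch. ``hashlib.blake2b`` with an 8-byte digest
is used purely as a stand-in for XXH64, which is not in the Python standard
library. The helper names, the way the file name is framed, measuring the name
length in bytes, and encoding the position via ``%10d`` are all assumptions of
this sketch, not statements about Borg's implementation.

```python
import hashlib


def _mix_name(state, name):
    # Length of the name as a '%10d' ASCII string, then the name as UTF-8.
    # Measuring the length in encoded bytes is an assumption of this sketch.
    encoded = name.encode("utf-8")
    state.update(("%10d" % len(encoded)).encode("ascii"))
    state.update(encoded)


def part_digests(file_name, header, body):
    """Return digests for a header part and the final checksum."""
    # 64-bit stand-in for XXH64 (assumption: any streaming hash works here).
    state = hashlib.blake2b(digest_size=8)
    # The bare file name is mixed in before any data; reusing the part-name
    # framing for it is an assumption.
    _mix_name(state, file_name)
    digests = {}
    position = 0
    for part, data in (("HashHeader", header), ("final", body)):
        state.update(data)
        position += len(data)
        # Part marker: name length, name, then the current file position.
        _mix_name(state, part)
        state.update(("%10d" % position).encode("ascii"))
        # The state is not reset, so later digests also cover earlier parts.
        digests[part] = state.hexdigest()
    return digests


good = part_digests("index.1", b"HDR", b"payload")
corrupt_body = part_digests("index.1", b"HDR", b"pAyload")
renamed = part_digests("hints.1", b"HDR", b"payload")
```

In this sketch a corrupted body leaves the ``HashHeader`` digest unchanged and
alters only ``final``, while a different file name changes both digests.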

diff --git a/docs/internals/data-structures.rst b/docs/internals/data-structures.rst
index 4dd296c44..14c870d36 100644
--- a/docs/internals/data-structures.rst
+++ b/docs/internals/data-structures.rst
@@ -729,11 +729,22 @@ were designed to handle corrupted data structures, so a corrupted files cache
 may cause crashes or write incorrect archives.
 
 Therefore, Borg calculates checksums when writing these files and tests checksums
-when reading them. Checksums are generally 64-bit XXH64 checksums.
+when reading them. Checksums are generally 64-bit XXH64 hashes.
+The canonical xxHash representation is used, i.e. big-endian.
+Checksums are stored as hexadecimal ASCII strings.
+
+For compatibility, checksums are not required and absent checksums do not trigger errors.
+The mechanisms have been designed to avoid false positives when various Borg
+versions are used alternately on the same repositories.
+
+Checksums are a data safety mechanism. They are not a security mechanism.
+
+.. rubric:: Choice of algorithm
+
 XXH64 has been chosen for its high speed on all platforms, which avoids performance
-degradation in CPU-limited parts (e.g. cache synchronization). Unlike CRC32,
-it does neither require hardware support (crc32c or CLMUL) nor vectorized code
-nor large, cache-unfriendly lookup tables to achieve good performance.
+degradation in CPU-limited parts (e.g. cache synchronization).
+Unlike CRC32, it neither requires hardware support (crc32c or CLMUL)
+nor vectorized code nor large, cache-unfriendly lookup tables to achieve good performance.
 This simplifies deployment of it considerably (cf. src/borg/algorithms/crc32...).
 
 Further, XXH64 is a non-linear hash function and thus has a "more or less" good
@@ -742,32 +753,36 @@ of detection decreases with error size.
 
 The 64-bit checksum length is considered sufficient for the file sizes typically
 checksummed (individual files up to a few GB, usually less).
-
-The canonical xxHash representation is used, i.e. big-endian.
-Checksums are generally stored as hexadecimal ASCII strings.
+xxHash was expressly designed for data blocks of these sizes.
 
 Lower layer — file_integrity
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 To accommodate the different transaction models used for the cache and repository,
-there is a lower layer (borg.crypto.file_integrity.IntegrityCheckedFile) which
-wraps a file-like object and performs streaming calculation and comparison of checksums.
+there is a lower layer (borg.crypto.file_integrity.IntegrityCheckedFile)
+wrapping a file-like object, performing streaming calculation and comparison of checksums.
 Checksum errors are signalled by raising an exception (borg.crypto.file_integrity.FileIntegrityError)
 at the earliest possible moment.
 
 .. rubric:: Calculating checksums
 
+Before feeding the checksum algorithm any data, the file name (i.e. without any path)
+is mixed into the checksum, since the name encodes the context of the data for Borg.
+
 The various indices used by Borg have separate header and main data parts.
 IntegrityCheckedFile allows to checksum them independently, which avoids
 even reading the data when the header is corrupted. When a part is signalled,
-the length of the pathname is mixed into the checksum state first (encoded
+the length of the part name is mixed into the checksum state first (encoded
 as an ASCII string via `%10d` printf format), then the name of the part
 is mixed in as an UTF-8 string. Lastly, the current position (length)
 in the file is mixed in as well.
 
 The checksum state is not reset at part boundaries.
 
-A final checksum is always calculated from the entire state.
+A final checksum is always calculated in the same way as the parts described above,
+after seeking to the end of the file. The final checksum cannot prevent code
+from processing corrupted data during reading; however, it prevents use of the
+corrupted data.
 
 .. rubric:: Serializing checksums
 
@@ -790,7 +805,8 @@ The *digests* key contains a mapping of part names to their digests.
 
 Integrity data is generally stored by the upper layers, introduced below. An exception
 is the DetachedIntegrityCheckedFile, which automatically writes and reads it from
-a ".integrity" file next to the data file. It is used for archive chunks in chunks.archive.d.
+a ".integrity" file next to the data file.
+It is used for archive chunk indices in chunks.archive.d.
 
 Upper layer
 ~~~~~~~~~~~
@@ -840,8 +856,8 @@ and are not automatically corrected at this time.
 
 .. rubric:: chunks.archive.d
 
-Indices in chunks.archive.d are not transacted and use DetachedIntegrityCheckedFile, which
-writes the integrity data to a separate ".integrity" file.
+Indices in chunks.archive.d are not transacted and use DetachedIntegrityCheckedFile,
+which writes the integrity data to a separate ".integrity" file.
 
 Integrity errors result in deleting the affected index and rebuilding it.
 This logs a warning and increases the exit code to WARNING (1).