Merge pull request #2591 from enkore/issue/1101.design

docs: file integrity
2025-03-04 10:39:50 +00:00 · 2017-06-03 18:17:14 +02:00 · 2017-06-03 18:17:14 +02:00 · d95551736d
commit d95551736d
parent 372eb40089 b8e40fdce6
1 changed files with 183 additions and 0 deletions
--- a/docs/internals/data-structures.rst
+++ b/docs/internals/data-structures.rst
@@ -715,3 +715,186 @@ In case you run into troubles with the locks, you can use the ``borg break-lock``
 command after you first have made sure that no |project_name| process is
 running on any machine that accesses this resource. Be very careful, the cache
 or repository might get damaged if multiple processes use it at the same time.
+
+Checksumming data structures
+----------------------------
+
+As detailed in the previous sections, Borg generates and stores various files
+containing important metadata, such as the repository index, repository hints,
+chunks caches and files cache.
+
+Data corruption in these files can damage the archive data in a repository,
+e.g. due to wrong reference counts in the chunks cache. Only some parts of Borg
+were designed to handle corrupted data structures, so a corrupted files cache
+may cause crashes or lead Borg to write incorrect archives.
+
+Therefore, Borg calculates checksums when writing these files and tests checksums
+when reading them. Checksums are generally 64-bit XXH64 hashes.
+The canonical xxHash representation is used, i.e. big-endian.
+Checksums are stored as hexadecimal ASCII strings.
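+
+As a minimal sketch of this representation, a 64-bit checksum value can be
+serialized big-endian and then hex-encoded. The helper names are hypothetical,
+and the integer is an arbitrary example value, not the result of a real XXH64
+computation:
+
```python
def digest_to_hex(value: int) -> str:
    """Encode a 64-bit checksum as 16 lowercase hex characters."""
    # Big-endian byte order puts the most significant byte first, which
    # matches the canonical representation described in the text.
    return value.to_bytes(8, "big").hex()


def hex_to_digest(text: str) -> int:
    """Inverse of digest_to_hex: parse the hex string back into an integer."""
    return int.from_bytes(bytes.fromhex(text), "big")


example = digest_to_hex(0xEAB6802590BA39E3)
```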
+
+For compatibility, checksums are not required and absent checksums do not trigger errors.
+The mechanisms have been designed to avoid false positives when various Borg
+versions are used alternately on the same repositories.
+
+Checksums are a data safety mechanism. They are not a security mechanism.
+
+.. rubric:: Choice of algorithm
+
+XXH64 has been chosen for its high speed on all platforms, which avoids performance
+degradation in CPU-limited parts (e.g. cache synchronization).
+Unlike CRC32, it neither requires hardware support (crc32c or CLMUL)
+nor vectorized code nor large, cache-unfriendly lookup tables to achieve good performance.
+This simplifies its deployment considerably (cf. src/borg/algorithms/crc32...).
+
+Further, XXH64 is a non-linear hash function and thus has a reasonably good
+chance of detecting larger burst errors, unlike linear CRCs, where the probability
+of detection decreases with error size.
+
+The 64-bit checksum length is considered sufficient for the file sizes typically
+checksummed (individual files up to a few GB, usually less).
+xxHash was expressly designed for data blocks of these sizes.
+
+Lower layer — file_integrity
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To accommodate the different transaction models used for the cache and repository,
+there is a lower layer (borg.crypto.file_integrity.IntegrityCheckedFile)
+wrapping a file-like object, performing streaming calculation and comparison of checksums.
+Checksum errors are signalled by raising an exception (borg.crypto.file_integrity.FileIntegrityError)
+at the earliest possible moment.
+
+.. rubric:: Calculating checksums
+
+Before feeding the checksum algorithm any data, the file name (i.e. without any path)
+is mixed into the checksum, since the name encodes the context of the data for Borg.
+
+The various indices used by Borg have separate header and main data parts.
+IntegrityCheckedFile allows checksumming them independently, which avoids
+even reading the data when the header is corrupted. When a part is signalled,
+the length of the part name is mixed into the checksum state first (encoded
+as an ASCII string via the ``%10d`` printf format), then the name of the part
+is mixed in as a UTF-8 string. Lastly, the current position (length)
+in the file is mixed in as well.
+
+The checksum state is not reset at part boundaries.
+
+A final checksum is always calculated in the same way as the parts described above,
+after seeking to the end of the file. The final checksum cannot prevent code
+from processing corrupted data during reading; it does, however, prevent use
+of the corrupted data.
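+
+The part scheme described above can be sketched roughly as follows. This is
+not Borg's implementation: ``PartHasher`` and ``digests_for`` are hypothetical
+names, ``hashlib.blake2b`` with an 8-byte digest stands in for XXH64 (which is
+not in the Python standard library), and encoding the position as a decimal
+ASCII string is an assumption, since the text does not specify it.
+
```python
import hashlib
import os


class PartHasher:
    """Toy streaming hasher mimicking the described part scheme.

    Assumption: blake2b (8-byte digest) is a stand-in for XXH64, so the
    digests produced here differ from real Borg digests.
    """

    def __init__(self, path):
        self._state = hashlib.blake2b(digest_size=8)
        self._position = 0
        self.digests = {}
        # Mix in the bare file name (no directory part) before any data.
        self._state.update(os.path.basename(path).encode("utf-8"))

    def feed(self, data):
        self._state.update(data)
        self._position += len(data)

    def mark_part(self, name):
        # Length of the part name, formatted with the %10d printf format.
        self._state.update(("%10d" % len(name)).encode("ascii"))
        # The part name itself, as UTF-8.
        self._state.update(name.encode("utf-8"))
        # Assumption: the current position is mixed in as decimal ASCII.
        self._state.update(str(self._position).encode("ascii"))
        # hexdigest() does not reset the state, so hashing simply continues
        # across the part boundary.
        self.digests[name] = self._state.hexdigest()


def digests_for(path, header, body):
    """Hash a header part and a body part, then record the final digest."""
    hasher = PartHasher(path)
    hasher.feed(header)
    hasher.mark_part("HashHeader")
    hasher.feed(body)
    hasher.mark_part("final")
    return hasher.digests
```
+
+Because only the base name is mixed in, moving the file to another directory
+leaves the digests unchanged, while renaming it changes every digest.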
+
+.. rubric:: Serializing checksums
+
+All checksums are compiled into a simple JSON structure called *integrity data*:
+
+.. code-block:: json
+
+    {
+        "algorithm": "XXH64",
+        "digests": {
+            "HashHeader": "eab6802590ba39e3",
+            "final": "e2a7f132fc2e8b24"
+        }
+    }
+
+The *algorithm* key names the algorithm used. When reading, integrity data containing
+an unknown algorithm is not inspected further.
+
+The *digests* key contains a mapping of part names to their digests.
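+
+A minimal sketch of writing and reading this structure, assuming hypothetical
+helper names and a hypothetical ``SUPPORTED_ALGORITHMS`` set (neither is taken
+from Borg's code):
+
```python
import json

# Hypothetical: the set of algorithms this reader understands.
SUPPORTED_ALGORITHMS = {"XXH64"}


def dump_integrity_data(digests, algorithm="XXH64"):
    """Serialize a mapping of part names to hex digests as integrity data."""
    return json.dumps({"algorithm": algorithm, "digests": digests})


def load_digests(text):
    """Return the part-name-to-digest mapping, or None for unknown algorithms."""
    data = json.loads(text)
    if data.get("algorithm") not in SUPPORTED_ALGORITHMS:
        # Unknown algorithm: do not inspect the integrity data any further.
        return None
    return data["digests"]
```
+
+Returning ``None`` for an unknown algorithm lets the caller treat the file as
+if it had no checksums at all, which matches the rule that absent checksums
+do not trigger errors.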
+
+Integrity data is generally stored by the upper layers, introduced below. An exception
+is the DetachedIntegrityCheckedFile, which automatically writes and reads it from
+a ".integrity" file next to the data file.
+It is used for archive chunks indexes in chunks.archive.d.
+
+Upper layer
+~~~~~~~~~~~
+
+Storage of integrity data depends on the component using it, since they have
+different transaction mechanisms, and integrity data needs to be
+transacted with the data it is supposed to protect.
+
+.. rubric:: Main cache files: chunks and files cache
+
+The integrity data of the ``chunks`` and ``files`` caches is stored in the
+cache ``config``, since all three are transacted together.
+
+The ``[integrity]`` section is used:
+
+.. code-block:: ini
+
+    [cache]
+    version = 1
+    repository = 3c4...e59
+    manifest = 10e...21c
+    timestamp = 2017-06-01T21:31:39.699514
+    key_type = 2
+    previous_location = /path/to/repo
+
+    [integrity]
+    manifest = 10e...21c
+    chunks = {"algorithm": "XXH64", "digests": {"HashHeader": "eab...39e3", "final": "e2a...b24"}}
+
+The manifest ID is duplicated in the integrity section due to the way all Borg
+versions handle the config file. Instead of creating a "new" config file from
+an internal representation containing only the data understood by Borg,
+the config file is read in its entirety (using the Python ConfigParser) and modified.
+This preserves all sections and values not understood by the Borg version
+modifying it.
+
+Thus, if an older version uses a cache with integrity data, it preserves
+the integrity section and its contents. If an integrity-aware Borg version
+then read this cache, it would incorrectly report checksum errors, since
+the older version did not update the checksums.
+
+However, by duplicating the manifest ID in the integrity section, it is
+easy to tell whether the checksums concern the current state of the cache.
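+
+The staleness check could look roughly like the following sketch. The function
+name ``usable_integrity_data`` is hypothetical, and constructing the parser
+with ``interpolation=None`` is an assumption made here so that the JSON values
+are read back verbatim:
+
```python
import configparser
import json


def usable_integrity_data(config_text, name):
    """Return parsed integrity data for `name`, or None if absent or stale."""
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    if not config.has_section("integrity"):
        # Cache written without any integrity data: nothing to verify.
        return None
    current = config.get("cache", "manifest")
    recorded = config.get("integrity", "manifest", fallback=None)
    if recorded != current:
        # Another version updated [cache] but preserved [integrity] as-is,
        # so the stored checksums describe an earlier cache state.
        return None
    raw = config.get("integrity", name, fallback=None)
    return json.loads(raw) if raw is not None else None
```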
+
+Integrity errors are fatal in these files, terminating the program,
+and are not automatically corrected at this time.
+
+.. rubric:: chunks.archive.d
+
+Indices in chunks.archive.d are not transacted and use DetachedIntegrityCheckedFile,
+which writes the integrity data to a separate ".integrity" file.
+
+Integrity errors result in deleting the affected index and rebuilding it.
+This logs a warning and increases the exit code to WARNING (1).
+
+.. rubric:: Repository index and hints
+
+The repository associates index and hints files with a transaction by including the
+transaction ID in the file names. Integrity data is stored in a third file
+("integrity.<TRANSACTION_ID>"). Like the hints file, it is msgpacked:
+
+.. code-block:: python
+
+    {
+        b'version': 2,
+        b'hints': b'{"algorithm": "XXH64", "digests": {"final": "411208db2aa13f1a"}}',
+        b'index': b'{"algorithm": "XXH64", "digests": {"HashHeader": "846b7315f91b8e48", "final": "cb3e26cadc173e40"}}'
+    }
+
+The *version* key started at 2, the same version used for the hints. Since Borg has
+many versioned file formats, this keeps the number of different versions in use
+a bit lower.
+
+Each of the other keys maps an auxiliary file, such as *index* or *hints*, to its integrity data.
+Note that the JSON is stored as-is, and not as part of the msgpack structure.
+
+Integrity errors result in deleting the affected file(s) (index/hints) and rebuilding the index,
+which is the same action taken when corruption is noticed in other ways (e.g. HashIndex can
+detect most corrupted headers, but not data corruption). A warning is logged as well.
+The exit code is not influenced, since remote repositories cannot perform that action.
+Raising the exit code would be possible for local repositories, but is not implemented.
+
+Unlike the cache design, this mechanism can have false positives whenever an older version
+*rewrites* the auxiliary files for a transaction created by a newer version,
+since that might result in a different index (due to hash-table resizing) or hints file
+(hash ordering, or the older version 1 format), while not invalidating the integrity file.
+
+For example, using Borg 1.1 on a repository, noticing corruption or similar issues and then running
+``borg-1.0 check --repair``, which rewrites the index and hints, results in this situation.
+Borg 1.1 would erroneously report checksum errors in the hints and/or index files and trigger
+an automatic rebuild of these files.