Merge pull request #2591 from enkore/issue/1101.design

docs: file integrity
2017-06-03 18:17:14 +02:00 · 2017-06-03 18:17:14 +02:00 · d95551736d
parent 372eb40089 b8e40fdce6
commit d95551736d
1 changed files with 183 additions and 0 deletions
--- a/docs/internals/data-structures.rst
+++ b/docs/internals/data-structures.rst
@ -715,3 +715,186 @@ In case you run into troubles with the locks, you can use the ``borg break-lock`
 command after you first have made sure that no |project_name| process is
 running on any machine that accesses this resource. Be very careful, the cache
 or repository might get damaged if multiple processes use it at the same time.
 Checksumming data structures
 ----------------------------
 As detailed in the previous sections, Borg generates and stores various files
 containing important meta data, such as the repository index, repository hints,
 chunks caches and files cache.
 Data corruption in these files can damage the archive data in a repository,
 e.g. due to wrong reference counts in the chunks cache. Only some parts of Borg
 were designed to handle corrupted data structures, so a corrupted files cache
 may cause crashes or write incorrect archives.
 Therefore, Borg calculates checksums when writing these files and tests checksums
 when reading them. Checksums are generally 64-bit XXH64 hashes.
 The canonical xxHash representation is used, i.e. big-endian.
 Checksums are stored as hexadecimal ASCII strings.
 For compatibility, checksums are not required and absent checksums do not trigger errors.
 The mechanisms have been designed to avoid false-positives when various Borg
 versions are used alternately on the same repositories.
 Checksums are a data safety mechanism. They are not a security mechanism.
 .. rubric:: Choice of algorithm
 XXH64 has been chosen for its high speed on all platforms, which avoids performance
 degradation in CPU-limited parts (e.g. cache synchronization).
 Unlike CRC32, it neither requires hardware support (crc32c or CLMUL)
 nor vectorized code nor large, cache-unfriendly lookup tables to achieve good performance.
 This simplifies deployment of it considerably (cf. src/borg/algorithms/crc32...).
 Further, XXH64 is a non-linear hash function and thus has a "more or less" good
 chance to detect larger burst errors, unlike linear CRCs where the probability
 of detection decreases with error size.
 The 64-bit checksum length is considered sufficient for the file sizes typically
 checksummed (individual files up to a few GB, usually less).
 xxHash was expressly designed for data blocks of these sizes.
 Lower layer — file_integrity
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 To accommodate the different transaction models used for the cache and repository,
 there is a lower layer (borg.crypto.file_integrity.IntegrityCheckedFile)
 wrapping a file-like object, performing streaming calculation and comparison of checksums.
 Checksum errors are signalled by raising an exception (borg.crypto.file_integrity.FileIntegrityError)
 at the earliest possible moment.
 .. rubric:: Calculating checksums
 Before feeding the checksum algorithm any data, the file name (i.e. without any path)
 is mixed into the checksum, since the name encodes the context of the data for Borg.
 The various indices used by Borg have separate header and main data parts.
 IntegrityCheckedFile allows to checksum them independently, which avoids
 even reading the data when the header is corrupted. When a part is signalled,
 the length of the part name is mixed into the checksum state first (encoded
 as an ASCII string via `%10d` printf format), then the name of the part
 is mixed in as an UTF-8 string. Lastly, the current position (length)
 in the file is mixed in as well.
 The checksum state is not reset at part boundaries.
 A final checksum is always calculated in the same way as the parts described above,
 after seeking to the end of the file. The final checksum cannot prevent code
 from processing corrupted data during reading, however, it prevents use of the
 corrupted data.
 .. rubric:: Serializing checksums
 All checksums are compiled into a simple JSON structure called *integrity data*:
 .. code-block:: json
    {
        "algorithm": "XXH64",
        "digests": {
            "HashHeader": "eab6802590ba39e3",
            "final": "e2a7f132fc2e8b24"
        }
    }
 The *algorithm* key notes the used algorithm. When reading, integrity data containing
 an unknown algorithm is not inspected further.
 The *digests* key contains a mapping of part names to their digests.
 Integrity data is generally stored by the upper layers, introduced below. An exception
 is the DetachedIntegrityCheckedFile, which automatically writes and reads it from
 a ".integrity" file next to the data file.
 It is used for archive chunks indexes in chunks.archive.d.
 Upper layer
 ~~~~~~~~~~~
 Storage of integrity data depends on the component using it, since they have
 different transaction mechanisms, and integrity data needs to be
 transacted with the data it is supposed to protect.
 .. rubric:: Main cache files: chunks and files cache
 The integrity data of the ``chunks`` and ``files`` caches is stored in the
 cache ``config``, since all three are transacted together.
 The ``[integrity]`` section is used:
 .. code-block:: ini
    [cache]
    version = 1
    repository = 3c4...e59
    manifest = 10e...21c
    timestamp = 2017-06-01T21:31:39.699514
    key_type = 2
    previous_location = /path/to/repo
    [integrity]
    manifest = 10e...21c
    chunks = {"algorithm": "XXH64", "digests": {"HashHeader": "eab...39e3", "final": "e2a...b24"}}
 The manifest ID is duplicated in the integrity section due to the way all Borg
 versions handle the config file. Instead of creating a "new" config file from
 an internal representation containing only the data understood by Borg,
 the config file is read in entirety (using the Python ConfigParser) and modified.
 This preserves all sections and values not understood by the Borg version
 modifying it.
 Thus, if an older versions uses a cache with integrity data, it would preserve
 the integrity section and its contents. If a integrity-aware Borg version
 would read this cache, it would incorrectly report checksum errors, since
 the older version did not update the checksums.
 However, by duplicating the manifest ID in the integrity section, it is
 easy to tell whether the checksums concern the current state of the cache.
 Integrity errors are fatal in these files, terminating the program,
 and are not automatically corrected at this time.
 .. rubric:: chunks.archive.d
 Indices in chunks.archive.d are not transacted and use DetachedIntegrityCheckedFile,
 which writes the integrity data to a separate ".integrity" file.
 Integrity errors result in deleting the affected index and rebuilding it.
 This logs a warning and increases the exit code to WARNING (1).
 .. rubric:: Repository index and hints
 The repository associates index and hints files with a transaction by including the
 transaction ID in the file names. Integrity data is stored in a third file
 ("integrity.<TRANSACTION_ID>"). Like the hints file, it is msgpacked:
 .. code-block:: python
    {
        b'version': 2,
        b'hints': b'{"algorithm": "XXH64", "digests": {"final": "411208db2aa13f1a"}}',
        b'index': b'{"algorithm": "XXH64", "digests": {"HashHeader": "846b7315f91b8e48", "final": "cb3e26cadc173e40"}}'
    }
 The *version* key started at 2, the same version used for the hints. Since Borg has
 many versioned file formats, this keeps the number of different versions in use
 a bit lower.
 The other keys map an auxiliary file, like *index* or *hints* to their integrity data.
 Note that the JSON is stored as-is, and not as part of the msgpack structure.
 Integrity errors result in deleting the affected file(s) (index/hints) and rebuilding the index,
 which is the same action taken when corruption is noticed in other ways (e.g. HashIndex can
 detect most corrupted headers, but not data corruption). A warning is logged as well.
 The exit code is not influenced, since remote repositories cannot perform that action.
 Raising the exit code would be possible for local repositories, but is not implemented.
 Unlike the cache design this mechanism can have false positives whenever an older version
 *rewrites* the auxiliary files for a transaction created by a newer version,
 since that might result in a different index (due to hash-table resizing) or hints file
 (hash ordering, or the older version 1 format), while not invalidating the integrity file.
 For example, using 1.1 on a repository, noticing corruption or similar issues and then running
 ``borg-1.0 check --repair``, which rewrites the index and hints, results in this situation.
 Borg 1.1 would erroneously report checksum errors in the hints and/or index files and trigger
 an automatic rebuild of these files.