Merge pull request #2591 from enkore/issue/1101.design

docs: file integrity
This commit is contained in:
enkore 2017-06-03 18:17:14 +02:00 committed by GitHub
commit d95551736d
1 changed files with 183 additions and 0 deletions

View File

@ -715,3 +715,186 @@ In case you run into troubles with the locks, you can use the ``borg break-lock`
command after you first have made sure that no |project_name| process is command after you first have made sure that no |project_name| process is
running on any machine that accesses this resource. Be very careful, the cache running on any machine that accesses this resource. Be very careful, the cache
or repository might get damaged if multiple processes use it at the same time. or repository might get damaged if multiple processes use it at the same time.
Checksumming data structures
----------------------------
As detailed in the previous sections, Borg generates and stores various files
containing important meta data, such as the repository index, repository hints,
chunks caches and files cache.
Data corruption in these files can damage the archive data in a repository,
e.g. due to wrong reference counts in the chunks cache. Only some parts of Borg
were designed to handle corrupted data structures, so a corrupted files cache
may cause crashes or write incorrect archives.
Therefore, Borg calculates checksums when writing these files and tests checksums
when reading them. Checksums are generally 64-bit XXH64 hashes.
The canonical xxHash representation is used, i.e. big-endian.
Checksums are stored as hexadecimal ASCII strings.
For compatibility, checksums are not required and absent checksums do not trigger errors.
The mechanisms have been designed to avoid false-positives when various Borg
versions are used alternately on the same repositories.
Checksums are a data safety mechanism. They are not a security mechanism.
.. rubric:: Choice of algorithm
XXH64 has been chosen for its high speed on all platforms, which avoids performance
degradation in CPU-limited parts (e.g. cache synchronization).
Unlike CRC32, it neither requires hardware support (crc32c or CLMUL)
nor vectorized code nor large, cache-unfriendly lookup tables to achieve good performance.
This simplifies deployment of it considerably (cf. src/borg/algorithms/crc32...).
Further, XXH64 is a non-linear hash function and thus has a "more or less" good
chance to detect larger burst errors, unlike linear CRCs where the probability
of detection decreases with error size.
The 64-bit checksum length is considered sufficient for the file sizes typically
checksummed (individual files up to a few GB, usually less).
xxHash was expressly designed for data blocks of these sizes.
Lower layer — file_integrity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To accommodate the different transaction models used for the cache and repository,
there is a lower layer (borg.crypto.file_integrity.IntegrityCheckedFile)
wrapping a file-like object, performing streaming calculation and comparison of checksums.
Checksum errors are signalled by raising an exception (borg.crypto.file_integrity.FileIntegrityError)
at the earliest possible moment.
.. rubric:: Calculating checksums
Before feeding the checksum algorithm any data, the file name (i.e. without any path)
is mixed into the checksum, since the name encodes the context of the data for Borg.
The various indices used by Borg have separate header and main data parts.
IntegrityCheckedFile allows to checksum them independently, which avoids
even reading the data when the header is corrupted. When a part is signalled,
the length of the part name is mixed into the checksum state first (encoded
as an ASCII string via `%10d` printf format), then the name of the part
is mixed in as an UTF-8 string. Lastly, the current position (length)
in the file is mixed in as well.
The checksum state is not reset at part boundaries.
A final checksum is always calculated in the same way as the parts described above,
after seeking to the end of the file. The final checksum cannot prevent code
from processing corrupted data during reading, however, it prevents use of the
corrupted data.
.. rubric:: Serializing checksums
All checksums are compiled into a simple JSON structure called *integrity data*:
.. code-block:: json
{
"algorithm": "XXH64",
"digests": {
"HashHeader": "eab6802590ba39e3",
"final": "e2a7f132fc2e8b24"
}
}
The *algorithm* key notes the used algorithm. When reading, integrity data containing
an unknown algorithm is not inspected further.
The *digests* key contains a mapping of part names to their digests.
Integrity data is generally stored by the upper layers, introduced below. An exception
is the DetachedIntegrityCheckedFile, which automatically writes and reads it from
a ".integrity" file next to the data file.
It is used for archive chunks indexes in chunks.archive.d.
Upper layer
~~~~~~~~~~~
Storage of integrity data depends on the component using it, since they have
different transaction mechanisms, and integrity data needs to be
transacted with the data it is supposed to protect.
.. rubric:: Main cache files: chunks and files cache
The integrity data of the ``chunks`` and ``files`` caches is stored in the
cache ``config``, since all three are transacted together.
The ``[integrity]`` section is used:
.. code-block:: ini
[cache]
version = 1
repository = 3c4...e59
manifest = 10e...21c
timestamp = 2017-06-01T21:31:39.699514
key_type = 2
previous_location = /path/to/repo
[integrity]
manifest = 10e...21c
chunks = {"algorithm": "XXH64", "digests": {"HashHeader": "eab...39e3", "final": "e2a...b24"}}
The manifest ID is duplicated in the integrity section due to the way all Borg
versions handle the config file. Instead of creating a "new" config file from
an internal representation containing only the data understood by Borg,
the config file is read in entirety (using the Python ConfigParser) and modified.
This preserves all sections and values not understood by the Borg version
modifying it.
Thus, if an older versions uses a cache with integrity data, it would preserve
the integrity section and its contents. If a integrity-aware Borg version
would read this cache, it would incorrectly report checksum errors, since
the older version did not update the checksums.
However, by duplicating the manifest ID in the integrity section, it is
easy to tell whether the checksums concern the current state of the cache.
Integrity errors are fatal in these files, terminating the program,
and are not automatically corrected at this time.
.. rubric:: chunks.archive.d
Indices in chunks.archive.d are not transacted and use DetachedIntegrityCheckedFile,
which writes the integrity data to a separate ".integrity" file.
Integrity errors result in deleting the affected index and rebuilding it.
This logs a warning and increases the exit code to WARNING (1).
.. rubric:: Repository index and hints
The repository associates index and hints files with a transaction by including the
transaction ID in the file names. Integrity data is stored in a third file
("integrity.<TRANSACTION_ID>"). Like the hints file, it is msgpacked:
.. code-block:: python
{
b'version': 2,
b'hints': b'{"algorithm": "XXH64", "digests": {"final": "411208db2aa13f1a"}}',
b'index': b'{"algorithm": "XXH64", "digests": {"HashHeader": "846b7315f91b8e48", "final": "cb3e26cadc173e40"}}'
}
The *version* key started at 2, the same version used for the hints. Since Borg has
many versioned file formats, this keeps the number of different versions in use
a bit lower.
The other keys map an auxiliary file, like *index* or *hints* to their integrity data.
Note that the JSON is stored as-is, and not as part of the msgpack structure.
Integrity errors result in deleting the affected file(s) (index/hints) and rebuilding the index,
which is the same action taken when corruption is noticed in other ways (e.g. HashIndex can
detect most corrupted headers, but not data corruption). A warning is logged as well.
The exit code is not influenced, since remote repositories cannot perform that action.
Raising the exit code would be possible for local repositories, but is not implemented.
Unlike the cache design this mechanism can have false positives whenever an older version
*rewrites* the auxiliary files for a transaction created by a newer version,
since that might result in a different index (due to hash-table resizing) or hints file
(hash ordering, or the older version 1 format), while not invalidating the integrity file.
For example, using 1.1 on a repository, noticing corruption or similar issues and then running
``borg-1.0 check --repair``, which rewrites the index and hints, results in this situation.
Borg 1.1 would erroneously report checksum errors in the hints and/or index files and trigger
an automatic rebuild of these files.