This commit is contained in:
Marian Beermann 2017-06-03 13:04:05 +02:00
parent 45ee62e5ea
commit b8e40fdce6
1 changed files with 30 additions and 14 deletions

View File

@ -729,11 +729,22 @@ were designed to handle corrupted data structures, so a corrupted files cache
may cause crashes or write incorrect archives.
Therefore, Borg calculates checksums when writing these files and tests checksums
when reading them. Checksums are generally 64-bit XXH64 checksums.
when reading them. Checksums are generally 64-bit XXH64 hashes.
The canonical xxHash representation is used, i.e. big-endian.
Checksums are stored as hexadecimal ASCII strings.
For compatibility, checksums are not required and absent checksums do not trigger errors.
The mechanisms have been designed to avoid false-positives when various Borg
versions are used alternately on the same repositories.
Checksums are a data safety mechanism. They are not a security mechanism.
.. rubric:: Choice of algorithm
XXH64 has been chosen for its high speed on all platforms, which avoids performance
degradation in CPU-limited parts (e.g. cache synchronization). Unlike CRC32,
it does neither require hardware support (crc32c or CLMUL) nor vectorized code
nor large, cache-unfriendly lookup tables to achieve good performance.
degradation in CPU-limited parts (e.g. cache synchronization).
Unlike CRC32, it neither requires hardware support (crc32c or CLMUL)
nor vectorized code nor large, cache-unfriendly lookup tables to achieve good performance.
This simplifies deployment of it considerably (cf. src/borg/algorithms/crc32...).
Further, XXH64 is a non-linear hash function and thus has a "more or less" good
@ -742,32 +753,36 @@ of detection decreases with error size.
The 64-bit checksum length is considered sufficient for the file sizes typically
checksummed (individual files up to a few GB, usually less).
The canonical xxHash representation is used, i.e. big-endian.
Checksums are generally stored as hexadecimal ASCII strings.
xxHash was expressly designed for data blocks of these sizes.
Lower layer — file_integrity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To accommodate the different transaction models used for the cache and repository,
there is a lower layer (borg.crypto.file_integrity.IntegrityCheckedFile) which
wraps a file-like object and performs streaming calculation and comparison of checksums.
there is a lower layer (borg.crypto.file_integrity.IntegrityCheckedFile)
wrapping a file-like object, performing streaming calculation and comparison of checksums.
Checksum errors are signalled by raising an exception (borg.crypto.file_integrity.FileIntegrityError)
at the earliest possible moment.
.. rubric:: Calculating checksums
Before feeding the checksum algorithm any data, the file name (i.e. without any path)
is mixed into the checksum, since the name encodes the context of the data for Borg.
The various indices used by Borg have separate header and main data parts.
IntegrityCheckedFile allows to checksum them independently, which avoids
even reading the data when the header is corrupted. When a part is signalled,
the length of the pathname is mixed into the checksum state first (encoded
the length of the part name is mixed into the checksum state first (encoded
as an ASCII string via `%10d` printf format), then the name of the part
is mixed in as an UTF-8 string. Lastly, the current position (length)
in the file is mixed in as well.
The checksum state is not reset at part boundaries.
A final checksum is always calculated from the entire state.
A final checksum is always calculated in the same way as the parts described above,
after seeking to the end of the file. The final checksum cannot prevent code
from processing corrupted data during reading, however, it prevents use of the
corrupted data.
.. rubric:: Serializing checksums
@ -790,7 +805,8 @@ The *digests* key contains a mapping of part names to their digests.
Integrity data is generally stored by the upper layers, introduced below. An exception
is the DetachedIntegrityCheckedFile, which automatically writes and reads it from
a ".integrity" file next to the data file. It is used for archive chunks in chunks.archive.d.
a ".integrity" file next to the data file.
It is used for archive chunks indexes in chunks.archive.d.
Upper layer
~~~~~~~~~~~
@ -840,8 +856,8 @@ and are not automatically corrected at this time.
.. rubric:: chunks.archive.d
Indices in chunks.archive.d are not transacted and use DetachedIntegrityCheckedFile, which
writes the integrity data to a separate ".integrity" file.
Indices in chunks.archive.d are not transacted and use DetachedIntegrityCheckedFile,
which writes the integrity data to a separate ".integrity" file.
Integrity errors result in deleting the affected index and rebuilding it.
This logs a warning and increases the exit code to WARNING (1).