diff --git a/README.rst b/README.rst
index 6655e77a6..2dc2a798d 100644
--- a/README.rst
+++ b/README.rst
@@ -27,6 +27,10 @@ Main features
   of bytes stored: each file is split into a number of variable length chunks
   and only chunks that have never been seen before are added to the repository.
 
+  A chunk is considered duplicate if its id_hash value is identical.
+  A cryptographically strong hash or MAC function is used as id_hash, e.g.
+  (hmac-)sha256.
+
   To deduplicate, all the chunks in the same repository are considered, no
   matter whether they come from different machines, from previous backups,
   from the same backup or even from the same single file.
diff --git a/docs/internals.rst b/docs/internals.rst
index 138761b2d..a5e00eff3 100644
--- a/docs/internals.rst
+++ b/docs/internals.rst
@@ -96,6 +96,8 @@ The id_hash function is:
 
 * sha256 (no encryption keys available)
 * hmac-sha256 (encryption keys available)
+As the id / key is used for deduplication, id_hash must be a cryptographically
+strong hash or MAC.
 
 Segments and archives
 ---------------------
@@ -233,6 +235,11 @@ The |project_name| chunker uses a rolling hash computed by the Buzhash_ algorith
 It triggers (chunks) when the last HASH_MASK_BITS bits of the hash are zero,
 producing chunks of 2^HASH_MASK_BITS Bytes on average.
 
+Buzhash is **only** used for cutting the chunks at places defined by the
+content, the buzhash value is **not** used as the deduplication criteria (we
+use a cryptographically strong hash/MAC over the chunk contents for this, the
+id_hash).
+
 ``borg create --chunker-params CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE``
 can be used to tune the chunker parameters, the default is:
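
The separation this patch documents (a rolling hash decides where to cut, id_hash decides what counts as a duplicate) can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: ``toy_chunker`` is a simple rolling-sum stand-in and not the real Buzhash, ``store_chunks`` and the dict-based ``repo`` are invented names, and none of this reflects the project's actual code. Only the choice of sha256 versus hmac-sha256 for id_hash comes from the text above.

```python
import hashlib
import hmac
from typing import Dict, Iterator, Optional


def id_hash(chunk: bytes, key: Optional[bytes] = None) -> bytes:
    """Chunk id: plain SHA-256 without a key, HMAC-SHA256 when a key is given."""
    if key is None:
        return hashlib.sha256(chunk).digest()
    return hmac.new(key, chunk, hashlib.sha256).digest()


def toy_chunker(data: bytes, mask_bits: int = 4, window: int = 8) -> Iterator[bytes]:
    """Yield content-defined chunks using a toy rolling sum (NOT real Buzhash).

    A cut is made when the low mask_bits bits of the rolling value are zero.
    The rolling value is used only to place cut points, never as a chunk id.
    """
    mask = (1 << mask_bits) - 1
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            # Drop the byte that just left the window.
            rolling -= data[i - window]
        if i - start + 1 >= window and rolling & mask == 0:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):
        yield data[start:]


def store_chunks(repo: Dict[bytes, bytes], data: bytes,
                 key: Optional[bytes] = None) -> int:
    """Add unseen chunks to repo (id -> chunk) and return how many were new."""
    new = 0
    for chunk in toy_chunker(data):
        chunk_id = id_hash(chunk, key)
        if chunk_id not in repo:  # duplicate detection uses id_hash only
            repo[chunk_id] = chunk
            new += 1
    return new
```

Storing the same data a second time adds no new chunks, because every chunk id is already present in ``repo``. A weak rolling value would collide far too easily to be trusted for that lookup, which is why the patch insists on a cryptographically strong hash or MAC for id_hash.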