From 5b297849d38843164e180bece29fb00d87c6e7a9 Mon Sep 17 00:00:00 2001
From: enkore
Date: Sat, 27 Nov 2021 14:41:24 +0000
Subject: [PATCH 1/3] docs/data-structures: add content-defined chunking explainer

---
 docs/internals/data-structures.rst | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/docs/internals/data-structures.rst b/docs/internals/data-structures.rst
index 1b2174e86..bff045666 100644
--- a/docs/internals/data-structures.rst
+++ b/docs/internals/data-structures.rst
@@ -615,13 +615,30 @@ with data and seeking over the empty hole ranges).
 +++++++++++++++++
 
 The buzhash chunker triggers (chunks) when the last HASH_MASK_BITS bits of the
-hash are zero, producing chunks with a target size of 2^HASH_MASK_BITS Bytes.
+hash are zero, producing chunks with a target size of 2^HASH_MASK_BITS bytes.
 
 Buzhash is **only** used for cutting the chunks at places defined by the
 content, the buzhash value is **not** used as the deduplication criteria (we
 use a cryptographically strong hash/MAC over the chunk contents for this, the
 id_hash).
 
+The idea of content-defined chunking is to assign a hash to every byte where a
+cut *could* be placed. The hash is based on some number of bytes
+(the window size) before the byte in question. Chunks are cut
+where the hash satisfies some condition
+(usually "n trailing/leading zero bits").
+
+Using normal hash functions, this would be extremely slow,
+requiring hashing ``window size * file size`` bytes.
+A rolling hash is used instead, which allows adding a new input byte and
+computing a new hash, as well as *removing* a previously added input byte
+from the computed hash. This makes the cost of computing a hash for each
+input byte largely independent of the window size.
+
+Borg defines minimum and maximum chunk sizes (CHUNK_MIN_EXP and CHUNK_MAX_EXP, respectively)
+which narrow down where cuts may be made, greatly reducing the amount of data
+that is actually hashed for content-defined chunking.
+
 ``borg create --chunker-params buzhash,CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE``
 can be used to tune the chunker parameters, the default is:
 

From 79cb4e43e5435a63c21865eb7374d0e0dabc47f4 Mon Sep 17 00:00:00 2001
From: enkore
Date: Sat, 27 Nov 2021 18:45:19 +0000
Subject: [PATCH 2/3] docs/data-structures: tie CDC back into dedup rationale

---
 docs/internals/data-structures.rst | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/internals/data-structures.rst b/docs/internals/data-structures.rst
index bff045666..08b0b84d9 100644
--- a/docs/internals/data-structures.rst
+++ b/docs/internals/data-structures.rst
@@ -626,7 +626,11 @@ The idea of content-defined chunking is to assign a hash to every byte where a
 cut *could* be placed. The hash is based on some number of bytes
 (the window size) before the byte in question. Chunks are cut
 where the hash satisfies some condition
-(usually "n trailing/leading zero bits").
+(usually "n trailing/leading zero bits"). This causes chunks to be cut
+in the same location relative to the file's contents, even if bytes are inserted
+or removed before/after a cut, as long as the bytes within the window stay the same.
+This results in a high chance that a single cluster of changes to a file will only
+produce 1-2 new chunks, aiding deduplication.
 
 Using normal hash functions, this would be extremely slow,
 requiring hashing ``window size * file size`` bytes.
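To make the paragraphs added above more concrete, here is a minimal sketch of a
content-defined chunker built around a rolling hash. It is **not** Borg's buzhash
chunker: the additive hash, the window size and the mask below are simplified,
hypothetical stand-ins chosen for readability::

    # Minimal content-defined chunking sketch (NOT Borg's buzhash chunker).
    # WINDOW_SIZE, HASH_MASK_BITS and the additive rolling hash are
    # illustrative stand-ins only.

    WINDOW_SIZE = 16       # bytes covered by the rolling hash window
    HASH_MASK_BITS = 12    # cut where the low 12 bits of the hash are zero
    HASH_MASK = (1 << HASH_MASK_BITS) - 1

    def chunk(data: bytes):
        """Yield chunks of *data*, cutting wherever the rolling hash of the
        last WINDOW_SIZE bytes has HASH_MASK_BITS trailing zero bits."""
        start = 0
        rolling = 0
        for i, byte in enumerate(data):
            # Roll the hash: add the incoming byte, remove the byte that just
            # left the window. A real rolling hash (e.g. buzhash) does this
            # with rotations and a substitution table instead of a plain sum.
            rolling += byte
            if i >= WINDOW_SIZE:
                rolling -= data[i - WINDOW_SIZE]
            # Cut when the hash condition ("n trailing zero bits") is met.
            if (rolling & HASH_MASK) == 0:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):       # emit the trailing partial chunk
            yield data[start:]

Because a random hash value has HASH_MASK_BITS trailing zero bits about once every
2^HASH_MASK_BITS bytes, the target chunk size grows with HASH_MASK_BITS; Borg
additionally clamps chunk sizes between 2^CHUNK_MIN_EXP and 2^CHUNK_MAX_EXP, which
the sketch above omits.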
From 94e93ba7e6fef1fd0257c7e3f84539e3ed828e70 Mon Sep 17 00:00:00 2001
From: Thomas Waldmann
Date: Sun, 16 Jan 2022 20:39:29 +0100
Subject: [PATCH 3/3] formula is only approximately correct

the movement of the start of the hashing window stops at
(file_size - window_size), thus THAT would be the factor in that formula,
not just file_size.

for medium and big files, window_size is much smaller than file_size,
so I guess we can just say "approximately" for the general case.
---
 docs/internals/data-structures.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/internals/data-structures.rst b/docs/internals/data-structures.rst
index 08b0b84d9..6d1b4ab07 100644
--- a/docs/internals/data-structures.rst
+++ b/docs/internals/data-structures.rst
@@ -633,7 +633,7 @@ This results in a high chance that a single cluster of changes to a file will on
 produce 1-2 new chunks, aiding deduplication.
 
 Using normal hash functions, this would be extremely slow,
-requiring hashing ``window size * file size`` bytes.
+requiring hashing approximately ``window size * file size`` bytes.
 A rolling hash is used instead, which allows adding a new input byte and
 computing a new hash, as well as *removing* a previously added input byte
 from the computed hash. This makes the cost of computing a hash for each
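A back-of-the-envelope comparison shows what the "approximately" in the patch above
glosses over and why the rolling hash is needed at all. The numbers below are example
values, not Borg's default chunker parameters::

    window_size = 4096        # bytes per hash window (example value)
    file_size = 1 << 30       # 1 GiB input file (example value)

    # Naive hashing: every possible cut position re-hashes a full window,
    # and the window start stops moving at (file_size - window_size).
    naive_bytes = (file_size - window_size + 1) * window_size   # ~4.4e12

    # Rolling hash: each input byte is added once and removed once.
    rolling_bytes = 2 * file_size                                # ~2.1e9

    print(f"naive:   {naive_bytes:.2e} bytes hashed")
    print(f"rolling: {rolling_bytes:.2e} bytes hashed")

The exact naive count uses ``file_size - window_size`` as the factor, as the commit
message notes, but for files much larger than the window the difference is negligible,
hence "approximately".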