From 734dae80efbcd929249b41fa76b456b9028fc2e2 Mon Sep 17 00:00:00 2001
From: Thomas Waldmann
Date: Mon, 2 Nov 2015 19:47:09 +0100
Subject: [PATCH] improve chunker params docs, fixes #362

---
 docs/internals.rst |  8 +++-----
 docs/usage.rst     | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/docs/internals.rst b/docs/internals.rst
index d989fd9c5..2ebed0c5a 100644
--- a/docs/internals.rst
+++ b/docs/internals.rst
@@ -196,6 +196,7 @@ to the archive metadata.

 A chunk is stored as an object as well, of course.

+.. _chunker_details:
 Chunks
 ------

@@ -212,16 +213,13 @@ can be used to tune the chunker parameters, the default is:
 - HASH_MASK_BITS = 16 (statistical medium chunk size ~= 2^16 B = 64 kiB)
 - HASH_WINDOW_SIZE = 4095 [B] (`0xFFF`)

-The default parameters are OK for relatively small backup data volumes and
-repository sizes and a lot of available memory (RAM) and disk space for the
-chunk index. If that does not apply, you are advised to tune these parameters
-to keep the chunk count lower than with the defaults.
-
 The buzhash table is altered by XORing it with a seed randomly generated once
 for the archive, and stored encrypted in the keyfile. This is to prevent chunk
 size based fingerprinting attacks on your encrypted repo contents (to guess what
 files you have based on a specific set of chunk sizes).

+For some more general usage hints see also `--chunker-params`.
+
 Indexes / Caches
 ----------------

diff --git a/docs/usage.rst b/docs/usage.rst
index 6b88d5c60..5a7ce0ede 100644
--- a/docs/usage.rst
+++ b/docs/usage.rst
@@ -391,6 +391,48 @@ Additional Notes
 Here are misc. notes about topics that are maybe not covered in enough detail
 in the usage section.

+--chunker-params
+~~~~~~~~~~~~~~~~
+The chunker params influence how input files are cut into pieces (chunks)
+which are then considered for deduplication.
+They also have a big impact on
+resource usage (RAM and disk space) as the amount of resources needed is
+(also) determined by the total amount of chunks in the repository (see
+`Indexes / Caches memory usage` for details).
+
+`--chunker-params=10,23,16,4095 (default)` results in a fine-grained deduplication
+and creates a big amount of chunks and thus uses a lot of resources to manage them.
+This is good for relatively small data volumes and if the machine has a good
+amount of free RAM and disk space.
+
+`--chunker-params=19,23,21,4095` results in a coarse-grained deduplication and
+creates a much smaller amount of chunks and thus uses less resources.
+This is good for relatively big data volumes and if the machine has a relatively
+low amount of free RAM and disk space.
+
+If you already have made some archives in a repository and you then change
+chunker params, this of course impacts deduplication as the chunks will be
+cut differently.
+
+In the worst case (all files are big and were touched in between backups), this
+will store all content into the repository again.
+
+Usually, it is not that bad though:
+- usually most files are not touched, so it will just re-use the old chunks
+it already has in the repo
+- files smaller than the (both old and new) minimum chunksize result in only
+one chunk anyway, so the resulting chunks are same and deduplication will apply
+
+If you switch chunker params to save resources for an existing repo that
+already has some backup archives, you will see an increasing effect over time,
+when more and more files have been touched and stored again using the bigger
+chunksize **and** all references to the smaller older chunks have been removed
+(by deleting / pruning archives).
+
+If you want to see an immediate big effect on resource usage, you better start
+a new repository when changing chunker params.
+
+For more details, see :ref:`chunker_details`.
+
 --read-special
 ~~~~~~~~~~~~~~