diff --git a/docs/internals.rst b/docs/internals.rst
index 30237e61a..6dfc8ba9b 100644
--- a/docs/internals.rst
+++ b/docs/internals.rst
@@ -168,13 +168,27 @@ A chunk is stored as an object as well, of course.
 
 Chunks
 ------
 
-|project_name| uses a rolling hash computed by the Buzhash_ algorithm, with a
-window size of 4095 bytes (`0xFFF`), with a minimum chunk size of 1024 bytes.
-It triggers (chunks) when the last 16 bits of the hash are zero, producing
-chunks of 64kiB on average.
+The |project_name| chunker uses a rolling hash computed by the Buzhash_ algorithm.
+It triggers (chunks) when the last HASH_MASK_BITS bits of the hash are zero,
+producing chunks of 2^HASH_MASK_BITS Bytes on average.
+
+``create --chunker-params CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE``
+can be used to tune the chunker parameters; the defaults are:
+
+- CHUNK_MIN_EXP = 10 (minimum chunk size = 2^10 B = 1 kiB)
+- CHUNK_MAX_EXP = 23 (maximum chunk size = 2^23 B = 8 MiB)
+- HASH_MASK_BITS = 16 (statistical medium chunk size ~= 2^16 B = 64 kiB)
+- HASH_WINDOW_SIZE = 4095 [B] (`0xFFF`)
+
+The default parameters are OK for relatively small backup data volumes and
+repository sizes and a lot of available memory (RAM) and disk space for the
+chunk index. If that does not apply, you are advised to tune these parameters
+to keep the chunk count lower than with the defaults.
 
 The buzhash table is altered by XORing it with a seed randomly generated once
-for the archive, and stored encrypted in the keyfile.
+for the archive, and stored encrypted in the keyfile. This is to prevent
+chunk-size-based fingerprinting attacks on your encrypted repo contents (to
+guess what files you have based on a specific set of chunk sizes).
 
 
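The chunking scheme this hunk documents can be sketched in a few lines of Python. This is a simplified, illustrative model only, not |project_name|'s real chunker (which is implemented in C): the buzhash table here is freshly randomized rather than XORed with a per-archive seed, and the rolling window is simply restarted at each cut instead of being re-filled across chunk boundaries.

```python
import os

WINDOW_SIZE = 4095              # HASH_WINDOW_SIZE
CHUNK_MIN = 2 ** 10             # 2^CHUNK_MIN_EXP
CHUNK_MAX = 2 ** 23             # 2^CHUNK_MAX_EXP
HASH_MASK_BITS = 16
MASK = (1 << HASH_MASK_BITS) - 1

# one random 32-bit value per byte value -- the "buzhash table"
# (the real chunker XORs its table with a per-archive seed)
TABLE = [int.from_bytes(os.urandom(4), "big") for _ in range(256)]

def rol32(x, n):
    """Rotate a 32-bit value left by n bits (0 <= n <= 31)."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def chunkify(data):
    """Yield content-defined chunks whose concatenation equals `data`."""
    start = 0   # offset where the current chunk begins
    h = 0       # rolling hash over (up to) the last WINDOW_SIZE bytes
    for i in range(len(data)):
        # roll the newest byte into the window hash
        h = rol32(h, 1) ^ TABLE[data[i]]
        if i - start >= WINDOW_SIZE:
            # slide the window: cancel the byte that just fell out of it
            h ^= rol32(TABLE[data[i - WINDOW_SIZE]], WINDOW_SIZE % 32)
        size = i - start + 1
        # cut when the low HASH_MASK_BITS bits are zero (respecting the
        # minimum chunk size), or when the maximum chunk size is reached
        if size >= CHUNK_MAX or (size >= CHUNK_MIN and h & MASK == 0):
            yield data[start:i + 1]
            start = i + 1
            h = 0   # simplification: restart the window at each cut
    if start < len(data):
        yield data[start:]
```

With random input and the default 16-bit mask, chunks come out around 64 kiB on average; only the split positions depend on content, which is what makes deduplication survive insertions and deletions.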
@@ -243,7 +257,7 @@ Indexes / Caches memory usage
 
 Here is the estimated memory usage of |project_name|:
 
-  chunk_count ~= total_file_size / 65536
+  chunk_count ~= total_file_size / 2 ^ HASH_MASK_BITS
 
   repo_index_usage = chunk_count * 40
 
@@ -252,20 +266,32 @@ Here is the estimated memory usage of |project_name|:
   files_cache_usage = total_file_count * 240 + chunk_count * 80
 
   mem_usage ~= repo_index_usage + chunks_cache_usage + files_cache_usage
-             = total_file_count * 240 + total_file_size / 400
+             = chunk_count * 164 + total_file_count * 240
 
 All units are Bytes.
 
-It is assuming every chunk is referenced exactly once and that typical chunk size is 64kiB.
+The estimate assumes that every chunk is referenced exactly once (if you have
+a lot of duplicate chunks, you will have fewer chunks than estimated above).
+
+It also assumes that the typical chunk size is 2^HASH_MASK_BITS B (if you have
+a lot of files smaller than this statistical medium chunk size, you will have
+more chunks than estimated above, because 1 file is at least 1 chunk).
 
 If a remote repository is used the repo index will be allocated on the remote side.
 
-E.g. backing up a total count of 1Mi files with a total size of 1TiB:
+E.g. backing up a total count of 1Mi files with a total size of 1TiB.
 
-  mem_usage = 1 * 2**20 * 240 + 1 * 2**40 / 400 = 2.8GiB
+a) with create --chunker-params 10,23,16,4095 (default):
+
+  mem_usage = 2.8GiB
+
+b) with create --chunker-params 10,23,20,4095 (custom):
+
+  mem_usage = 0.4GiB
 
-Note: there is a commandline option to switch off the files cache. You'll save
-some memory, but it will need to read / chunk all the files then.
+Note: there is also the ``--no-files-cache`` option to switch off the files
+cache. You'll save some memory, but it will then need to read / chunk all
+the files, as it cannot skip unmodified files.
 
 
 Encryption
@@ -291,6 +317,7 @@ Encryption keys are either derived from a passphrase or kept in a key file.
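The a)/b) figures in this hunk follow directly from the per-entry constants the section gives (40 + 44 + 80 = 164 B per chunk, 240 B per file). A short Python check, using a hypothetical helper name, reproduces them:

```python
# Worked example of the memory-usage estimate for 1Mi files totalling 1 TiB.
# `estimate_mem_usage` is an illustrative helper, not a borg API; the
# per-entry constants (40, 44, 80, 240 Bytes) come from this section.

def estimate_mem_usage(total_file_count, total_file_size, hash_mask_bits):
    chunk_count = total_file_size // 2 ** hash_mask_bits
    repo_index_usage = chunk_count * 40
    chunks_cache_usage = chunk_count * 44
    files_cache_usage = total_file_count * 240 + chunk_count * 80
    return repo_index_usage + chunks_cache_usage + files_cache_usage

GiB = 2 ** 30

# a) default chunker params (HASH_MASK_BITS = 16)
default = estimate_mem_usage(2 ** 20, 2 ** 40, 16) / GiB
# b) custom chunker params (HASH_MASK_BITS = 20)
custom = estimate_mem_usage(2 ** 20, 2 ** 40, 20) / GiB

print(f"default: {default:.1f} GiB, custom: {custom:.1f} GiB")
# -> default: 2.8 GiB, custom: 0.4 GiB
```

Raising HASH_MASK_BITS from 16 to 20 cuts chunk_count from 16Mi to 1Mi, which is why the estimate drops by roughly a factor of seven.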
 The passphrase is passed through the ``BORG_PASSPHRASE`` environment variable
 or prompted for interactive usage.
 
+
 Key files
 ---------
 
@@ -355,4 +382,10 @@ representation of the repository id.
 Compression
 -----------
 
-Currently, compression is disabled by default. Zlib compression can be enabled by passing ``--compression level`` on the command line. Level can be anything from 0 (no compression, fast) to 9 (high compression, slow).
+|project_name| currently always pipes all data through a zlib compressor which
+supports compression levels 0 (no compression, fast) to 9 (high compression, slow).
+
+See ``borg create --help`` for how to specify the compression level and its default.
+
+Note: zlib level 0 creates slightly more output data than it gets as input,
+due to zlib protocol overhead.
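The level-0 overhead noted in the last hunk is easy to observe with Python's standard ``zlib`` module (shown here purely as an illustration of zlib's behaviour, independent of |project_name|'s own bindings):

```python
import zlib

# Level 0 performs no compression: the input is wrapped in uncompressed
# "stored" deflate blocks plus the zlib header and checksum, so the output
# is a few bytes *larger* than the input -- but still a valid zlib stream.
data = bytes(range(256)) * 256           # 64 kiB input (repeating byte pattern)

stored = zlib.compress(data, 0)          # level 0: no compression, fast
best = zlib.compress(data, 9)            # level 9: high compression, slow

assert len(stored) > len(data)           # slightly bigger: protocol overhead
assert zlib.decompress(stored) == data   # round-trips losslessly
print(len(data), len(stored), len(best))
```

The repeating pattern shrinks dramatically at level 9, while level 0 pays a handful of bytes of framing overhead per stream regardless of content.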