update internals doc about chunker params, memory usage and compression

2025-03-10 14:15:43 +00:00 · 2015-07-14 00:43:35 +02:00 · b5bdb52b6a
commit b5bdb52b6a
parent b2f460d591
1 changed file with 46 additions and 13 deletions
--- a/docs/internals.rst
+++ b/docs/internals.rst
@@ -168,13 +168,27 @@ A chunk is stored as an object as well, of course.
 Chunks
 ------
-|project_name| uses a rolling hash computed by the Buzhash_ algorithm, with a
-window size of 4095 bytes (`0xFFF`), with a minimum chunk size of 1024 bytes.
-It triggers (chunks) when the last 16 bits of the hash are zero, producing
-chunks of 64kiB on average.
+The |project_name| chunker uses a rolling hash computed by the Buzhash_ algorithm.
+It triggers (chunks) when the last HASH_MASK_BITS bits of the hash are zero,
+producing chunks of 2^HASH_MASK_BITS Bytes on average.
+
+create --chunker-params CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE
+can be used to tune the chunker parameters, the default is:
+- CHUNK_MIN_EXP = 10 (minimum chunk size = 2^10 B = 1 kiB)
+- CHUNK_MAX_EXP = 23 (maximum chunk size = 2^23 B = 8 MiB)
+- HASH_MASK_BITS = 16 (statistical medium chunk size ~= 2^16 B = 64 kiB)
+- HASH_WINDOW_SIZE = 4095 [B] (`0xFFF`)
+The default parameters are OK for relatively small backup data volumes and
+repository sizes and a lot of available memory (RAM) and disk space for the
+chunk index. If that does not apply, you are advised to tune these parameters
+to keep the chunk count lower than with the defaults.
 The buzhash table is altered by XORing it with a seed randomly generated once
-for the archive, and stored encrypted in the keyfile.
+for the archive, and stored encrypted in the keyfile. This is to prevent chunk
+size based fingerprinting attacks on your encrypted repo contents (to guess
+what files you have based on a specific set of chunk sizes).
 Indexes / Caches
@@ -243,7 +257,7 @@ Indexes / Caches memory usage
 Here is the estimated memory usage of |project_name|:
-  chunk_count ~= total_file_size / 65536
+  chunk_count ~= total_file_size / 2 ^ HASH_MASK_BITS
  repo_index_usage = chunk_count * 40
@@ -252,20 +266,32 @@ Here is the estimated memory usage of |project_name|:
  files_cache_usage = total_file_count * 240 + chunk_count * 80
  mem_usage ~= repo_index_usage + chunks_cache_usage + files_cache_usage
-             = total_file_count * 240 + total_file_size / 400
+             = chunk_count * 164 + total_file_count * 240
 All units are Bytes.
-It is assuming every chunk is referenced exactly once and that typical chunk size is 64kiB.
+It is assuming every chunk is referenced exactly once (if you have a lot of
+duplicate chunks, you will have less chunks than estimated above).
+It is also assuming that typical chunk size is 2^HASH_MASK_BITS (if you have
+a lot of files smaller than this statistical medium chunk size, you will have
+more chunks than estimated above, because 1 file is at least 1 chunk).
 If a remote repository is used the repo index will be allocated on the remote side.
-E.g. backing up a total count of 1Mi files with a total size of 1TiB:
-  mem_usage  =  1 * 2**20 * 240  +  1 * 2**40 / 400  =  2.8GiB
-Note: there is a commandline option to switch off the files cache. You'll save
-some memory, but it will need to read / chunk all the files then.
+E.g. backing up a total count of 1Mi files with a total size of 1TiB.
+a) with create --chunker-params 10,23,16,4095 (default):
+  mem_usage  =  2.8GiB
+
+b) with create --chunker-params 10,23,20,4095 (custom):
+  mem_usage  =  0.4GiB
+Note: there is also the --no-files-cache option to switch off the files cache.
+You'll save some memory, but it will need to read / chunk all the files then as
+it can not skip unmodified files then.
 Encryption
@@ -291,6 +317,7 @@ Encryption keys are either derived from a passphrase or kept in a key file.
 The passphrase is passed through the ``BORG_PASSPHRASE`` environment variable
 or prompted for interactive usage.
 Key files
 ---------
@@ -355,4 +382,10 @@ representation of the repository id.
 Compression
 -----------
-Currently, compression is disabled by default. Zlib compression can be enabled by passing ``--compression level`` on the command line. Level can be anything from 0 (no compression, fast) to 9 (high compression, slow).
+|project_name| currently always pipes all data through a zlib compressor which
+supports compression levels 0 (no compression, fast) to 9 (high compression, slow).
+See ``borg create --help`` about how to specify the compression level and its default.
+Note: zlib level 0 creates a little bit more output data than it gets as input,
+due to zlib protocol overhead.