mirror of https://github.com/borgbackup/borg.git synced 2025-03-10 14:15:43 +00:00

update internals doc about chunker params, memory usage and compression

Thomas Waldmann 2015-07-14 00:43:35 +02:00
parent b2f460d591
commit b5bdb52b6a


@@ -168,13 +168,27 @@ A chunk is stored as an object as well, of course.
 Chunks
 ------
-|project_name| uses a rolling hash computed by the Buzhash_ algorithm, with a
-window size of 4095 bytes (`0xFFF`), with a minimum chunk size of 1024 bytes.
-It triggers (chunks) when the last 16 bits of the hash are zero, producing
-chunks of 64kiB on average.
+The |project_name| chunker uses a rolling hash computed by the Buzhash_ algorithm.
+It triggers (chunks) when the last HASH_MASK_BITS bits of the hash are zero,
+producing chunks of 2^HASH_MASK_BITS Bytes on average.
+create --chunker-params CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE
+can be used to tune the chunker parameters, the default is:
+- CHUNK_MIN_EXP = 10 (minimum chunk size = 2^10 B = 1 kiB)
+- CHUNK_MAX_EXP = 23 (maximum chunk size = 2^23 B = 8 MiB)
+- HASH_MASK_BITS = 16 (statistical medium chunk size ~= 2^16 B = 64 kiB)
+- HASH_WINDOW_SIZE = 4095 [B] (`0xFFF`)
+The default parameters are OK for relatively small backup data volumes and
+repository sizes and a lot of available memory (RAM) and disk space for the
+chunk index. If that does not apply, you are advised to tune these parameters
+to keep the chunk count lower than with the defaults.
 The buzhash table is altered by XORing it with a seed randomly generated once
-for the archive, and stored encrypted in the keyfile.
+for the archive, and stored encrypted in the keyfile. This is to prevent chunk
+size based fingerprinting attacks on your encrypted repo contents (to guess
+what files you have based on a specific set of chunk sizes).
 Indexes / Caches
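The trigger rule described in the hunk above (cut when the low HASH_MASK_BITS bits of the rolling hash are zero, bounded by the minimum and maximum chunk sizes) can be sketched roughly as follows. This is a simplified illustration, not Borg's chunker code: the helper names `rotl32` and `chunk_sizes`, the randomly generated table, and the small demo parameters are all assumptions chosen to keep the example short.

```python
import random

MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def chunk_sizes(data, chunk_min_exp, chunk_max_exp, hash_mask_bits, window, table):
    """Return the chunk sizes chosen by a buzhash-style rolling hash (hypothetical helper)."""
    min_size, max_size = 1 << chunk_min_exp, 1 << chunk_max_exp
    mask = (1 << hash_mask_bits) - 1
    sizes, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shift the hash by one position and mix in the incoming byte.
        h = rotl32(h, 1) ^ table[byte]
        length = i - start + 1
        if length > window:
            # Remove the byte that just left the window; it was rotated `window` times.
            h ^= rotl32(table[data[i - window]], window)
        # Cut when the low mask bits are all zero, or when the chunk hits the maximum size.
        if (length >= min_size and (h & mask) == 0) or length == max_size:
            sizes.append(length)
            start, h = i + 1, 0
    if start < len(data):
        sizes.append(len(data) - start)  # trailing partial chunk
    return sizes

rng = random.Random(0)
base_table = [rng.getrandbits(32) for _ in range(256)]  # stand-in for a buzhash table
seed = rng.getrandbits(32)
table = [v ^ seed for v in base_table]  # the table is altered by XORing it with a seed

data = bytes(rng.getrandbits(8) for _ in range(20000))
# Small demo parameters (not the defaults): min 2^4 B, max 2^8 B, 5 mask bits, 15 B window.
sizes = chunk_sizes(data, 4, 8, 5, 15, table)
print(len(sizes), "chunks, mean size", sum(sizes) / len(sizes))
```

Raising `hash_mask_bits` makes the cut condition rarer, which is why larger values yield fewer, larger chunks.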
@@ -243,7 +257,7 @@ Indexes / Caches memory usage
 Here is the estimated memory usage of |project_name|:
-chunk_count ~= total_file_size / 65536
+chunk_count ~= total_file_size / 2 ^ HASH_MASK_BITS
 repo_index_usage = chunk_count * 40
@@ -252,20 +266,32 @@ Here is the estimated memory usage of |project_name|:
 files_cache_usage = total_file_count * 240 + chunk_count * 80
 mem_usage ~= repo_index_usage + chunks_cache_usage + files_cache_usage
-= total_file_count * 240 + total_file_size / 400
+= chunk_count * 164 + total_file_count * 240
 All units are Bytes.
-It is assuming every chunk is referenced exactly once and that typical chunk size is 64kiB.
+It is assuming every chunk is referenced exactly once (if you have a lot of
+duplicate chunks, you will have less chunks than estimated above).
+It is also assuming that typical chunk size is 2^HASH_MASK_BITS (if you have
+a lot of files smaller than this statistical medium chunk size, you will have
+more chunks than estimated above, because 1 file is at least 1 chunk).
 If a remote repository is used the repo index will be allocated on the remote side.
-E.g. backing up a total count of 1Mi files with a total size of 1TiB:
-mem_usage = 1 * 2**20 * 240 + 1 * 2**40 / 400 = 2.8GiB
-Note: there is a commandline option to switch off the files cache. You'll save
-some memory, but it will need to read / chunk all the files then.
+E.g. backing up a total count of 1Mi files with a total size of 1TiB.
+a) with create --chunker-params 10,23,16,4095 (default):
+mem_usage = 2.8GiB
+b) with create --chunker-params 10,23,20,4095 (custom):
+mem_usage = 0.4GiB
+Note: there is also the --no-files-cache option to switch off the files cache.
+You'll save some memory, but it will need to read / chunk all the files then as
+it can not skip unmodified files then.
 Encryption
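The two example figures in the hunk above follow directly from the revised formula. A minimal sketch that reproduces them, assuming the aggregated constants from the doc (164 bytes per chunk, 240 bytes per file); the function name `estimate_mem_usage` is made up for illustration:

```python
def estimate_mem_usage(total_file_size, total_file_count, hash_mask_bits=16):
    """Estimated memory usage in bytes, using the formula from the internals doc."""
    # Assumes every chunk is referenced once and has the typical size 2^HASH_MASK_BITS.
    chunk_count = total_file_size / 2 ** hash_mask_bits
    return chunk_count * 164 + total_file_count * 240

GiB = 2 ** 30
files, size = 2 ** 20, 2 ** 40  # 1Mi files, 1TiB total

default = estimate_mem_usage(size, files, hash_mask_bits=16)
custom = estimate_mem_usage(size, files, hash_mask_bits=20)

print(f"default (16 bits): {default / GiB:.1f} GiB")  # 2.8 GiB
print(f"custom  (20 bits): {custom / GiB:.1f} GiB")   # 0.4 GiB
```

With 20 mask bits the per-chunk term shrinks by a factor of 16, so the per-file term (about 0.23 GiB here) dominates the estimate.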
@@ -291,6 +317,7 @@ Encryption keys are either derived from a passphrase or kept in a key file.
 The passphrase is passed through the ``BORG_PASSPHRASE`` environment variable
 or prompted for interactive usage.
 Key files
 ---------
@@ -355,4 +382,10 @@ representation of the repository id.
 Compression
 -----------
-Currently, compression is disabled by default. Zlib compression can be enabled by passing ``--compression level`` on the command line. Level can be anything from 0 (no compression, fast) to 9 (high compression, slow).
+|project_name| currently always pipes all data through a zlib compressor which
+supports compression levels 0 (no compression, fast) to 9 (high compression, slow).
+See ``borg create --help`` about how to specify the compression level and its default.
+Note: zlib level 0 creates a little bit more output data than it gets as input,
+due to zlib protocol overhead.
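The level 0 overhead noted in the hunk above is easy to observe with Python's standard `zlib` module. This is only an illustration of zlib's behaviour on a made-up sample input, not a reproduction of how the project calls the compressor:

```python
import zlib

data = b"borg backup " * 100  # 1200 bytes of repetitive sample input

stored = zlib.compress(data, 0)  # level 0: data is wrapped, not compressed
packed = zlib.compress(data, 9)  # level 9: strongest compression

print("input: ", len(data))
print("level 0:", len(stored), "(overhead", len(stored) - len(data), "bytes)")
print("level 9:", len(packed))
```

The extra bytes at level 0 come from the zlib stream header, the stored-block headers and the trailing checksum.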