diff --git a/docs/internals.rst b/docs/internals.rst
index 0e01336bb..ede9fb0ef 100644
--- a/docs/internals.rst
+++ b/docs/internals.rst
@@ -33,43 +33,25 @@ index.%d
     cache of the file indexes. those files can be regenerated with
     ``check --repair``
 
-Indexes and memory usage
-------------------------
+Repository config file
+----------------------
 
-Repository index
-    40 bytes x N ~ 200MB (If a remote repository is
-    used this will be allocated on the remote side)
+Each repository has a ``config`` file which which is a ``INI``
+formatted file which looks like this:
 
-Chunk lookup index
-    44 bytes x N ~ 220MB
+    [repository]
+    version = 1
+    segments_per_dir = 10000
+    max_segment_size = 5242880
+    id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
 
-File chunk cache
-    probably 80-100 bytes x N ~ 400MB
+This is where the ``repository.id`` is stored. It is a unique
+identifier for repositories. It will not change if you move the
+repository around so you can make a local transfer then decide to move
+the repository in another (even remote) location at a later time.
 
-The chunk lookup index (chunk hash -> reference count, size, ciphered
-size ; in file cache/chunk) and the repository index (chunk hash ->
-segment, offset ; in file ``repo/index.%d``) are stored in a sort of hash
-table, directly mapped in memory from the file content, with only one
-slot per bucket, but that spreads the collisions to the following
-buckets. As a consequence the hash is just a start position for a linear
-search, and if the element is not in the table the index is linearly
-crossed until an empty bucket is found. When the table is full at 90%
-its size is doubled, when it's empty at 25% its size is halfed. So
-operations on it have a variable complexity between constant and linear
-with low factor, and memory overhead varies between 10% and 300%.
-
-The file chunk cache (file path hash -> age, inode number, size,
-mtime_ns, chunks hashes ; in file cache/files) is stored as a python
-associative array storing python objects, which generate a lot of
-overhead. This takes around 240 bytes per file without the chunk
-list, to be compared to at most 64 bytes of real data (depending on data
-alignment), and around 80 bytes per chunk hash (vs 32), with a minimum
-of ~250 bytes even if only one chunck hash. The inode number is stored
-to make sure we distinguish between different files, as a single path
-may not be unique accross different archives in different setups.
-
-The ``index.%d`` files are random access but those files can be
-recreated if damaged or lost using ``check --repair``.
+|project_name| will do a POSIX read lock on that file when operating
+on the repository.
 
 Repository structure
 --------------------
@@ -115,7 +97,7 @@ beyond the ``MAX_OBJECT_SIZE`` barrier of 20MB.
 A chunk is an object as well, of course, and its id is the hash of its
 (unencrypted and uncompressed) content.
 
-Hints are stored in a file (repo/hints) and contain: version, list of
+Hints are stored in a file (``repo/hints``) and contain: version, list of
 segments, compact.
 
 Chunks
@@ -128,25 +110,49 @@ average. All these parameters are fixed. The buzhash table is altered by
 XORing it with a seed randomly generated once for the archive, and
 stored encrypted in the keyfile.
 
-Repository config file
-----------------------
+Indexes
+-------
 
-Each repository has a ``config`` file which which is a ``INI``
-formatted file which looks like this:
+The chunk lookup index (chunk hash -> reference count, size, ciphered
+size ; in file cache/chunk) and the repository index (chunk hash ->
+segment, offset ; in file ``repo/index.%d``) are stored in a sort of hash
+table, directly mapped in memory from the file content, with only one
+slot per bucket, but that spreads the collisions to the following
+buckets. As a consequence the hash is just a start position for a linear
+search, and if the element is not in the table the index is linearly
+crossed until an empty bucket is found. When the table is full at 90%
+its size is doubled, when it's empty at 25% its size is halfed. So
+operations on it have a variable complexity between constant and linear
+with low factor, and memory overhead varies between 10% and 300%.
 
-    [repository]
-    version = 1
-    segments_per_dir = 10000
-    max_segment_size = 5242880
-    id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
+The file chunk cache (file path hash -> age, inode number, size,
+mtime_ns, chunks hashes ; in file cache/files) is stored as a python
+associative array storing python objects, which generate a lot of
+overhead. This takes around 240 bytes per file without the chunk
+list, to be compared to at most 64 bytes of real data (depending on data
+alignment), and around 80 bytes per chunk hash (vs 32), with a minimum
+of ~250 bytes even if only one chunck hash. The inode number is stored
+to make sure we distinguish between different files, as a single path
+may not be unique accross different archives in different setups.
 
-This is where the ``repository.id`` is stored. It is a unique
-identifier for repositories. It will not change if you move the
-repository around so you can make a local transfer then decide to move
-the repository in another (even remote) location at a later time.
+The ``index.%d`` files are random access but those files can be
+recreated if damaged or lost using ``check --repair``.
 
-|project_name| will do a POSIX read lock on that file when operating
-on the repository.
+Indexes memory usage
+--------------------
+
+Here is the estimated memory usage of |project_name| when using those
+indexes:
+
+Repository index
+    40 bytes x N ~ 200MB (If a remote repository is
+    used this will be allocated on the remote side)
+
+Chunk lookup index
+    44 bytes x N ~ 220MB
+
+File chunk cache
+    probably 80-100 bytes x N ~ 400MB
 
 Encryption
 ----------