reorder to be more logical and more gentle

Antoine Beaupré 2014-12-16 10:35:48 -05:00
parent ddca3b856b
commit 688ba109ef
1 changed files with 55 additions and 49 deletions


@ -33,43 +33,25 @@ index.%d
cache of the file indexes. those files can be regenerated with
``check --repair``
Indexes and memory usage
------------------------
Repository config file
----------------------
Repository index
40 bytes x N ~ 200MB (If a remote repository is
used this will be allocated on the remote side)
Each repository has a ``config`` file, which is an ``INI``-formatted
file that looks like this:
Chunk lookup index
44 bytes x N ~ 220MB
[repository]
version = 1
segments_per_dir = 10000
max_segment_size = 5242880
id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
File chunk cache
probably 80-100 bytes x N ~ 400MB
This is where the ``repository.id`` is stored. It is a unique
identifier for repositories. It will not change if you move the
repository around, so you can make a local transfer and then decide to
move the repository to another (even remote) location at a later time.
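As a rough illustration of the ``config`` format described above, the sample file can be read with Python's standard ``configparser`` module. This is only a sketch: the helper name ``load_repository_config`` is hypothetical, and decoding the ``id`` from hexadecimal is an assumption based on how the value looks, not a statement about how |project_name| reads the file.

```python
import configparser

# Sample content copied from the example config shown above.
SAMPLE_CONFIG = """\
[repository]
version = 1
segments_per_dir = 10000
max_segment_size = 5242880
id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
"""


def load_repository_config(text):
    """Parse the [repository] section into typed Python values.

    Hypothetical helper for illustration only.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["repository"]
    return {
        "version": section.getint("version"),
        "segments_per_dir": section.getint("segments_per_dir"),
        "max_segment_size": section.getint("max_segment_size"),
        # Assumption: the id is a hex string, decoded here to raw bytes.
        "id": bytes.fromhex(section["id"]),
    }


config = load_repository_config(SAMPLE_CONFIG)
print(config["version"], config["max_segment_size"], len(config["id"]))
```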
The chunk lookup index (chunk hash -> reference count, size, ciphered
size ; in file cache/chunk) and the repository index (chunk hash ->
segment, offset ; in file ``repo/index.%d``) are stored in a sort of hash
table, directly mapped in memory from the file content, with only one
slot per bucket, but that spreads the collisions to the following
buckets. As a consequence the hash is just a start position for a linear
search, and if the element is not in the table the index is scanned
linearly until an empty bucket is found. When the table is 90% full
its size is doubled; when it is only 25% full its size is halved. So
operations on it have a variable complexity between constant and linear
with a low factor, and memory overhead varies between 10% and 300%.
The file chunk cache (file path hash -> age, inode number, size,
mtime_ns, chunk hashes ; in file cache/files) is stored as a Python
associative array storing Python objects, which generates a lot of
overhead. This takes around 240 bytes per file without the chunk
list, compared to at most 64 bytes of real data (depending on data
alignment), and around 80 bytes per chunk hash (vs 32), with a minimum
of ~250 bytes even with only one chunk hash. The inode number is stored
to make sure we distinguish between different files, as a single path
may not be unique across different archives in different setups.
The ``index.%d`` files are random access but those files can be
recreated if damaged or lost using ``check --repair``.
|project_name| will do a POSIX read lock on that file when operating
on the repository.
Repository structure
--------------------
@ -115,7 +97,7 @@ beyond the ``MAX_OBJECT_SIZE`` barrier of 20MB.
A chunk is an object as well, of course, and its id is the hash of its
(unencrypted and uncompressed) content.
Hints are stored in a file (repo/hints) and contain: version, list of
Hints are stored in a file (``repo/hints``) and contain: version, list of
segments, compact.
Chunks
@ -128,25 +110,49 @@ average. All these parameters are fixed. The buzhash table is altered
by XORing it with a seed randomly generated once for the archive, and
stored encrypted in the keyfile.
Repository config file
----------------------
Indexes
-------
Each repository has a ``config`` file, which is an ``INI``-formatted
file that looks like this:
The chunk lookup index (chunk hash -> reference count, size, ciphered
size ; in file cache/chunk) and the repository index (chunk hash ->
segment, offset ; in file ``repo/index.%d``) are stored in a sort of hash
table, directly mapped in memory from the file content, with only one
slot per bucket, but that spreads the collisions to the following
buckets. As a consequence the hash is just a start position for a linear
search, and if the element is not in the table the index is scanned
linearly until an empty bucket is found. When the table is 90% full
its size is doubled; when it is only 25% full its size is halved. So
operations on it have a variable complexity between constant and linear
with a low factor, and memory overhead varies between 10% and 300%.
[repository]
version = 1
segments_per_dir = 10000
max_segment_size = 5242880
id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
The file chunk cache (file path hash -> age, inode number, size,
mtime_ns, chunk hashes ; in file cache/files) is stored as a Python
associative array storing Python objects, which generates a lot of
overhead. This takes around 240 bytes per file without the chunk
list, compared to at most 64 bytes of real data (depending on data
alignment), and around 80 bytes per chunk hash (vs 32), with a minimum
of ~250 bytes even with only one chunk hash. The inode number is stored
to make sure we distinguish between different files, as a single path
may not be unique across different archives in different setups.
This is where the ``repository.id`` is stored. It is a unique
identifier for repositories. It will not change if you move the
repository around, so you can make a local transfer and then decide to
move the repository to another (even remote) location at a later time.
The ``index.%d`` files are random access but those files can be
recreated if damaged or lost using ``check --repair``.
|project_name| will do a POSIX read lock on that file when operating
on the repository.
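The hash table behaviour described in this section (one slot per bucket, collisions spilling into the following buckets, doubling at 90% full and halving at 25% full) can be sketched in pure Python. This is a toy model under stated assumptions, not the real implementation: the actual indexes are memory-mapped from the file content, whereas this sketch uses a Python list, and the class name ``LinearProbeIndex``, the tombstone handling for deletions, and the minimum capacity of 8 are all hypothetical choices made for the example.

```python
_EMPTY = object()    # marks a bucket that was never used
_DELETED = object()  # tombstone: bucket was used, entry since removed


class LinearProbeIndex:
    """Toy open-addressing table with linear probing (illustration only)."""

    MAX_LOAD = 0.90       # grow when the table would exceed 90% full
    MIN_LOAD = 0.25       # shrink when fewer than 25% of buckets are live
    MIN_CAPACITY = 8      # assumption: never shrink below this size

    def __init__(self, capacity=MIN_CAPACITY):
        self._slots = [_EMPTY] * capacity
        self._count = 0    # live entries
        self._filled = 0   # live entries plus tombstones

    def _probe(self, key):
        # The hash is only a start position; walk the following buckets.
        n = len(self._slots)
        start = hash(key) % n
        for i in range(n):
            yield (start + i) % n

    def get(self, key, default=None):
        for pos in self._probe(key):
            slot = self._slots[pos]
            if slot is _EMPTY:          # empty bucket ends the search
                return default
            if slot is not _DELETED and slot[0] == key:
                return slot[1]
        return default

    def put(self, key, value):
        if self._filled + 1 > self.MAX_LOAD * len(self._slots):
            self._resize(len(self._slots) * 2)
        free = None
        for pos in self._probe(key):
            slot = self._slots[pos]
            if slot is _EMPTY:
                if free is None:
                    free = pos
                    self._filled += 1   # consuming a never-used bucket
                break
            if slot is _DELETED:
                if free is None:
                    free = pos          # reuse the first tombstone seen
            elif slot[0] == key:
                self._slots[pos] = (key, value)
                return
        self._slots[free] = (key, value)
        self._count += 1

    def delete(self, key):
        for pos in self._probe(key):
            slot = self._slots[pos]
            if slot is _EMPTY:
                break
            if slot is not _DELETED and slot[0] == key:
                self._slots[pos] = _DELETED
                self._count -= 1
                n = len(self._slots)
                if n > self.MIN_CAPACITY and self._count < self.MIN_LOAD * n:
                    self._resize(n // 2)
                return
        raise KeyError(key)

    def _resize(self, capacity):
        # Rebuild the table, dropping tombstones along the way.
        live = [s for s in self._slots if s is not _EMPTY and s is not _DELETED]
        self._slots = [_EMPTY] * capacity
        self._count = self._filled = 0
        for key, value in live:
            self.put(key, value)


index = LinearProbeIndex()
for i in range(100):
    index.put(i.to_bytes(4, "big"), (i, i * 2))  # fake "chunk hash" -> data
print(index.get((42).to_bytes(4, "big")), len(index._slots))
```

A lookup for a missing key walks buckets until it reaches an empty one, which is why the cost sits between constant and linear depending on how full the table is.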
Indexes memory usage
--------------------
Here is the estimated memory usage of |project_name| when using those
indexes:
Repository index
40 bytes x N ~ 200MB (If a remote repository is
used this will be allocated on the remote side)
Chunk lookup index
44 bytes x N ~ 220MB
File chunk cache
probably 80-100 bytes x N ~ 400MB
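The estimates above can be reproduced with a few lines of arithmetic. The per-entry sizes come from the list; the value of N is not stated in the text, so the 5 million used here is an assumption inferred from the quoted totals (40 bytes x 5,000,000 = 200MB). The function name ``estimate_index_memory`` is hypothetical, and the file chunk cache uses the lower bound of 80 bytes.

```python
# Bytes per entry, taken from the estimates listed above.
BYTES_PER_ENTRY = {
    "repository index": 40,
    "chunk lookup index": 44,
    "file chunk cache": 80,   # lower bound of the "80-100 bytes" estimate
}


def estimate_index_memory(n_entries):
    """Return estimated memory use in bytes for each index (rough sketch)."""
    return {name: size * n_entries for name, size in BYTES_PER_ENTRY.items()}


# Assumption: N = 5 million, inferred from the quoted totals.
estimate = estimate_index_memory(5_000_000)
for name, n_bytes in estimate.items():
    print(f"{name}: ~{n_bytes / 1_000_000:.0f}MB")
```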
Encryption
----------