mirror of
https://github.com/borgbackup/borg.git
synced 2025-01-24 08:19:03 +00:00
reorder to be more logical and more gentle
parent
ddca3b856b
commit
688ba109ef
1 changed files with 55 additions and 49 deletions
|
@ -33,43 +33,25 @@ index.%d
|
|||
cache of the file indexes. Those files can be regenerated with
``check --repair``
|
||||
|
||||
Indexes and memory usage
------------------------
|
||||
Repository config file
----------------------
|
||||
|
||||
Repository index
  40 bytes x N ~ 200MB (If a remote repository is
  used this will be allocated on the remote side)
|
||||
Each repository has a ``config`` file, which is an ``INI``
formatted file that looks like this:
|
||||
|
||||
Chunk lookup index
  44 bytes x N ~ 220MB
|
||||
    [repository]
    version = 1
    segments_per_dir = 10000
    max_segment_size = 5242880
    id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
|
||||
|
||||
File chunk cache
  probably 80-100 bytes x N ~ 400MB
|
||||
This is where the ``repository.id`` is stored. It is a unique
identifier for repositories. It will not change if you move the
repository around, so you can make a local transfer and then decide to
move the repository to another (even remote) location at a later time.
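
As a minimal sketch, a file of this shape can be read with Python's standard ``configparser`` module. This only illustrates the format and is not necessarily how |project_name| itself reads the file; ``CONFIG_TEXT`` is a name made up for the example, and its values are copied from the sample config shown in this section.

```python
import configparser

# Sample text copied from the config example in this section.
# CONFIG_TEXT is a hypothetical name used only for this sketch.
CONFIG_TEXT = """\
[repository]
version = 1
segments_per_dir = 10000
max_segment_size = 5242880
id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

repo = parser["repository"]
repository_id = repo["id"]                           # hex string identifying the repository
max_segment_size = repo.getint("max_segment_size")   # value in bytes
segments_per_dir = repo.getint("segments_per_dir")
```

Because the ``id`` lives inside the file rather than being derived from the path, copying or moving the whole repository directory keeps the same identifier.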
|
||||
|
||||
The chunk lookup index (chunk hash -> reference count, size, ciphered
size; in file cache/chunk) and the repository index (chunk hash ->
segment, offset; in file ``repo/index.%d``) are stored in a kind of hash
table that is mapped directly into memory from the file content. Each
bucket has only one slot, and collisions spill over into the following
buckets. As a consequence, the hash is just the start position for a
linear search: if the element is not in the table, the index is scanned
linearly until an empty bucket is found. When the table is 90% full its
size is doubled; when it is only 25% full its size is halved. Operations
on it therefore have a variable complexity between constant and linear
with a low factor, and the memory overhead varies between 10% and 300%.
|
||||
|
||||
The file chunk cache (file path hash -> age, inode number, size,
mtime_ns, chunk hashes; in file cache/files) is stored as a Python
associative array of Python objects, which generates a lot of
overhead. This takes around 240 bytes per file without the chunk
list, compared to at most 64 bytes of real data (depending on data
alignment), and around 80 bytes per chunk hash (vs. 32), with a minimum
of ~250 bytes even if there is only one chunk hash. The inode number is
stored to make sure we distinguish between different files, as a single
path may not be unique across different archives in different setups.
|
||||
|
||||
The ``index.%d`` files are random access, but they can be
recreated with ``check --repair`` if damaged or lost.
|
||||
|project_name| takes a POSIX read lock on that file while operating
on the repository.
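
For illustration, a POSIX read (shared) lock can be taken from Python with ``fcntl.lockf``. This is a sketch of the locking primitive only, assuming a POSIX system; the helper name ``with_read_lock`` and the throwaway file are hypothetical and do not reflect |project_name|'s actual code.

```python
import fcntl
import os
import tempfile

def with_read_lock(path, action):
    """Run action() while holding a shared (read) lock on the file at path."""
    with open(path, "rb") as f:
        # LOCK_SH requests a shared lock; the call blocks while another
        # process holds an exclusive lock on the same file.
        fcntl.lockf(f, fcntl.LOCK_SH)
        try:
            return action()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)

# Usage with a temporary stand-in file (hypothetical path and content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config")
    with open(path, "w") as f:
        f.write("[repository]\n")
    result = with_read_lock(path, lambda: "locked")
```

Several readers can hold a shared lock at the same time, which is why a read lock suits operations that only need to keep writers away.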
|
||||
|
||||
Repository structure
--------------------
|
||||
|
@ -115,7 +97,7 @@ beyond the ``MAX_OBJECT_SIZE`` barrier of 20MB.
|
|||
A chunk is an object as well, of course, and its id is the hash of its
(unencrypted and uncompressed) content.
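
A small sketch of what "id is the hash of the content" implies: identical plaintext always yields the same id, whatever compression or encryption is applied afterwards. The text does not name the hash function here, so plain SHA-256 is used purely as an assumption for illustration, and ``chunk_id`` is a hypothetical helper.

```python
import hashlib

def chunk_id(plaintext: bytes) -> str:
    """Return an id for a chunk, computed on the raw (plaintext) content.

    ASSUMPTION: SHA-256 stands in for whatever hash is really used.
    """
    return hashlib.sha256(plaintext).hexdigest()

id_a = chunk_id(b"some file content")
id_b = chunk_id(b"some file content")   # same content, same id
id_c = chunk_id(b"other content")       # different content, different id
```

Computing the id before compression and encryption is what allows two identical chunks to be recognized as the same object.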
|
||||
|
||||
Hints are stored in a file (repo/hints) and contain: version, list of
|
||||
Hints are stored in a file (``repo/hints``) and contain: version, list of
|
||||
segments, compact.
|
||||
|
||||
Chunks
|
||||
|
@ -128,25 +110,49 @@ average. All these parameters are fixed. The buzhash table is altered
|
|||
by XORing it with a seed randomly generated once for the archive, and
stored encrypted in the keyfile.
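
The seeding step can be sketched as follows. The table size (256 entries), the 32-bit entry width, the table contents and the seed value are all assumptions chosen for the example; only the idea of XORing every table entry with one seed comes from the text.

```python
import random

MASK32 = 0xFFFFFFFF  # ASSUMPTION: 32-bit table entries

# Hypothetical stand-in for the fixed base table (256 entries assumed).
_rng = random.Random(0)
BASE_TABLE = [_rng.getrandbits(32) for _ in range(256)]

def seeded_table(base, seed):
    """Return a copy of the table with every entry XORed with the seed."""
    return [(entry ^ seed) & MASK32 for entry in base]

seed = 0x12345678                       # hypothetical seed value
table = seeded_table(BASE_TABLE, seed)

# XOR is its own inverse: applying the same seed again restores the base table.
restored = seeded_table(table, seed)
```

Without the seed, the altered table cannot be derived from the public base table, which is presumably why the seed is kept encrypted.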
|
||||
|
||||
Repository config file
----------------------
|
||||
Indexes
-------
|
||||
|
||||
Each repository has a ``config`` file, which is an ``INI``
formatted file that looks like this:
|
||||
The chunk lookup index (chunk hash -> reference count, size, ciphered
size; in file cache/chunk) and the repository index (chunk hash ->
segment, offset; in file ``repo/index.%d``) are stored in a kind of hash
table that is mapped directly into memory from the file content. Each
bucket has only one slot, and collisions spill over into the following
buckets. As a consequence, the hash is just the start position for a
linear search: if the element is not in the table, the index is scanned
linearly until an empty bucket is found. When the table is 90% full its
size is doubled; when it is only 25% full its size is halved. Operations
on it therefore have a variable complexity between constant and linear
with a low factor, and the memory overhead varies between 10% and 300%.
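
The scheme described above is open addressing with linear probing. The following toy class is a sketch of that technique under stated assumptions, not the project's implementation: it keeps buckets in a Python list instead of a memory-mapped file, and the class name, ``MIN_BUCKETS`` and the re-insertion strategy on delete are choices made for the example. Only the 90% and 25% thresholds come from the text.

```python
class LinearProbingTable:
    """Toy hash table: one slot per bucket, collisions go to the next bucket."""

    MAX_LOAD = 0.90   # grow when more than 90% full (from the text)
    MIN_LOAD = 0.25   # shrink when less than 25% full (from the text)
    MIN_BUCKETS = 8   # hypothetical lower bound for this sketch

    def __init__(self):
        self.buckets = [None] * self.MIN_BUCKETS
        self.used = 0

    def _probe(self, key):
        """Return the index holding key, or the first empty bucket after its hash."""
        n = len(self.buckets)
        i = hash(key) % n   # the hash is only the start position
        while self.buckets[i] is not None and self.buckets[i][0] != key:
            i = (i + 1) % n  # linear search through the following buckets
        return i

    def _resize(self, n):
        entries = [b for b in self.buckets if b is not None]
        self.buckets = [None] * n
        for key, value in entries:
            self.buckets[self._probe(key)] = (key, value)

    def get(self, key, default=None):
        bucket = self.buckets[self._probe(key)]
        return default if bucket is None else bucket[1]

    def put(self, key, value):
        i = self._probe(key)
        if self.buckets[i] is None:
            self.used += 1
        self.buckets[i] = (key, value)
        if self.used > self.MAX_LOAD * len(self.buckets):
            self._resize(2 * len(self.buckets))

    def delete(self, key):
        i = self._probe(key)
        if self.buckets[i] is None:
            return
        self.buckets[i] = None
        self.used -= 1
        # Re-insert the rest of the cluster so later lookups do not stop early.
        n = len(self.buckets)
        j = (i + 1) % n
        while self.buckets[j] is not None:
            k, v = self.buckets[j]
            self.buckets[j] = None
            self.buckets[self._probe(k)] = (k, v)
            j = (j + 1) % n
        if n > self.MIN_BUCKETS and self.used < self.MIN_LOAD * n:
            self._resize(n // 2)


table = LinearProbingTable()
for i in range(100):
    table.put(i, i * i)
hit = table.get(7)
miss = table.get(1000)
```

The two thresholds also explain the quoted overhead range: a table that is 90% full carries roughly 10% unused buckets relative to its data, while one that is 25% full has three empty buckets for every used one, i.e. 300%.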
|
||||
|
||||
    [repository]
    version = 1
    segments_per_dir = 10000
    max_segment_size = 5242880
    id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
|
||||
The file chunk cache (file path hash -> age, inode number, size,
mtime_ns, chunk hashes; in file cache/files) is stored as a Python
associative array of Python objects, which generates a lot of
overhead. This takes around 240 bytes per file without the chunk
list, compared to at most 64 bytes of real data (depending on data
alignment), and around 80 bytes per chunk hash (vs. 32), with a minimum
of ~250 bytes even if there is only one chunk hash. The inode number is
stored to make sure we distinguish between different files, as a single
path may not be unique across different archives in different setups.
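
The per-object overhead mentioned above can be observed directly in CPython. This is only a rough illustration: the exact numbers depend on the interpreter version and platform, and it measures a single ``bytes`` object rather than the full cache entry, so it is not expected to reproduce the 80-byte figure.

```python
import sys

chunk_hash = bytes(32)                    # 32 bytes of real data
object_size = sys.getsizeof(chunk_hash)   # real data plus the object header
overhead = object_size - len(chunk_hash)  # bytes spent on bookkeeping alone
```

Any container holding such objects (list, tuple, dict) adds its own header and one pointer per element on top of this.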
|
||||
|
||||
This is where the ``repository.id`` is stored. It is a unique
identifier for repositories. It will not change if you move the
repository around, so you can make a local transfer and then decide to
move the repository to another (even remote) location at a later time.
|
||||
The ``index.%d`` files are random access, but they can be
recreated with ``check --repair`` if damaged or lost.
|
||||
|
||||
|project_name| takes a POSIX read lock on that file while operating
on the repository.
|
||||
Indexes memory usage
--------------------
|
||||
|
||||
Here is the estimated memory usage of |project_name| when using those
indexes:
|
||||
|
||||
Repository index
  40 bytes x N ~ 200MB (If a remote repository is
  used this will be allocated on the remote side)
|
||||
|
||||
Chunk lookup index
  44 bytes x N ~ 220MB
|
||||
|
||||
File chunk cache
  probably 80-100 bytes x N ~ 400MB
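
The text does not state N, but the first two figures are consistent with N of about five million entries. The sketch below redoes the arithmetic under that assumption; ``N``, ``BYTES_PER_ENTRY`` and ``estimate_mb`` are names invented for the example, and using the same N for the file chunk cache is also an assumption.

```python
N = 5_000_000  # ASSUMPTION: inferred from 40 bytes x N ~ 200MB, not stated in the text

# (low, high) bytes per entry, taken from the list above.
BYTES_PER_ENTRY = {
    "repository index": (40, 40),
    "chunk lookup index": (44, 44),
    "file chunk cache": (80, 100),
}

def estimate_mb(n):
    """Return {index name: (low MB, high MB)} for n entries, with 1 MB = 10**6 bytes."""
    return {
        name: (low * n / 1e6, high * n / 1e6)
        for name, (low, high) in BYTES_PER_ENTRY.items()
    }

estimates = estimate_mb(N)
```

With this N the file chunk cache comes out at 400 to 500 MB, so the quoted ~400MB corresponds to the lower end of the 80-100 byte range.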
|
||||
|
||||
Encryption
----------
|
||||
|
|