mirror of
https://github.com/borgbackup/borg.git
synced 2024-12-24 08:45:13 +00:00
docs: data structures: demingle cache and repo index
parent
06980525fa
commit
2b0e7bc924
1 changed files with 65 additions and 44 deletions
|
@ -104,12 +104,37 @@ to the file containing the object id and data. If an object is deleted
|
|||
a ``DELETE`` entry is appended with the object id.
|
||||
|
||||
A ``COMMIT`` tag is written when a repository transaction is
|
||||
committed.
|
||||
committed. The segment number of the segment containing
|
||||
a commit is the **transaction ID**.
|
||||
|
||||
When a repository is opened any ``PUT`` or ``DELETE`` operations not
|
||||
followed by a ``COMMIT`` tag are discarded since they are part of a
|
||||
partial/uncommitted transaction.
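
The discard rule above can be sketched in a few lines. This is only an illustration, not Borg's actual replay code: the ``replay`` function and the ``(tag, object_id, data)`` tuple layout are hypothetical, and segments are flattened into a single list of entries.

```python
def replay(entries):
    """Return the committed object state for a list of log entries.

    Each entry is a hypothetical (tag, object_id, data) tuple, where tag is
    "PUT", "DELETE" or "COMMIT". This is not Borg's on-disk format.
    """
    committed = {}   # state as of the last COMMIT
    pending = []     # operations seen since the last COMMIT
    for tag, object_id, data in entries:
        if tag == "COMMIT":
            # The transaction is complete: apply everything collected so far.
            for op, oid, payload in pending:
                if op == "PUT":
                    committed[oid] = payload
                else:  # "DELETE"
                    committed.pop(oid, None)
            pending.clear()
        else:
            pending.append((tag, object_id, data))
    # Whatever is still pending was never committed and is dropped.
    return committed


log = [
    ("PUT", b"a", b"1"),
    ("PUT", b"b", b"2"),
    ("COMMIT", b"", b""),
    ("DELETE", b"a", b""),   # no COMMIT follows: discarded
    ("PUT", b"c", b"3"),     # no COMMIT follows: discarded
]
state = replay(log)
```

Here ``state`` still contains ``a`` and ``b`` and does not contain ``c``, because the trailing operations belong to an uncommitted transaction.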
|
||||
|
||||
Index, hints and integrity
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The **repository index** is stored in ``index.<TRANSACTION_ID>`` and is used to
|
||||
determine an object's location in the repository. It is a HashIndex_,
|
||||
a hash table using open addressing. It maps object keys_ to two
|
||||
unsigned 32-bit integers; the first integer gives the segment number,
|
||||
the second indicates the offset of the object's entry within the segment.
|
||||
|
||||
The **hints file** is a msgpacked file named ``hints.<TRANSACTION_ID>``.
|
||||
It contains:
|
||||
|
||||
* version
|
||||
* list of segments
|
||||
* compact
|
||||
|
||||
The **integrity file** is a msgpacked file named ``integrity.<TRANSACTION_ID>``.
|
||||
It contains checksums of the index and hints files and is described in the
|
||||
:ref:`Checksumming data structures <integrity_repo>` section below.
|
||||
|
||||
If the index or hints are corrupted, they are re-generated automatically.
|
||||
If they are outdated, segments are replayed from the index state to the currently
|
||||
committed transaction.
|
||||
|
||||
Compaction
|
||||
~~~~~~~~~~
|
||||
|
||||
|
@ -384,13 +409,13 @@ For some more general usage hints see also ``--chunker-params``.
|
|||
|
||||
.. _cache:
|
||||
|
||||
Indexes / Caches
|
||||
----------------
|
||||
The cache
|
||||
---------
|
||||
|
||||
The **files cache** is stored in ``cache/files`` and is used at backup time to
|
||||
quickly determine whether a given file is unchanged and we have all its chunks.
|
||||
|
||||
The files cache is a key -> value mapping and contains:
|
||||
In memory, the files cache is a key -> value mapping (a Python *dict*) and contains:
|
||||
|
||||
* key:
|
||||
|
||||
|
@ -438,6 +463,10 @@ Borg can also work without using the files cache (saves memory if you have a
|
|||
lot of files or not much RAM free), then all files are assumed to have changed.
|
||||
This is usually much slower than with the files cache.
|
||||
|
||||
The on-disk format of the files cache is a stream of msgpacked tuples (key, value).
|
||||
Loading the files cache involves reading the file, one msgpack object at a time,
|
||||
unpacking it, and msgpacking the value (in an effort to save memory).
|
||||
|
||||
The **chunks cache** is stored in ``cache/chunks`` and is used to determine
|
||||
whether we already have a specific chunk, to count references to it and also
|
||||
for statistics.
|
||||
|
@ -453,46 +482,7 @@ The chunks cache is a key -> value mapping and contains:
|
|||
- size
|
||||
- encrypted/compressed size
|
||||
|
||||
The chunks cache is a hashindex, a hash table implemented in C and tuned for
|
||||
memory efficiency.
|
||||
|
||||
The **repository index** is stored in ``repo/index.%d`` and is used to
|
||||
determine a chunk's location in the repository.
|
||||
|
||||
The repo index is a key -> value mapping and contains:
|
||||
|
||||
* key:
|
||||
|
||||
- chunk id_hash
|
||||
* value:
|
||||
|
||||
- segment (that contains the chunk)
|
||||
- offset (where the chunk is located in the segment)
|
||||
|
||||
The repo index is a hashindex, a hash table implemented in C and tuned for
|
||||
memory efficiency.
|
||||
|
||||
|
||||
Hints are stored in a file (``repo/hints.%d``).
|
||||
|
||||
It contains:
|
||||
|
||||
* version
|
||||
* list of segments
|
||||
* compact
|
||||
|
||||
hints and index can be recreated if damaged or lost using ``check --repair``.
|
||||
|
||||
The chunks cache and the repository index are stored as hash tables, with
|
||||
only one slot per bucket, but that spreads the collisions to the following
|
||||
buckets. As a consequence the hash is just a start position for a linear
|
||||
search, and if the element is not in the table the index is linearly crossed
|
||||
until an empty bucket is found.
|
||||
|
||||
When the hash table is filled to 75%, its size is grown. When it's
|
||||
emptied to 25%, its size is shrunk. So operations on it have a variable
|
||||
complexity between constant and linear with low factor, and memory overhead
|
||||
varies between 33% and 300%.
|
||||
The chunks cache is a HashIndex_.
|
||||
|
||||
.. _cache-memory-usage:
|
||||
|
||||
|
@ -556,6 +546,35 @@ b) with ``create --chunker-params 19,23,21,4095`` (default):
|
|||
You'll save some memory, but it will need to read / chunk all the files as
|
||||
it cannot skip unmodified files then.
|
||||
|
||||
HashIndex
|
||||
---------
|
||||
|
||||
The chunks cache and the repository index are stored as hash tables, with
|
||||
only one slot per bucket, spreading hash collisions to the following
|
||||
buckets. As a consequence the hash is just a start position for a linear
|
||||
search, and if the element is not in the table, the index is scanned linearly
|
||||
until an empty bucket is found.
|
||||
|
||||
This particular mode of operation is open addressing with linear probing.
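
A toy version of this scheme is sketched below. It is not the C implementation used by Borg: the class name ``LinearProbingTable`` and its methods are made up for illustration, deletion is left out (it would need tombstones or re-insertion), and growth simply doubles the capacity once the table would exceed a 75% fill.

```python
class LinearProbingTable:
    """Toy open-addressing hash table: one slot per bucket, linear probing."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity   # each slot is None or (key, value)
        self.used = 0

    def _probe(self, key):
        # The hash only selects the start position of the linear search.
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            slot = self.slots[i]
            if slot is None or slot[0] == key:
                return i                 # empty bucket or matching key
            i = (i + 1) % len(self.slots)
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        slot = self.slots[self._probe(key)]
        return default if slot is None else slot[1]

    def put(self, key, value):
        i = self._probe(key)
        if self.slots[i] is None:
            # New key: grow first if this insert would exceed a 75% fill.
            if self.used + 1 > 0.75 * len(self.slots):
                self._grow()
                i = self._probe(key)
            self.used += 1
        self.slots[i] = (key, value)

    def _grow(self):
        old = [s for s in self.slots if s is not None]
        self.slots = [None] * (2 * len(self.slots))
        self.used = 0
        for k, v in old:
            self.put(k, v)


table = LinearProbingTable(8)
table.put(0, "first")
table.put(8, "second")   # hash(8) % 8 == 0, so it collides with key 0
```

After these two inserts, key ``0`` sits in bucket 0 and key ``8`` has spilled over into bucket 1. A lookup for a missing key walks forward from its start position until it reaches an empty bucket.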
|
||||
|
||||
When the hash table is filled to 75%, its size is grown. When it's
|
||||
emptied to 25%, its size is shrunk. Operations on it have a variable
|
||||
complexity between constant and linear with low factor, and memory overhead
|
||||
varies between 33% and 300%.
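
One way to read the 33% and 300% figures, assuming "memory overhead" means unused buckets relative to used ones, is that a table with load factor ``f`` allocates ``1 / f`` buckets per stored entry. The helper below is purely illustrative and not part of Borg.

```python
def memory_overhead(load_factor):
    """Unused buckets per used bucket, for a fill ratio between 0 and 1."""
    return 1.0 / load_factor - 1.0


overhead_when_full = memory_overhead(0.75)    # just before the table grows
overhead_when_sparse = memory_overhead(0.25)  # just before the table shrinks
```

At a 75% fill this gives about 0.33 (33%), and at a 25% fill it gives 3.0 (300%), which matches the bounds stated above.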
|
||||
|
||||
Further, if the number of empty slots becomes too low (recall that linear probing
|
||||
for an element not in the index stops at the first empty slot), the hash table
|
||||
is rebuilt. The maximum *effective* load factor is 93%.
|
||||
|
||||
Data in a HashIndex is always stored in little-endian format, which increases
|
||||
efficiency for almost everyone, since basically no one uses big-endian processors
|
||||
any more.
|
||||
|
||||
The format is easy to read and write, because the buckets array has the same layout
|
||||
in memory and on disk. Only the header formats differ.
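
As a rough sketch of what such a bucket could look like for the repository index, the snippet below packs a key followed by the two unsigned 32-bit integers (segment, offset) in little-endian order using Python's ``struct`` module. The 32-byte key size and the names ``KEY_SIZE``, ``BUCKET``, ``pack_bucket`` and ``unpack_bucket`` are assumptions made for this example; the real layout is defined by the C code.

```python
import struct

KEY_SIZE = 32  # assumption for this sketch: a 256-bit object key

# "<" selects little-endian with no padding; "I" is an unsigned 32-bit int.
BUCKET = struct.Struct("<%dsII" % KEY_SIZE)


def pack_bucket(key, segment, offset):
    """Serialize one (key -> segment, offset) entry."""
    return BUCKET.pack(key, segment, offset)


def unpack_bucket(raw):
    """Inverse of pack_bucket; returns (key, segment, offset)."""
    return BUCKET.unpack(raw)


key = bytes(range(KEY_SIZE))
raw = pack_bucket(key, 1, 258)
```

Because the byte order is fixed, the same bytes can be written to disk and read back on any platform without conversion on little-endian machines.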
|
||||
|
||||
.. todo:: Describe HashHeader
|
||||
|
||||
Encryption
|
||||
----------
|
||||
|
||||
|
@ -862,6 +881,8 @@ which writes the integrity data to a separate ".integrity" file.
|
|||
Integrity errors result in deleting the affected index and rebuilding it.
|
||||
This logs a warning and increases the exit code to WARNING (1).
|
||||
|
||||
.. _integrity_repo:
|
||||
|
||||
.. rubric:: Repository index and hints
|
||||
|
||||
The repository associates index and hints files with a transaction by including the
|
||||
|
|