mirror of https://github.com/borgbackup/borg.git synced 2024-12-24 00:37:56 +00:00

improve cache / index docs, esp. files cache docs, fixes #1825

Thomas Waldmann 2016-11-24 01:53:23 +01:00
parent df5482d7fc
commit c8b58e0fd8


@@ -252,44 +252,94 @@ For some more general usage hints see also ``--chunker-params``.
Indexes / Caches
----------------
The **files cache** is stored in ``cache/files`` and is indexed on the
``file path hash``. At backup time, it is used to quickly determine whether we
need to chunk a given file (or whether it is unchanged and we already have all
its pieces).
It contains:
The **files cache** is stored in ``cache/files`` and is used at backup time to
quickly determine whether a given file is unchanged and we have all its chunks.
* age
* file inode number
* file size
* file mtime_ns
* file content chunk hashes
The files cache is a key -> value mapping and contains:
The inode number is stored to make sure we distinguish between
* key:
- full, absolute file path id_hash
* value:
- file inode number
- file size
- file mtime_ns
- list of file content chunk id hashes
- age (0 [newest], 1, 2, 3, ..., BORG_FILES_CACHE_TTL - 1)
To determine whether a file has not changed, cached values are looked up via
the key in the mapping and compared to the current file attribute values.
If the file's size, mtime_ns and inode number are still the same, it is
considered to not have changed. In that case, we check that all file content
chunks are (still) present in the repository (we check that via the chunks
cache).
If everything matches and all chunks are present, the file is not read /
chunked / hashed again. A file metadata item is still written to the archive,
though, built from fresh file metadata read from the filesystem. This is
what makes borg so fast when processing unchanged files.
If there is a mismatch or a chunk is missing, the file is read / chunked /
hashed. Chunks already present in repo won't be transferred to repo again.
The inode number is stored and compared to make sure we distinguish between
different files, as a single path may not be unique across different
archives in different setups.
The files cache is stored as a python associative array storing
python objects, which generates a lot of overhead.
Not all filesystems have stable inode numbers. If that is the case, borg can
be told to ignore the inode number in the check via ``--ignore-inode``.
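The files cache check described above can be sketched roughly as follows. This is only an illustration, not borg's actual code: ``FilesCacheEntry``, ``FileAttrs``, ``path_key`` and ``reusable_chunks`` are hypothetical names, plain dicts stand in for both caches, and SHA-256 is an assumed stand-in for the real ``id_hash``.

```python
import hashlib
import os
from typing import List, NamedTuple, Optional


class FilesCacheEntry(NamedTuple):
    # Value part of the files cache mapping (hypothetical layout).
    inode: int
    size: int
    mtime_ns: int
    chunk_ids: List[bytes]
    age: int


class FileAttrs(NamedTuple):
    # Current attributes of the file, e.g. taken from os.stat().
    inode: int
    size: int
    mtime_ns: int


def path_key(path: str) -> bytes:
    # Assumption: SHA-256 of the absolute path stands in for id_hash.
    return hashlib.sha256(os.fsencode(os.path.abspath(path))).digest()


def reusable_chunks(files_cache: dict, chunks_cache: dict, path: str,
                    attrs: FileAttrs,
                    ignore_inode: bool = False) -> Optional[List[bytes]]:
    """Return the cached chunk ids if the file counts as unchanged, else None."""
    entry = files_cache.get(path_key(path))
    if entry is None:
        return None  # never seen: file must be read / chunked / hashed
    if entry.size != attrs.size or entry.mtime_ns != attrs.mtime_ns:
        return None  # attribute mismatch
    if not ignore_inode and entry.inode != attrs.inode:
        return None  # inode differs and inode checking is enabled
    if not all(chunk_id in chunks_cache for chunk_id in entry.chunk_ids):
        return None  # at least one chunk is missing
    return entry.chunk_ids
```

A caller would build ``FileAttrs`` from ``os.stat(path)`` (``st_ino``, ``st_size``, ``st_mtime_ns``). A ``None`` result means the file has to be read / chunked / hashed again.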
The **chunks cache** is stored in ``cache/chunks`` and is indexed on the
``chunk id_hash``. It is used to determine whether we already have a specific
chunk, to count references to it and also for statistics.
It contains:
The age value is used for cache management. If a file is "seen" in a backup
run, its age is reset to 0, otherwise its age is incremented by one.
If a file was not seen in BORG_FILES_CACHE_TTL backups, its cache entry is
removed. See also: :ref:`always_chunking` and :ref:`a_status_oddity`
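The age handling can be illustrated with a small sketch. It assumes the cache is a plain dict whose values carry an ``age`` field; the function ``age_entries`` and its parameters are hypothetical, and ``ttl`` stands for the value of BORG_FILES_CACHE_TTL.

```python
def age_entries(files_cache: dict, seen_keys: set, ttl: int) -> dict:
    """Apply one backup run's aging to a files cache (illustrative only)."""
    result = {}
    for key, entry in files_cache.items():
        # Seen files are reset to age 0, all others get one run older.
        age = 0 if key in seen_keys else entry["age"] + 1
        # Valid ages are 0 .. ttl - 1; older entries are dropped.
        if age < ttl:
            result[key] = {**entry, "age": age}
    return result
```

With ``ttl=2``, an entry that goes unseen for two consecutive runs reaches age 2 and is dropped. The next time that file shows up, it has no cache entry and is treated as changed.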
* reference count
* size
* encrypted/compressed size
The files cache is a Python dictionary storing Python objects, which
generates a lot of overhead.
The **repository index** is stored in ``repo/index.%d`` and is indexed on the
``chunk id_hash``. It is used to determine a chunk's location in the repository.
It contains:
Borg can also work without the files cache (this saves memory if you have a
lot of files or not much free RAM). In that case, all files are assumed to
have changed. This is usually much slower than with the files cache.
* segment (that contains the chunk)
* offset (where the chunk is located in the segment)
The **chunks cache** is stored in ``cache/chunks`` and is used to determine
whether we already have a specific chunk, to count references to it and also
for statistics.
The chunks cache is a key -> value mapping and contains:
* key:
- chunk id_hash
* value:
- reference count
- size
- encrypted/compressed size
The chunks cache is a hashindex, a hash table implemented in C and tuned for
memory efficiency.
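To make the role of the reference count concrete, here is a minimal sketch of the chunks cache mapping. It uses a plain Python dict, not the C hashindex, and SHA-256 as an assumed stand-in for ``id_hash``. The helper ``add_chunk`` and its ``stored_size`` parameter are hypothetical.

```python
import hashlib
from typing import Dict, Tuple

# chunk id -> (reference count, size, encrypted/compressed size)
ChunksCache = Dict[bytes, Tuple[int, int, int]]


def add_chunk(chunks_cache: ChunksCache, data: bytes,
              stored_size: int) -> Tuple[bytes, bool]:
    """Register one use of a chunk; return (chunk id, needs_upload)."""
    chunk_id = hashlib.sha256(data).digest()  # assumption: stand-in for id_hash
    if chunk_id in chunks_cache:
        # Known chunk: only bump the reference count, nothing to transfer.
        refcount, size, csize = chunks_cache[chunk_id]
        chunks_cache[chunk_id] = (refcount + 1, size, csize)
        return chunk_id, False
    # New chunk: first reference; the caller would store it in the repository.
    chunks_cache[chunk_id] = (1, len(data), stored_size)
    return chunk_id, True
```

The second return value tells the caller whether the chunk data has to be sent to the repository, so identical chunks are stored only once while their reference count still reflects every use.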
The **repository index** is stored in ``repo/index.%d`` and is used to
determine a chunk's location in the repository.
The repo index is a key -> value mapping and contains:
* key:
- chunk id_hash
* value:
- segment (that contains the chunk)
- offset (where the chunk is located in the segment)
The repo index is a hashindex, a hash table implemented in C and tuned for
memory efficiency.
The repository index file is random access.
Hints are stored in a file (``repo/hints.%d``).
It contains:
* version