Merge pull request #1875 from ThomasWaldmann/improve-files-cache-docs

improve cache / index docs, esp. files cache docs, fixes #1825
TW 2016-11-26 02:24:50 +01:00 committed by GitHub
commit cc20c98262
1 changed files with 76 additions and 26 deletions

@@ -252,44 +252,94 @@ For some more general usage hints see also ``--chunker-params``.
Indexes / Caches
----------------
The **files cache** is stored in ``cache/files`` and is used at backup time to
quickly determine whether a given file is unchanged and we have all its chunks.
The files cache is a key -> value mapping and contains:
* key:
  - full, absolute file path id_hash
* value:
  - file inode number
  - file size
  - file mtime_ns
  - list of file content chunk id hashes
  - age (0 [newest], 1, 2, 3, ..., BORG_FILES_CACHE_TTL - 1)
To determine whether a file has not changed, cached values are looked up via
the key in the mapping and compared to the current file attribute values.
If the file's size, mtime_ns and inode number are still the same, it is
considered unchanged. In that case, we check that all file content
chunks are (still) present in the repository (we check that via the chunks
cache).
If everything matches and all chunks are present, the file is not read /
chunked / hashed again (but a file metadata item is still written to the
archive, made from fresh file metadata read from the filesystem). This is
what makes borg so fast when processing unchanged files.
If there is a mismatch or a chunk is missing, the file is read / chunked /
hashed. Chunks already present in the repository are not transferred again.
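The check described above can be sketched in a few lines of Python. This is a minimal illustration, not borg's actual code: ``FileCacheEntry``, ``file_is_unchanged`` and the plain ``dict`` caches are hypothetical names chosen for this example.

```python
from collections import namedtuple

# Hypothetical entry layout mirroring the value fields of the files cache.
FileCacheEntry = namedtuple("FileCacheEntry", "inode size mtime_ns chunk_ids age")


def file_is_unchanged(files_cache, chunks_cache, path_hash, st, ignore_inode=False):
    """Return the cached chunk ids if the file can be skipped, else None.

    files_cache:  dict mapping path id_hash -> FileCacheEntry (assumed layout)
    chunks_cache: any container supporting `chunk_id in chunks_cache`
    st:           an os.stat_result-like object for the file
    """
    entry = files_cache.get(path_hash)
    if entry is None:
        return None  # never seen before: must read / chunk / hash
    if entry.size != st.st_size or entry.mtime_ns != st.st_mtime_ns:
        return None  # size or mtime changed
    if not ignore_inode and entry.inode != st.st_ino:
        return None  # same path, but a different file
    if not all(chunk_id in chunks_cache for chunk_id in entry.chunk_ids):
        return None  # at least one chunk is missing from the repository
    return entry.chunk_ids
```

A caller would pass the result of ``os.stat(path)`` as ``st``; a ``None`` return means the file has to be read and chunked again.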
The inode number is stored and compared to make sure we distinguish between
different files, as a single path may not be unique across different
archives in different setups.
Not all filesystems have stable inode numbers. If that is the case, borg can
be told to ignore the inode number in the check via ``--ignore-inode``.
The age value is used for cache management. If a file is "seen" in a backup
run, its age is reset to 0, otherwise its age is incremented by one.
If a file was not seen in BORG_FILES_CACHE_TTL backups, its cache entry is
removed. See also: :ref:`always_chunking` and :ref:`a_status_oddity`
The files cache is a Python dictionary storing Python objects, which
generates a lot of overhead.
Borg can also work without using the files cache (saves memory if you have a
lot of files or not much RAM free), then all files are assumed to have changed.
This is usually much slower than with the files cache.
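The age-based cache management can be illustrated as follows. This is a sketch under stated assumptions: ``age_files_cache`` is a hypothetical helper, entries are modelled as plain dicts with an ``age`` field, and ``ttl`` stands in for the value of BORG_FILES_CACHE_TTL. The exact boundary handling in borg may differ.

```python
def age_files_cache(files_cache, seen_keys, ttl):
    """Return a new cache with ages updated and expired entries dropped.

    files_cache: dict mapping key -> dict with at least an "age" field
    seen_keys:   set of keys that were seen during this backup run
    ttl:         maximum number of runs an unseen entry survives
    """
    aged = {}
    for key, entry in files_cache.items():
        # Seen files become "newest" again; unseen files grow one run older.
        age = 0 if key in seen_keys else entry["age"] + 1
        # Valid ages are 0 .. ttl - 1; anything older is dropped.
        if age < ttl:
            aged[key] = dict(entry, age=age)
    return aged
```

With ``ttl=3``, an entry that is unseen for three consecutive runs reaches age 3 and is removed, which matches the age range ``0 .. BORG_FILES_CACHE_TTL - 1`` given in the list of cached values.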
The **chunks cache** is stored in ``cache/chunks`` and is used to determine
whether we already have a specific chunk, to count references to it and also
for statistics.
The chunks cache is a key -> value mapping and contains:
* key:
  - chunk id_hash
* value:
  - reference count
  - size
  - encrypted/compressed size
The chunks cache is a hashindex, a hash table implemented in C and tuned for
memory efficiency.
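How such a mapping supports deduplication and reference counting can be sketched with a plain Python dict. This is only an illustration: the real chunks cache is the C hashindex mentioned above, and ``ChunkEntry``, ``add_chunk``, the use of plain SHA-256 as the id hash and of zlib for the stored size are all assumptions made for this example.

```python
import hashlib
import zlib
from collections import namedtuple

# Hypothetical entry layout mirroring the value fields of the chunks cache.
ChunkEntry = namedtuple("ChunkEntry", "refcount size csize")


def add_chunk(chunks_cache, data):
    """Register a chunk; return (chunk_id, is_new).

    If the chunk is already known, only its reference count is bumped and
    nothing would need to be transferred to the repository.
    """
    chunk_id = hashlib.sha256(data).digest()  # assumption: unkeyed id hash
    entry = chunks_cache.get(chunk_id)
    if entry is not None:
        chunks_cache[chunk_id] = entry._replace(refcount=entry.refcount + 1)
        return chunk_id, False
    csize = len(zlib.compress(data))  # stand-in for the stored size
    chunks_cache[chunk_id] = ChunkEntry(refcount=1, size=len(data), csize=csize)
    return chunk_id, True
```

Adding the same data twice yields one entry with a reference count of 2; the second call reports that the chunk is not new.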
The **repository index** is stored in ``repo/index.%d`` and is used to
determine a chunk's location in the repository.
The repo index is a key -> value mapping and contains:
* key:
  - chunk id_hash
* value:
  - segment (that contains the chunk)
  - offset (where the chunk is located in the segment)
The repo index is a hashindex, a hash table implemented in C and tuned for
memory efficiency.
The repository index file is random access.
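The role of the ``(segment, offset)`` pair can be shown with a toy in-memory model. Note the assumptions: segments are modelled as ``bytearray`` objects, and the length-prefixed entry layout is invented for this sketch; it is not borg's actual segment format. ``append_chunk`` and ``read_chunk`` are hypothetical helpers.

```python
import struct


def append_chunk(segments, index, segment, chunk_id, data):
    """Append data to a segment and record its location in the index."""
    buf = segments.setdefault(segment, bytearray())
    offset = len(buf)
    # Toy entry layout: 4-byte little-endian length, then the payload.
    buf += struct.pack("<I", len(data)) + data
    index[chunk_id] = (segment, offset)


def read_chunk(segments, index, chunk_id):
    """Look up a chunk's location in the index and read it back."""
    segment, offset = index[chunk_id]
    buf = segments[segment]
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return bytes(buf[start:start + length])
```

The index lookup jumps straight to the right segment and offset, so no scan over all segments is needed to fetch a chunk.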
Hints are stored in a file (``repo/hints.%d``).
It contains:
* version