mirror of https://github.com/borgbackup/borg.git
Merge branch 'master' into loggedio-exceptions
Conflicts: borg/repository.py
This commit is contained in: commit 08688fbc13
AUTHORS (3 changed lines)

@@ -1,6 +1,7 @@
Borg Developers / Contributors ("The Borg Collective")
``````````````````````````````````````````````````````
- Thomas Waldmann
- Thomas Waldmann <tw@waldmann-edv.de>
- Antoine Beaupré

Borg is a fork of Attic. Attic is written and maintained

CHANGES (120 changed lines)

@@ -1,56 +1,108 @@
Borg Changelog
==============

Version <TBD>
-------------

Version 0.24.0
--------------

New features:

- borg create --chunker-params ... to configure the chunker.
  See docs/misc/create_chunker-params.txt for more information.
- borg info now reports chunk counts in the chunk index.

Bug fixes:

- reduce memory usage, see --chunker-params, fixes #16.
  This can be used to reduce chunk management overhead, so borg does not create
  a huge chunks index/repo index and eats all your RAM if you back up lots of
  data in huge files (like VM disk images).
- better Exception msg if there is no Borg installed on the remote repo server.

Other changes:

- Fedora/Fedora-based install instructions added to docs.
- added docs/misc directory for misc. writeups that won't be included "as is"
  into the html docs.


I forgot to list some stuff already implemented in 0.23.0, here they are:

New features:

- efficient archive list from manifest, meaning a big speedup for slow
  repo connections and "list <repo>", "delete <repo>", "prune"
- big speedup for chunks cache sync (esp. for slow repo connections), fixes #18
- hashindex: improve error messages

Other changes:

- explicitly specify binary mode to open binary files
- some easy micro optimizations


Version 0.23.0
--------------

Incompatible changes (compared to attic, fork related):

- changed sw name and cli command to "borg", updated docs
- package name and name in urls uses "borgbackup" to have less collisions
- package name (and name in urls) uses "borgbackup" to have less collisions
- changed repo / cache internal magic strings from ATTIC* to BORG*,
  changed cache location to .cache/borg/
- give specific path to xattr.is_enabled(), disable symlink setattr call that
  always fails
- fix misleading hint the fuse ImportError handler gave, fixes attic #237
- source: misc. cleanups, pep8, style
- implement check --last N
- check: sort archives in reverse time order
  changed cache location to .cache/borg/ - this means that it currently won't
  accept attic repos (see issue #21 about improving that)

Bug fixes:

- avoid defect python-msgpack releases, fixes attic #171, fixes attic #185
- check unpacked data from RPC for tuple type and correct length, fixes attic #127
- less memory usage: add global option --no-cache-files
- fix traceback when trying to do unsupported passphrase change, fixes attic #189
- datetime does not like the year 10.000, fixes attic #139
- docs and faq improvements, fixes, updates
- cleanup crypto.pyx, make it easier to adapt to other modes
- extract: if --stdout is given, write all extracted binary data to stdout
- fix "info" all archives stats, fixes attic #183
- fix parsing with missing microseconds, fixes attic #282
- fix misleading hint the fuse ImportError handler gave, fixes attic #237
- check unpacked data from RPC for tuple type and correct length, fixes attic #127
- fix Repository._active_txn state when lock upgrade fails
- give specific path to xattr.is_enabled(), disable symlink setattr call that
  always fails
- fix test setup for 32bit platforms, partial fix for attic #196
- upgraded versioneer, PEP440 compliance, fixes attic #257

New features:

- less memory usage: add global option --no-cache-files
- check --last N (only check the last N archives)
- check: sort archives in reverse time order
- rename repo::oldname newname (rename repository)
- create -v output more informative
- create --progress (backup progress indicator)
- create --timestamp (utc string or reference file/dir)
- create: if "-" is given as path, read binary from stdin
- do os.fsync like recommended in the python docs
- extract: if --stdout is given, write all extracted binary data to stdout
- extract --sparse (simple sparse file support)
- extra debug information for 'fread failed'
- delete <repo> (deletes whole repo + local cache)
- FUSE: reflect deduplication in allocated blocks
- only allow whitelisted RPC calls in server mode
- normalize source/exclude paths before matching
- fix "info" all archives stats, fixes attic #183
- implement create --timestamp, utc string or reference file/dir
- simple sparse file support (extract --sparse)
- fix parsing with missing microseconds, fixes attic #282
- use posix_fadvise to not spoil the OS cache, fixes attic #252
- source: Let chunker optionally work with os-level file descriptor.
- source: Linux: remove duplicate os.fsencode calls
- fix test setup for 32bit platforms, partial fix for attic #196
- source: refactor _open_rb code a bit, so it is more consistent / regular
- implement rename repo::oldname newname
- implement create --progress
- source: refactor indicator (status) and item processing
- implement delete repo (also deletes local cache)
- better create -v output
- upgraded versioneer, PEP440 compliance, fixes attic #257
- source: use py.test for better testing, flake8 for code style checks
- source: fix tox >=2.0 compatibility
- toplevel error handler: show tracebacks for better error analysis
- sigusr1 / sigint handler to print current file infos - attic PR #286
- pypi package: add python version classifiers, add FreeBSD to platforms
- fix Repository._active_txn state when lock upgrade fails
- RPCError: include the exception args we get from remote

Other changes:

- source: misc. cleanups, pep8, style
- docs and faq improvements, fixes, updates
- cleanup crypto.pyx, make it easier to adapt to other AES modes
- do os.fsync like recommended in the python docs
- source: Let chunker optionally work with os-level file descriptor.
- source: Linux: remove duplicate os.fsencode calls
- source: refactor _open_rb code a bit, so it is more consistent / regular
- source: refactor indicator (status) and item processing
- source: use py.test for better testing, flake8 for code style checks
- source: fix tox >=2.0 compatibility (test runner)
- pypi package: add python version classifiers, add FreeBSD to platforms


Attic Changelog
===============

@@ -1,4 +1,4 @@
include README.rst LICENSE CHANGES MANIFEST.in versioneer.py
include README.rst AUTHORS LICENSE CHANGES MANIFEST.in versioneer.py
recursive-include borg *.pyx
recursive-include docs *
recursive-exclude docs *.pyc

README.rst (10 changed lines)

@@ -10,8 +10,12 @@ are stored.
Borg is a fork of Attic and maintained by "The Borg Collective" (see AUTHORS file).

BORG IS NOT COMPATIBLE WITH ORIGINAL ATTIC.
UNTIL FURTHER NOTICE, EXPECT THAT WE WILL BREAK COMPATIBILITY REPEATEDLY.
THIS IS SOFTWARE IN DEVELOPMENT, DECIDE YOURSELF IF IT FITS YOUR NEEDS.
EXPECT THAT WE WILL BREAK COMPATIBILITY REPEATEDLY WHEN MAJOR RELEASE NUMBER
CHANGES (like when going from 0.x.y to 1.0.0). Please read CHANGES document.

NOT RELEASED DEVELOPMENT VERSIONS HAVE UNKNOWN COMPATIBILITY PROPERTIES.

THIS IS SOFTWARE IN DEVELOPMENT, DECIDE YOURSELF WHETHER IT FITS YOUR NEEDS.

Read issue #1 on the issue tracker, goals are being defined there.

@@ -66,7 +70,7 @@ Where are the tests?
The tests are in the borg/testsuite package. To run the test suite use the
following command::

  $ fakeroot -u tox  # you need to have tox installed
  $ fakeroot -u tox  # you need to have tox and pytest installed

.. |build| image:: https://travis-ci.org/borgbackup/borg.svg
        :alt: Build Status

@@ -18,8 +18,11 @@
#error Unknown byte order
#endif

#define MAGIC "BORG_IDX"
#define MAGIC_LEN 8

typedef struct {
    char magic[8];
    char magic[MAGIC_LEN];
    int32_t num_entries;
    int32_t num_buckets;
    int8_t key_size;

@@ -37,7 +40,6 @@ typedef struct {
    int upper_limit;
} HashIndex;

#define MAGIC "BORG_IDX"
#define EMPTY _htole32(0xffffffff)
#define DELETED _htole32(0xfffffffe)
#define MAX_BUCKET_SIZE 512

@@ -162,7 +164,7 @@ hashindex_read(const char *path)
        EPRINTF_PATH(path, "fseek failed");
        goto fail;
    }
    if(memcmp(header.magic, MAGIC, 8)) {
    if(memcmp(header.magic, MAGIC, MAGIC_LEN)) {
        EPRINTF_MSG_PATH(path, "Unknown MAGIC in header");
        goto fail;
    }

@@ -359,14 +361,18 @@ hashindex_get_size(HashIndex *index)
}

static void
hashindex_summarize(HashIndex *index, long long *total_size, long long *total_csize, long long *total_unique_size, long long *total_unique_csize)
hashindex_summarize(HashIndex *index, long long *total_size, long long *total_csize,
                    long long *total_unique_size, long long *total_unique_csize,
                    long long *total_unique_chunks, long long *total_chunks)
{
    int64_t size = 0, csize = 0, unique_size = 0, unique_csize = 0;
    int64_t size = 0, csize = 0, unique_size = 0, unique_csize = 0, chunks = 0, unique_chunks = 0;
    const int32_t *values;
    void *key = NULL;

    while((key = hashindex_next_key(index, key))) {
        values = key + 32;
        values = key + index->key_size;
        unique_chunks++;
        chunks += values[0];
        unique_size += values[1];
        unique_csize += values[2];
        size += values[0] * values[1];

@@ -376,4 +382,6 @@ hashindex_summarize(HashIndex *index, long long *total_size, long long *total_cs
    *total_csize = csize;
    *total_unique_size = unique_size;
    *total_unique_csize = unique_csize;
    *total_unique_chunks = unique_chunks;
    *total_chunks = chunks;
}

@@ -21,12 +21,14 @@ from .helpers import parse_timestamp, Error, uid2user, user2uid, gid2group, grou
    Manifest, Statistics, decode_dict, st_mtime_ns, make_path_safe, StableDict, int_to_bigint, bigint_to_int

ITEMS_BUFFER = 1024 * 1024
CHUNK_MIN = 1024
CHUNK_MAX = 10 * 1024 * 1024
WINDOW_SIZE = 0xfff
CHUNK_MASK = 0xffff

ZEROS = b'\0' * CHUNK_MAX
CHUNK_MIN_EXP = 10  # 2**10 == 1kiB
CHUNK_MAX_EXP = 23  # 2**23 == 8MiB
HASH_WINDOW_SIZE = 0xfff  # 4095B
HASH_MASK_BITS = 16  # results in ~64kiB chunks statistically

# defaults, use --chunker-params to override
CHUNKER_PARAMS = (CHUNK_MIN_EXP, CHUNK_MAX_EXP, HASH_MASK_BITS, HASH_WINDOW_SIZE)

utime_supports_fd = os.utime in getattr(os, 'supports_fd', {})
utime_supports_follow_symlinks = os.utime in getattr(os, 'supports_follow_symlinks', {})

@@ -69,12 +71,12 @@ class DownloadPipeline:
class ChunkBuffer:
    BUFFER_SIZE = 1 * 1024 * 1024

    def __init__(self, key):
    def __init__(self, key, chunker_params=CHUNKER_PARAMS):
        self.buffer = BytesIO()
        self.packer = msgpack.Packer(unicode_errors='surrogateescape')
        self.chunks = []
        self.key = key
        self.chunker = Chunker(WINDOW_SIZE, CHUNK_MASK, CHUNK_MIN, CHUNK_MAX, self.key.chunk_seed)
        self.chunker = Chunker(self.key.chunk_seed, *chunker_params)

    def add(self, item):
        self.buffer.write(self.packer.pack(StableDict(item)))

@@ -104,8 +106,8 @@ class ChunkBuffer:

class CacheChunkBuffer(ChunkBuffer):

    def __init__(self, cache, key, stats):
        super(CacheChunkBuffer, self).__init__(key)
    def __init__(self, cache, key, stats, chunker_params=CHUNKER_PARAMS):
        super(CacheChunkBuffer, self).__init__(key, chunker_params)
        self.cache = cache
        self.stats = stats

@@ -127,7 +129,8 @@ class Archive:

    def __init__(self, repository, key, manifest, name, cache=None, create=False,
                 checkpoint_interval=300, numeric_owner=False, progress=False):
                 checkpoint_interval=300, numeric_owner=False, progress=False,
                 chunker_params=CHUNKER_PARAMS):
        self.cwd = os.getcwd()
        self.key = key
        self.repository = repository

@@ -142,8 +145,8 @@ class Archive:
        self.numeric_owner = numeric_owner
        self.pipeline = DownloadPipeline(self.repository, self.key)
        if create:
            self.items_buffer = CacheChunkBuffer(self.cache, self.key, self.stats)
            self.chunker = Chunker(WINDOW_SIZE, CHUNK_MASK, CHUNK_MIN, CHUNK_MAX, self.key.chunk_seed)
            self.items_buffer = CacheChunkBuffer(self.cache, self.key, self.stats, chunker_params)
            self.chunker = Chunker(self.key.chunk_seed, *chunker_params)
            if name in manifest.archives:
                raise self.AlreadyExists(name)
            self.last_checkpoint = time.time()

@@ -158,6 +161,7 @@ class Archive:
                raise self.DoesNotExist(name)
            info = self.manifest.archives[name]
            self.load(info[b'id'])
        self.zeros = b'\0' * (1 << chunker_params[1])

    def _load_meta(self, id):
        data = self.key.decrypt(id, self.repository.get(id))

@@ -286,7 +290,7 @@ class Archive:
            with open(path, 'wb') as fd:
                ids = [c[0] for c in item[b'chunks']]
                for data in self.pipeline.fetch_many(ids, is_preloaded=True):
                    if sparse and ZEROS.startswith(data):
                    if sparse and self.zeros.startswith(data):
                        # all-zero chunk: create a hole in a sparse file
                        fd.seek(len(data), 1)
                    else:

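The last hunk above replaces the fixed ZEROS constant with a per-archive self.zeros buffer
sized to 1 << chunker_params[1] (the configured maximum chunk size), so the all-zero test
keeps working for non-default --chunker-params. A minimal standalone sketch of that
sparse-extraction idea (function and variable names here are illustrative, not borg API):

    import os

    def write_with_holes(fd, chunks, chunk_max_exp=23):
        # zeros buffer as large as the largest possible chunk (2**chunk_max_exp bytes)
        zeros = b'\0' * (1 << chunk_max_exp)
        for data in chunks:
            if zeros.startswith(data):
                # all-zero chunk: seek forward instead of writing, leaving a hole
                fd.seek(len(data), os.SEEK_CUR)
            else:
                fd.write(data)
        fd.truncate()  # give a trailing hole its full length

    with open('sparse-demo', 'wb') as fd:
        write_with_holes(fd, [b'\0' * 4096, b'payload', b'\0' * 4096])
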
@@ -13,7 +13,7 @@ import textwrap
import traceback

from . import __version__
from .archive import Archive, ArchiveChecker
from .archive import Archive, ArchiveChecker, CHUNKER_PARAMS
from .repository import Repository
from .cache import Cache
from .key import key_creator

@@ -21,7 +21,7 @@ from .helpers import Error, location_validator, format_time, format_file_size, \
    format_file_mode, ExcludePattern, exclude_path, adjust_patterns, to_localtime, timestamp, \
    get_cache_dir, get_keys_dir, format_timedelta, prune_within, prune_split, \
    Manifest, remove_surrogates, update_excludes, format_archive, check_extension_modules, Statistics, \
    is_cachedir, bigint_to_int
    is_cachedir, bigint_to_int, ChunkerParams
from .remote import RepositoryServer, RemoteRepository

@@ -104,7 +104,8 @@ Type "Yes I am sure" if you understand this and want to continue.\n""")
        cache = Cache(repository, key, manifest, do_files=args.cache_files)
        archive = Archive(repository, key, manifest, args.archive.archive, cache=cache,
                          create=True, checkpoint_interval=args.checkpoint_interval,
                          numeric_owner=args.numeric_owner, progress=args.progress)
                          numeric_owner=args.numeric_owner, progress=args.progress,
                          chunker_params=args.chunker_params)
        # Add cache dir to inode_skip list
        skip_inodes = set()
        try:

@@ -515,8 +516,12 @@ Type "Yes I am sure" if you understand this and want to continue.\n""")
        parser = argparse.ArgumentParser(description='Borg %s - Deduplicated Backups' % __version__)
        subparsers = parser.add_subparsers(title='Available commands')

        serve_epilog = textwrap.dedent("""
        This command starts a repository server process. This command is usually not used manually.
        """)
        subparser = subparsers.add_parser('serve', parents=[common_parser],
                                          description=self.do_serve.__doc__)
                                          description=self.do_serve.__doc__, epilog=serve_epilog,
                                          formatter_class=argparse.RawDescriptionHelpFormatter)
        subparser.set_defaults(func=self.do_serve)
        subparser.add_argument('--restrict-to-path', dest='restrict_to_paths', action='append',
                               metavar='PATH', help='restrict repository access to PATH')

@@ -621,6 +626,10 @@ Type "Yes I am sure" if you understand this and want to continue.\n""")
                               metavar='yyyy-mm-ddThh:mm:ss',
                               help='manually specify the archive creation date/time (UTC). '
                                    'alternatively, give a reference file/directory.')
        subparser.add_argument('--chunker-params', dest='chunker_params',
                               type=ChunkerParams, default=CHUNKER_PARAMS,
                               metavar='CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE',
                               help='specify the chunker parameters. default: %d,%d,%d,%d' % CHUNKER_PARAMS)
        subparser.add_argument('archive', metavar='ARCHIVE',
                               type=location_validator(archive=True),
                               help='archive to create')

@@ -20,8 +20,11 @@ cdef extern from "_chunker.c":
cdef class Chunker:
    cdef _Chunker *chunker

    def __cinit__(self, window_size, chunk_mask, min_size, max_size, seed):
        self.chunker = chunker_init(window_size, chunk_mask, min_size, max_size, seed & 0xffffffff)
    def __cinit__(self, seed, chunk_min_exp, chunk_max_exp, hash_mask_bits, hash_window_size):
        min_size = 1 << chunk_min_exp
        max_size = 1 << chunk_max_exp
        hash_mask = (1 << hash_mask_bits) - 1
        self.chunker = chunker_init(hash_window_size, hash_mask, min_size, max_size, seed & 0xffffffff)

    def chunkify(self, fd, fh=-1):
        """

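The new __cinit__ takes the exponent-style parameters and derives the byte sizes and the
bit mask itself. The same derivation in plain Python, evaluated for the defaults
10,23,16,4095 (a sketch for illustration only, not borg code):

    def derive_chunker_args(chunk_min_exp, chunk_max_exp, hash_mask_bits, hash_window_size):
        min_size = 1 << chunk_min_exp          # 2**10 = 1024 bytes minimum chunk size
        max_size = 1 << chunk_max_exp          # 2**23 = 8 MiB maximum chunk size
        hash_mask = (1 << hash_mask_bits) - 1  # 0xffff: cut when the low 16 hash bits are zero
        return hash_window_size, hash_mask, min_size, max_size

    print(derive_chunker_args(10, 23, 16, 4095))  # (4095, 65535, 1024, 8388608)
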
@@ -11,7 +11,9 @@ cdef extern from "_hashindex.c":
    HashIndex *hashindex_read(char *path)
    HashIndex *hashindex_init(int capacity, int key_size, int value_size)
    void hashindex_free(HashIndex *index)
    void hashindex_summarize(HashIndex *index, long long *total_size, long long *total_csize, long long *unique_size, long long *unique_csize)
    void hashindex_summarize(HashIndex *index, long long *total_size, long long *total_csize,
                             long long *unique_size, long long *unique_csize,
                             long long *total_unique_chunks, long long *total_chunks)
    int hashindex_get_size(HashIndex *index)
    int hashindex_write(HashIndex *index, char *path)
    void *hashindex_get(HashIndex *index, void *key)

@@ -179,9 +181,11 @@ cdef class ChunkIndex(IndexBase):
        return iter

    def summarize(self):
        cdef long long total_size, total_csize, unique_size, unique_csize
        hashindex_summarize(self.index, &total_size, &total_csize, &unique_size, &unique_csize)
        return total_size, total_csize, unique_size, unique_csize
        cdef long long total_size, total_csize, unique_size, unique_csize, total_unique_chunks, total_chunks
        hashindex_summarize(self.index, &total_size, &total_csize,
                            &unique_size, &unique_csize,
                            &total_unique_chunks, &total_chunks)
        return total_size, total_csize, unique_size, unique_csize, total_unique_chunks, total_chunks


cdef class ChunkKeyIterator:

@@ -174,11 +174,14 @@ class Statistics:
        self.usize += csize

    def print_(self, label, cache):
        total_size, total_csize, unique_size, unique_csize = cache.chunks.summarize()
        total_size, total_csize, unique_size, unique_csize, total_unique_chunks, total_chunks = cache.chunks.summarize()
        print()
        print('                       Original size      Compressed size    Deduplicated size')
        print('%-15s %20s %20s %20s' % (label, format_file_size(self.osize), format_file_size(self.csize), format_file_size(self.usize)))
        print('All archives:   %20s %20s %20s' % (format_file_size(total_size), format_file_size(total_csize), format_file_size(unique_csize)))
        print()
        print('                       Unique chunks         Total chunks')
        print('Chunk index:    %20d %20d' % (total_unique_chunks, total_chunks))

    def show_progress(self, item=None, final=False):
        if not final:

@@ -310,6 +313,11 @@ def timestamp(s):
        raise ValueError


def ChunkerParams(s):
    window_size, chunk_mask, chunk_min, chunk_max = s.split(',')
    return int(window_size), int(chunk_mask), int(chunk_min), int(chunk_max)


def is_cachedir(path):
    """Determines whether the specified path is a cache directory (and
    therefore should potentially be excluded from the backup) according to

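ChunkerParams is used as the argparse type for --chunker-params (see the archiver.py hunks
above), so the comma-separated string becomes a tuple of ints that is handed to
Chunker(seed, *chunker_params). Despite the local variable names left over from the old
parameter style, the four values are CHUNK_MIN_EXP, CHUNK_MAX_EXP, HASH_MASK_BITS and
HASH_WINDOW_SIZE. A quick illustration, assuming the function as defined above:

    >>> ChunkerParams('19,23,21,4095')
    (19, 23, 21, 4095)
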
@@ -141,7 +141,10 @@ class RemoteRepository:
        self.r_fds = [self.stdout_fd]
        self.x_fds = [self.stdin_fd, self.stdout_fd]

        version = self.call('negotiate', 1)
        try:
            version = self.call('negotiate', 1)
        except ConnectionClosed:
            raise Exception('Server immediately closed connection - is Borg installed and working on the server?')
        if version != 1:
            raise Exception('Server insisted on using unsupported protocol version %d' % version)
        self.id = self.call('open', location.path, create)

@@ -14,6 +14,7 @@ from .lrucache import LRUCache

MAX_OBJECT_SIZE = 20 * 1024 * 1024
MAGIC = b'BORG_SEG'
MAGIC_LEN = len(MAGIC)
TAG_PUT = 0
TAG_DELETE = 1
TAG_COMMIT = 2

@@ -481,7 +482,7 @@ class LoggedIO:
            os.mkdir(dirname)
        self._write_fd = open(self.segment_filename(self.segment), 'ab')
        self._write_fd.write(MAGIC)
        self.offset = 8
        self.offset = MAGIC_LEN
        return self._write_fd

    def get_fd(self, segment):

@@ -504,9 +505,9 @@ class LoggedIO:
    def iter_objects(self, segment, include_data=False):
        fd = self.get_fd(segment)
        fd.seek(0)
        if fd.read(8) != MAGIC:
        if fd.read(MAGIC_LEN) != MAGIC:
            raise IntegrityError('Invalid segment magic')
        offset = 8
        offset = MAGIC_LEN
        header = fd.read(self.header_fmt.size)
        while header:
            try:

@@ -12,7 +12,7 @@ import unittest
from hashlib import sha256

from .. import xattr
from ..archive import Archive, ChunkBuffer, CHUNK_MAX
from ..archive import Archive, ChunkBuffer, CHUNK_MAX_EXP
from ..archiver import Archiver
from ..cache import Cache
from ..crypto import bytes_to_long, num_aes_blocks

@@ -213,7 +213,7 @@ class ArchiverTestCase(ArchiverTestCaseBase):
        sparse_support = sys.platform != 'darwin'
        filename = os.path.join(self.input_path, 'sparse')
        content = b'foobar'
        hole_size = 5 * CHUNK_MAX  # 5 full chunker buffers
        hole_size = 5 * (1 << CHUNK_MAX_EXP)  # 5 full chunker buffers
        with open(filename, 'wb') as fd:
            # create a file that has a hole at the beginning and end (if the
            # OS and filesystem supports sparse files)

@@ -1,27 +1,27 @@
from io import BytesIO

from ..chunker import Chunker, buzhash, buzhash_update
from ..archive import CHUNK_MAX
from ..archive import CHUNK_MAX_EXP
from . import BaseTestCase


class ChunkerTestCase(BaseTestCase):

    def test_chunkify(self):
        data = b'0' * int(1.5 * CHUNK_MAX) + b'Y'
        parts = [bytes(c) for c in Chunker(2, 0x3, 2, CHUNK_MAX, 0).chunkify(BytesIO(data))]
        data = b'0' * int(1.5 * (1 << CHUNK_MAX_EXP)) + b'Y'
        parts = [bytes(c) for c in Chunker(0, 1, CHUNK_MAX_EXP, 2, 2).chunkify(BytesIO(data))]
        self.assert_equal(len(parts), 2)
        self.assert_equal(b''.join(parts), data)
        self.assert_equal([bytes(c) for c in Chunker(2, 0x3, 2, CHUNK_MAX, 0).chunkify(BytesIO(b''))], [])
        self.assert_equal([bytes(c) for c in Chunker(2, 0x3, 2, CHUNK_MAX, 0).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'fooba', b'rboobaz', b'fooba', b'rboobaz', b'fooba', b'rboobaz'])
        self.assert_equal([bytes(c) for c in Chunker(2, 0x3, 2, CHUNK_MAX, 1).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'fo', b'obarb', b'oob', b'azf', b'oobarb', b'oob', b'azf', b'oobarb', b'oobaz'])
        self.assert_equal([bytes(c) for c in Chunker(2, 0x3, 2, CHUNK_MAX, 2).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foob', b'ar', b'boobazfoob', b'ar', b'boobazfoob', b'ar', b'boobaz'])
        self.assert_equal([bytes(c) for c in Chunker(3, 0x3, 3, CHUNK_MAX, 0).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foobarboobaz' * 3])
        self.assert_equal([bytes(c) for c in Chunker(3, 0x3, 3, CHUNK_MAX, 1).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foobar', b'boo', b'bazfo', b'obar', b'boo', b'bazfo', b'obar', b'boobaz'])
        self.assert_equal([bytes(c) for c in Chunker(3, 0x3, 3, CHUNK_MAX, 2).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foo', b'barboobaz', b'foo', b'barboobaz', b'foo', b'barboobaz'])
        self.assert_equal([bytes(c) for c in Chunker(3, 0x3, 4, CHUNK_MAX, 0).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foobarboobaz' * 3])
        self.assert_equal([bytes(c) for c in Chunker(3, 0x3, 4, CHUNK_MAX, 1).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foobar', b'boobazfo', b'obar', b'boobazfo', b'obar', b'boobaz'])
        self.assert_equal([bytes(c) for c in Chunker(3, 0x3, 4, CHUNK_MAX, 2).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foob', b'arboobaz', b'foob', b'arboobaz', b'foob', b'arboobaz'])
        self.assert_equal([bytes(c) for c in Chunker(0, 1, CHUNK_MAX_EXP, 2, 2).chunkify(BytesIO(b''))], [])
        self.assert_equal([bytes(c) for c in Chunker(0, 1, CHUNK_MAX_EXP, 2, 2).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'fooba', b'rboobaz', b'fooba', b'rboobaz', b'fooba', b'rboobaz'])
        self.assert_equal([bytes(c) for c in Chunker(1, 1, CHUNK_MAX_EXP, 2, 2).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'fo', b'obarb', b'oob', b'azf', b'oobarb', b'oob', b'azf', b'oobarb', b'oobaz'])
        self.assert_equal([bytes(c) for c in Chunker(2, 1, CHUNK_MAX_EXP, 2, 2).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foob', b'ar', b'boobazfoob', b'ar', b'boobazfoob', b'ar', b'boobaz'])
        self.assert_equal([bytes(c) for c in Chunker(0, 2, CHUNK_MAX_EXP, 2, 3).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foobarboobaz' * 3])
        self.assert_equal([bytes(c) for c in Chunker(1, 2, CHUNK_MAX_EXP, 2, 3).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foobar', b'boobazfo', b'obar', b'boobazfo', b'obar', b'boobaz'])
        self.assert_equal([bytes(c) for c in Chunker(2, 2, CHUNK_MAX_EXP, 2, 3).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foob', b'arboobaz', b'foob', b'arboobaz', b'foob', b'arboobaz'])
        self.assert_equal([bytes(c) for c in Chunker(0, 3, CHUNK_MAX_EXP, 2, 3).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foobarboobaz' * 3])
        self.assert_equal([bytes(c) for c in Chunker(1, 3, CHUNK_MAX_EXP, 2, 3).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foobarbo', b'obazfoobar', b'boobazfo', b'obarboobaz'])
        self.assert_equal([bytes(c) for c in Chunker(2, 3, CHUNK_MAX_EXP, 2, 3).chunkify(BytesIO(b'foobarboobaz' * 3))], [b'foobarboobaz', b'foobarboobaz', b'foobarboobaz'])

    def test_buzhash(self):
        self.assert_equal(buzhash(b'abcdefghijklmnop', 0), 3795437769)

@@ -161,8 +161,8 @@ p.admonition-title:after {
}

div.note {
    background-color: #0f5;
    border-bottom: 2px solid #d22;
    background-color: #002211;
    border-bottom: 2px solid #22dd22;
}

div.seealso {

@@ -51,7 +51,7 @@ Which file types, attributes, etc. are *not* preserved?
  recreate them in any case). So, don't panic if your backup misses a UDS!
* The precise on-disk representation of the holes in a sparse file.
  Archive creation has no special support for sparse files, holes are
  backed up up as (deduplicated and compressed) runs of zero bytes.
  backed up as (deduplicated and compressed) runs of zero bytes.
  Archive extraction has optional support to extract all-zero chunks as
  holes in a sparse file.

@@ -62,21 +62,60 @@ Some of the steps detailed below might be useful also for non-git installs.
  # optional: for unit testing
  apt-get install fakeroot

  # install virtualenv tool, create and activate a virtual env
  apt-get install python-virtualenv
  virtualenv --python=python3 borg-env
  source borg-env/bin/activate   # always do this before using!

  # install some dependencies into virtual env
  pip install cython  # to compile .pyx -> .c
  pip install tox pytest  # optional, for running unit tests
  pip install sphinx  # optional, to build the docs

  # get |project_name| from github, install it
  git clone |git_url|

  apt-get install python-virtualenv
  virtualenv --python=python3 borg-env
  source borg-env/bin/activate   # always before using!

  # install borg + dependencies into virtualenv
  pip install cython  # compile .pyx -> .c
  pip install tox pytest  # optional, for running unit tests
  pip install sphinx  # optional, to build the docs
  cd borg
  pip install -e .  # in-place editable mode

  # optional: run all the tests, on all supported Python versions
  fakeroot -u tox


Korora / Fedora 21 installation (from git)
------------------------------------------
Note: this uses latest, unreleased development code from git.
While we try not to break master, there are no guarantees on anything.

Some of the steps detailed below might be useful also for non-git installs.

.. parsed-literal::
  # Python 3.x (>= 3.2) + Headers, Py Package Installer
  sudo dnf install python3 python3-devel python3-pip

  # we need OpenSSL + Headers for Crypto
  sudo dnf install openssl-devel openssl

  # ACL support Headers + Library
  sudo dnf install libacl-devel libacl

  # optional: lowlevel FUSE py binding - to mount backup archives
  sudo dnf install python3-llfuse fuse

  # optional: for unit testing
  sudo dnf install fakeroot

  # get |project_name| from github, install it
  git clone |git_url|

  dnf install python3-virtualenv
  virtualenv --python=python3 borg-env
  source borg-env/bin/activate   # always before using!

  # install borg + dependencies into virtualenv
  pip install cython  # compile .pyx -> .c
  pip install tox pytest  # optional, for running unit tests
  pip install sphinx  # optional, to build the docs
  cd borg
  pip install -e .  # in-place editable mode

  # optional: run all the tests, on all supported Python versions
  fakeroot -u tox

@@ -6,38 +6,43 @@ Internals

This page documents the internal data structures and storage
mechanisms of |project_name|. It is partly based on `mailing list
discussion about internals`_ and also on static code analysis. It may
not be exactly up to date with the current source code.
discussion about internals`_ and also on static code analysis.

It may not be exactly up to date with the current source code.

Repository and Archives
-----------------------

|project_name| stores its data in a `Repository`. Each repository can
hold multiple `Archives`, which represent individual backups that
contain a full archive of the files specified when the backup was
performed. Deduplication is performed across multiple backups, both on
data and metadata, using `Segments` chunked with the Buzhash_
algorithm. Each repository has the following file structure:
data and metadata, using `Chunks` created by the chunker using the Buzhash_
algorithm.

Each repository has the following file structure:

README
  simple text file describing the repository
  simple text file telling that this is a |project_name| repository

config
  description of the repository, includes the unique identifier. also
  acts as a lock file
  repository configuration and lock file

data/
  directory where the actual data (`segments`) is stored
  directory where the actual data is stored

hints.%d
  undocumented
  hints for repository compaction

index.%d
  cache of the file indexes. those files can be regenerated with
  ``check --repair``
  repository index


Config file
-----------

Each repository has a ``config`` file which is an ``INI``
formatted file which looks like this::
Each repository has a ``config`` file which is an ``INI``-style file
and looks like this::

  [repository]
  version = 1

@@ -48,20 +53,35 @@ formatted file which looks like this::
This is where the ``repository.id`` is stored. It is a unique
identifier for repositories. It will not change if you move the
repository around so you can make a local transfer then decide to move
the repository in another (even remote) location at a later time.
the repository to another (even remote) location at a later time.

|project_name| will do a POSIX read lock on that file when operating
|project_name| will do a POSIX read lock on the config file when operating
on the repository.


Keys
----
The key to address the key/value store is usually computed like this:

  key = id = id_hash(unencrypted_data)

The id_hash function is:

* sha256 (no encryption keys available)
* hmac-sha256 (encryption keys available)


Segments and archives
---------------------

|project_name| is a "filesystem based transactional key value
store". It makes extensive use of msgpack_ to store data and, unless
A |project_name| repository is a filesystem based transactional key/value
store. It makes extensive use of msgpack_ to store data and, unless
otherwise noted, data is stored in msgpack_ encoded files.

Objects referenced by a key (256bits id/hash) are stored inline in
files (`segments`) of size approx 5MB in ``repo/data``. They contain:
Objects referenced by a key are stored inline in files (`segments`) of approx.
5MB size in numbered subdirectories of ``repo/data``.

They contain:

* header size
* crc

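A minimal sketch of the id_hash computation described in the Keys section above, using the
Python standard library (key handling is simplified here; borg's key classes wrap this):

    import hashlib, hmac

    def id_hash(data, id_key=None):
        # sha256 of the plaintext when no encryption keys are available,
        # hmac-sha256 keyed with id_key otherwise
        if id_key is None:
            return hashlib.sha256(data).digest()
        return hmac.new(id_key, data, hashlib.sha256).digest()

    key = id_hash(b'unencrypted chunk data')  # 32 byte id, used as the repository key
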
@@ -77,21 +97,26 @@ Tag is either ``PUT``, ``DELETE``, or ``COMMIT``. A segment file is
basically a transaction log where each repository operation is
appended to the file. So if an object is written to the repository a
``PUT`` tag is written to the file followed by the object id and
data. And if an object is deleted a ``DELETE`` tag is appended
data. If an object is deleted a ``DELETE`` tag is appended
followed by the object id. A ``COMMIT`` tag is written when a
repository transaction is committed. When a repository is opened any
``PUT`` or ``DELETE`` operations not followed by a ``COMMIT`` tag are
discarded since they are part of a partial/uncommitted transaction.

The manifest is an object with an id of only zeros (32 bytes), that
references all the archives. It contains:

The manifest
------------

The manifest is an object with an all-zero key that references all the
archives.
It contains:

* version
* list of archives
* list of archive infos
* timestamp
* config

Each archive contains:
Each archive info contains:

* name
* id

@@ -102,21 +127,21 @@ each time.

The archive metadata does not contain the file items directly. Only
references to other objects that contain that data. An archive is an
object that contain metadata:
object that contains:

* version
* name
* items list
* list of chunks containing item metadata
* cmdline
* hostname
* username
* time

Each item represents a file or directory or
symlink is stored as an ``item`` dictionary that contains:
Each item represents a file, directory or other fs item and is stored as an
``item`` dictionary that contains:

* path
* list of chunks
* list of data chunks
* user
* group
* uid

@@ -135,124 +160,136 @@ it and it is reset every time an inode's metadata is changed.
All items are serialized using msgpack and the resulting byte stream
is fed into the same chunker used for regular file data and turned
into deduplicated chunks. The reference to these chunks is then added
to the archive metadata. This allows the archive to store many files,
beyond the ``MAX_OBJECT_SIZE`` barrier of 20MB.
to the archive metadata.

A chunk is an object as well, of course. The chunk id is either
HMAC-SHA256_, when encryption is used, or a SHA256_ hash otherwise.
A chunk is stored as an object as well, of course.

Hints are stored in a file (``repo/hints``) and contain:

* version
* list of segments
* compact

Chunks
------

|project_name| uses a rolling checksum with Buzhash_ algorithm, with
window size of 4095 bytes (`0xFFF`), with a minimum of 1024, and triggers when
the last 16 bits of the checksum are null, producing chunks of 64kB on
average. All these parameters are fixed. The buzhash table is altered
by XORing it with a seed randomly generated once for the archive, and
stored encrypted in the keyfile.
|project_name| uses a rolling hash computed by the Buzhash_ algorithm, with a
window size of 4095 bytes (`0xFFF`), with a minimum chunk size of 1024 bytes.
It triggers (chunks) when the last 16 bits of the hash are zero, producing
chunks of 64kiB on average.

Indexes
-------
The buzhash table is altered by XORing it with a seed randomly generated once
for the archive, and stored encrypted in the keyfile.

There are two main indexes: the chunk lookup index and the repository
index. There is also the file chunk cache.

The chunk lookup index is stored in ``cache/chunk`` and is indexed on
the ``chunk hash``. It contains:
Indexes / Caches
----------------

* reference count
* size
* ciphered size

The repository index is stored in ``repo/index.%d`` and is also
indexed on ``chunk hash`` and contains:

* segment
* offset

The repository index files are random access but those files can be
recreated if damaged or lost using ``check --repair``.

Both indexes are stored as hash tables, directly mapped in memory from
the file content, with only one slot per bucket, but that spreads the
collisions to the following buckets. As a consequence the hash is just
a start position for a linear search, and if the element is not in the
table the index is linearly crossed until an empty bucket is
found. When the table is full at 90% its size is doubled, when it's
empty at 25% its size is halved. So operations on it have a variable
complexity between constant and linear with low factor, and memory
overhead varies between 10% and 300%.

The file chunk cache is stored in ``cache/files`` and is indexed on
the ``file path hash`` and contains:
The files cache is stored in ``cache/files`` and is indexed on the
``file path hash``. At backup time, it is used to quickly determine whether we
need to chunk a given file (or whether it is unchanged and we already have all
its pieces).
It contains:

* age
* inode number
* size
* mtime_ns
* chunks hashes
* file inode number
* file size
* file mtime_ns
* file content chunk hashes

The inode number is stored to make sure we distinguish between
different files, as a single path may not be unique across different
archives in different setups.

The file chunk cache is stored as a python associative array storing
python objects, which generates a lot of overhead. This takes around
240 bytes per file without the chunk list, to be compared to at most
64 bytes of real data (depending on data alignment), and around 80
bytes per chunk hash (vs 32), with a minimum of ~250 bytes even if
only one chunk hash.
The files cache is stored as a python associative array storing
python objects, which generates a lot of overhead.

Indexes memory usage
--------------------
The chunks cache is stored in ``cache/chunks`` and is indexed on the
``chunk id_hash``. It is used to determine whether we already have a specific
chunk, to count references to it and also for statistics.
It contains:

Here is the estimated memory usage of |project_name| when using those
indexes.
* reference count
* size
* encrypted/compressed size

Repository index
  40 bytes x N ~ 200MB (If a remote repository is
  used this will be allocated on the remote side)
The repository index is stored in ``repo/index.%d`` and is indexed on the
``chunk id_hash``. It is used to determine a chunk's location in the repository.
It contains:

Chunk lookup index
  44 bytes x N ~ 220MB
* segment (that contains the chunk)
* offset (where the chunk is located in the segment)

File chunk cache
  probably 80-100 bytes x N ~ 400MB
The repository index file is random access.

Hints are stored in a file (``repo/hints.%d``).
It contains:

* version
* list of segments
* compact

hints and index can be recreated if damaged or lost using ``check --repair``.

The chunks cache and the repository index are stored as hash tables, with
only one slot per bucket, but that spreads the collisions to the following
buckets. As a consequence the hash is just a start position for a linear
search, and if the element is not in the table the index is linearly crossed
until an empty bucket is found.

When the hash table is almost full at 90%, its size is doubled. When it's
almost empty at 25%, its size is halved. So operations on it have a variable
complexity between constant and linear with low factor, and memory overhead
varies between 10% and 300%.


Indexes / Caches memory usage
-----------------------------

Here is the estimated memory usage of |project_name|:

  chunk_count ~= total_file_size / 65536

  repo_index_usage = chunk_count * 40

  chunks_cache_usage = chunk_count * 44

  files_cache_usage = total_file_count * 240 + chunk_count * 80

  mem_usage ~= repo_index_usage + chunks_cache_usage + files_cache_usage
             = total_file_count * 240 + total_file_size / 400

All units are Bytes.

It is assuming every chunk is referenced exactly once and that typical chunk size is 64kiB.

If a remote repository is used the repo index will be allocated on the remote side.

E.g. backing up a total count of 1Mi files with a total size of 1TiB:

  mem_usage = 1 * 2**20 * 240 + 1 * 2**40 / 400 = 2.8GiB

Note: there is a commandline option to switch off the files cache. You'll save
some memory, but it will need to read / chunk all the files then.

In the above we assume 350GB of data that we divide on an average 64KB
chunk size, so N is around 5.3 million.

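The estimate above is easy to re-run for other data sets; a small sketch that evaluates the
same formula for the 1 Mi files / 1 TiB example (all values in bytes, same assumptions as
stated above):

    def estimated_mem_usage(total_file_count, total_file_size):
        chunk_count = total_file_size / 65536      # assumes ~64 kiB chunks, each referenced once
        repo_index_usage = chunk_count * 40
        chunks_cache_usage = chunk_count * 44
        files_cache_usage = total_file_count * 240 + chunk_count * 80
        return repo_index_usage + chunks_cache_usage + files_cache_usage

    print('%.1f GiB' % (estimated_mem_usage(2**20, 2**40) / 2**30))  # ~2.8 GiB
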
Encryption
----------

AES_ is used with CTR mode of operation (so no need for padding). A 64
bits initialization vector is used, a `HMAC-SHA256`_ is computed
on the encrypted chunk with a random 64 bits nonce and both are stored
in the chunk. The header of each chunk is : ``TYPE(1)`` +
``HMAC(32)`` + ``NONCE(8)`` + ``CIPHERTEXT``. Encryption and HMAC use
two different keys.
AES_ is used in CTR mode (so no need for padding). A 64bit initialization
vector is used, a `HMAC-SHA256`_ is computed on the encrypted chunk with a
random 64bit nonce and both are stored in the chunk.
The header of each chunk is : ``TYPE(1)`` + ``HMAC(32)`` + ``NONCE(8)`` + ``CIPHERTEXT``.
Encryption and HMAC use two different keys.

In AES CTR mode you can think of the IV as the start value for the
counter. The counter itself is incremented by one after each 16 byte
block. The IV/counter is not required to be random but it must NEVER be
reused. So to accomplish this |project_name| initializes the encryption counter
to be higher than any previously used counter value before encrypting
new data.
In AES CTR mode you can think of the IV as the start value for the counter.
The counter itself is incremented by one after each 16 byte block.
The IV/counter is not required to be random but it must NEVER be reused.
So to accomplish this |project_name| initializes the encryption counter to be
higher than any previously used counter value before encrypting new data.

To reduce payload size only 8 bytes of the 16 bytes nonce is saved in
the payload, the first 8 bytes are always zeroes. This does not affect
security but limits the maximum repository capacity to only 295
exabytes (2**64 * 16 bytes).
To reduce payload size, only 8 bytes of the 16 bytes nonce is saved in the
payload, the first 8 bytes are always zeros. This does not affect security but
limits the maximum repository capacity to only 295 exabytes (2**64 * 16 bytes).

Encryption keys are either a passphrase, passed through the
``BORG_PASSPHRASE`` environment or prompted on the commandline, or
stored in automatically generated key files.
Encryption keys are either derived from a passphrase or kept in a key file.
The passphrase is passed through the ``BORG_PASSPHRASE`` environment variable
or prompted for interactive usage.

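Given the layout above (``TYPE(1)`` + ``HMAC(32)`` + ``NONCE(8)`` + ``CIPHERTEXT``),
splitting a stored chunk into its parts is plain byte slicing; a sketch for illustration
(field names are made up here, MAC verification and decryption are omitted):

    def split_encrypted_chunk(blob):
        type_byte = blob[0:1]    # TYPE(1)
        mac = blob[1:33]         # HMAC(32), HMAC-SHA256 over the encrypted data
        nonce = blob[33:41]      # NONCE(8), low 8 bytes of the 16 byte AES-CTR IV/counter
        ciphertext = blob[41:]   # CIPHERTEXT
        return type_byte, mac, nonce, ciphertext
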
Key files
---------

@@ -274,22 +311,20 @@ enc_key
  the key used to encrypt data with AES (256 bits)

enc_hmac_key
  the key used to HMAC the resulting AES-encrypted data (256 bits)
  the key used to HMAC the encrypted data (256 bits)

id_key
  the key used to HMAC the above chunks, the resulting hash is
  stored out of band (256 bits)
  the key used to HMAC the plaintext chunk data to compute the chunk's id

chunk_seed
  the seed for the buzhash chunking table (signed 32 bit integer)

Those fields are processed using msgpack_. The utf-8 encoded passphrase
is encrypted with PBKDF2_ and SHA256_ using 100000 iterations and a
random 256 bits salt to give us a derived key. The derived key is 256
bits long. A `HMAC-SHA256`_ checksum of the above fields is generated
with the derived key, then the derived key is also used to encrypt the
above pack of fields. Then the result is stored in another msgpack_
formatted as follows:
Those fields are processed using msgpack_. The utf-8 encoded passphrase
is processed with PBKDF2_ (SHA256_, 100000 iterations, random 256 bit salt)
to give us a derived key. The derived key is 256 bits long.
A `HMAC-SHA256`_ checksum of the above fields is generated with the derived
key, then the derived key is also used to encrypt the above pack of fields.
Then the result is stored in another msgpack_ formatted as follows:

version
  currently always an integer, 1

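The key derivation step described above can be reproduced with the standard library
(parameter values taken from the text; the surrounding HMAC and encryption of the field
pack are omitted):

    import hashlib, os

    salt = os.urandom(32)                       # random 256 bit salt
    passphrase = 'a passphrase'.encode('utf-8')
    derived_key = hashlib.pbkdf2_hmac('sha256', passphrase, salt, 100000)
    assert len(derived_key) == 32               # 256 bit derived key
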
@@ -315,3 +350,9 @@ The resulting msgpack_ is then encoded using base64 and written to the
key file, wrapped using the standard ``textwrap`` module with a header.
The header is a single line with a MAGIC string, a space and a hexadecimal
representation of the repository id.


Compression
-----------

Currently, zlib level 6 is used as compression.

@@ -0,0 +1,116 @@
About borg create --chunker-params
==================================

--chunker-params CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE

CHUNK_MIN_EXP and CHUNK_MAX_EXP give the exponent N of the 2^N minimum and
maximum chunk size. Required: CHUNK_MIN_EXP < CHUNK_MAX_EXP.

Defaults: 10 (2^10 == 1KiB) minimum, 23 (2^23 == 8MiB) maximum.

HASH_MASK_BITS is the number of least-significant bits of the rolling hash
that need to be zero to trigger a chunk cut.
Recommended: CHUNK_MIN_EXP + X <= HASH_MASK_BITS <= CHUNK_MAX_EXP - X, X >= 2
(this allows the rolling hash some freedom to make its cut at a place
determined by the window's contents rather than the min/max chunk size).

Default: 16 (statistically, chunks will be about 2^16 == 64kiB in size)

HASH_WINDOW_SIZE: the size of the window used for the rolling hash computation.
Default: 4095B

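To get a feeling for how HASH_MASK_BITS drives chunk count and index RAM, here is a rough
estimate using the ~44 bytes per chunk index entry figure from docs/internals.rst (it
ignores min/max chunk size clamping, so treat the numbers as ballpark only):

    def estimate_chunks_and_index_ram(total_data_size, hash_mask_bits):
        avg_chunk_size = 1 << hash_mask_bits      # statistical average chunk size
        chunk_count = total_data_size / avg_chunk_size
        index_ram = chunk_count * 44              # ~44 bytes per chunk index entry
        return chunk_count, index_ram

    for bits in (16, 20, 21):
        chunks, ram = estimate_chunks_and_index_ram(37e9, bits)   # ~37 GB, as in the experiment below
        print('%2d mask bits: ~%9d chunks, ~%6.1f MB chunk index' % (bits, chunks, ram / 1e6))
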
Trying it out
=============

I backed up a VM directory to demonstrate how different chunker parameters
influence repo size, index size / chunk count, compression, deduplication.

repo-sm: ~64kiB chunks (16 bits chunk mask), min chunk size 1kiB (2^10B)
         (these are attic / borg 0.23 internal defaults)

repo-lg: ~1MiB chunks (20 bits chunk mask), min chunk size 64kiB (2^16B)

repo-xl: 8MiB chunks (2^23B max chunk size), min chunk size 64kiB (2^16B).
         The chunk mask bits were set to 31, so it (almost) never triggers.
         This degrades the rolling hash based dedup to a fixed-offset dedup
         as the cutting point is now (almost) always the end of the buffer
         (at 2^23B == 8MiB).

The repo index size is an indicator for the RAM needs of Borg.
In this special case, the total RAM needs are about 2.1x the repo index size.
You see index size of repo-sm is 16x larger than of repo-lg, which corresponds
to the ratio of the different target chunk sizes.

Note: RAM needs were not a problem in this specific case (37GB data size).
But just imagine, you have 37TB of such data and much less than 42GB RAM,
then you'd definitely want the "lg" chunker params so you only need
2.6GB RAM. Or even bigger chunks than shown for "lg" (see "xl").

You also see compression works better for larger chunks, as expected.
Deduplication works worse for larger chunks, also as expected.

small chunks
============

$ borg info /extra/repo-sm::1

Command line: /home/tw/w/borg-env/bin/borg create --chunker-params 10,23,16,4095 /extra/repo-sm::1 /home/tw/win
Number of files: 3

                       Original size      Compressed size    Deduplicated size
This archive:               37.12 GB             14.81 GB             12.18 GB
All archives:               37.12 GB             14.81 GB             12.18 GB

                       Unique chunks         Total chunks
Chunk index:                  378374               487316

$ ls -l /extra/repo-sm/index*

-rw-rw-r-- 1 tw tw 20971538 Jun 20 23:39 index.2308

$ du -sk /extra/repo-sm
11930840        /extra/repo-sm

large chunks
============

$ borg info /extra/repo-lg::1

Command line: /home/tw/w/borg-env/bin/borg create --chunker-params 16,23,20,4095 /extra/repo-lg::1 /home/tw/win
Number of files: 3

                       Original size      Compressed size    Deduplicated size
This archive:               37.10 GB             14.60 GB             13.38 GB
All archives:               37.10 GB             14.60 GB             13.38 GB

                       Unique chunks         Total chunks
Chunk index:                   25889                29349

$ ls -l /extra/repo-lg/index*

-rw-rw-r-- 1 tw tw 1310738 Jun 20 23:10 index.2264

$ du -sk /extra/repo-lg
13073928        /extra/repo-lg

xl chunks
=========

(borg-env)tw@tux:~/w/borg$ borg info /extra/repo-xl::1
Command line: /home/tw/w/borg-env/bin/borg create --chunker-params 16,23,31,4095 /extra/repo-xl::1 /home/tw/win
Number of files: 3

                       Original size      Compressed size    Deduplicated size
This archive:               37.10 GB             14.59 GB             14.59 GB
All archives:               37.10 GB             14.59 GB             14.59 GB

                       Unique chunks         Total chunks
Chunk index:                    4319                 4434

$ ls -l /extra/repo-xl/index*
-rw-rw-r-- 1 tw tw 327698 Jun 21 00:52 index.2011

$ du -sk /extra/repo-xl/
14253464        /extra/repo-xl/

@@ -50,6 +50,9 @@ Examples
  NAME="root-`date +%Y-%m-%d`"
  $ borg create /mnt/backup::$NAME / --do-not-cross-mountpoints

  # Backup huge files with little chunk management overhead
  $ borg create --chunker-params 19,23,21,4095 /mnt/backup::VMs /srv/VMs


.. include:: usage/extract.rst.inc