a file map can be:
- created internally inside chunkify by calling sparsemap, which uses
SEEK_DATA / SEEK_HOLE to determine data and hole ranges inside a
seekable sparse file.
Usage: borg create --sparse --chunker-params=fixed,BLOCKSIZE ...
BLOCKSIZE is the chunker blocksize here, not the filesystem blocksize!
- made by some other means and given to the chunkify function.
this is not used yet, but in the future this could be used to only read
the changed parts and seek over the (known) unchanged parts of a file.
sparsemap: the generated range sizes are multiples of the fs block size.
the tests assume a 4 kiB fs block size.
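a minimal sketch of the SEEK_DATA / SEEK_HOLE mapping idea (illustrative
only, not necessarily borg's exact implementation):

    import errno, os

    def sparsemap(fd):
        # yield (offset, length, is_data) ranges of a seekable sparse file
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                data = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:
                    yield pos, size - pos, False  # only a hole until EOF
                    return
                raise
            if data > pos:
                yield pos, data - pos, False      # hole range
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            yield data, hole - data, True         # data range
            pos = hole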
allow creating archives using the stdout of a given command
In addition to allowing:
some-command --param value | borg create REPO::ARCH -
also allow:
borg create --content-from-command REPO::ARCH -- some-command --param value
The difference is that the latter approach deals with errors properly.
In the former example, an archive is created no matter what: even if
`some-command` aborts and the output is truncated, Borg won't notice.
In the latter example, the status code is checked and archive creation
is aborted properly when appropriate.
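roughly, the idea in the latter case (store() and the command name are
made up here):

    import subprocess, sys

    proc = subprocess.Popen(['some-command', '--param', 'value'],
                            stdout=subprocess.PIPE)
    for chunk in iter(lambda: proc.stdout.read(65536), b''):
        store(chunk)                  # hypothetical: chunk and store the data
    if proc.wait() != 0:              # non-zero rc: abort archive creation
        sys.exit('command failed with rc %d' % proc.returncode)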
The locally defined preload() function shadows the preload boolean keyword
argument and always evaluates to true, so preloading is done even when not
requested by the caller, causing a memory leak.
Also move its definition outside of the loop.
This issue was found by Antonio Larrosa in borg issue #5202.
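a condensed illustration of the bug pattern (not the actual borg code):

    def fetch_many(ids, preload=False):
        def preload(chunks):     # BUG: shadows the boolean keyword argument
            ...
        if preload:              # a function object is always truthy,
            ...                  # so preloading always happens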
The code used for error reporting crashes due to an invalid utf-8
sequence. Use errors='replace' so it never crashes there. Errors
are expected in the input data when borg check is run.
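in Python terms, the fix boils down to:

    # bad bytes become U+FFFD instead of raising UnicodeDecodeError:
    text = b'\xff\xfe broken'.decode('utf-8', errors='replace')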
support platforms with no os.link, fixes #4901
if we don't have os.link, we just extract another copy instead of making a hardlink.
for that to work, we need to have (and keep) the chunks list in hardlink_masters.
if the file is not a regular file, but a hardlink slave whose hardlink
master was not extracted yet, chunks will be None and we must not call
preload(chunks).
(cherry picked from commit 291d58efa1)
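a rough sketch of the os.link fallback described above (helper names are
made up):

    import os

    def extract_hardlink(master_path, target_path, chunks):
        if hasattr(os, 'link'):
            os.link(master_path, target_path)   # normal case: make a hardlink
        else:
            # no os.link: extract another full copy from the chunks list
            # that we kept in hardlink_masters for exactly this purpose
            write_chunks(target_path, chunks)   # hypothetical helper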
On Windows, os.open does not work for directories.
If borg tries to open a directory on Windows, None is returned
as the file descriptor. The archive and archiver code were adjusted
to handle the case where a file descriptor is None.
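a sketch of the wrapper idea (illustrative, not the exact borg code):

    import os, stat

    def maybe_open(path, flags):
        # os.open() fails for directories on Windows, so hand back None
        # and let callers fall back to path-based operations
        if os.name == 'nt' and stat.S_ISDIR(os.stat(path).st_mode):
            return None
        return os.open(path, flags)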
on linux, acls are based on xattrs, so do these close together:
1. listxattr -> keys (without acl related keys)
2. for all keys: getxattr
3. acl-related getxattr by acl library
for fd-based operations, we would have to open the file, but for
char / block devices this has unwanted effects, even if we do not
read from the device.
thus, we use path (or dir_fd + name) based ops here.
races via changing path components can be avoided by opening the
parent directory and using parent_fd + file_name combination with
*at style functions to access the directories' contents.
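sketched with the stdlib os functions (example paths, Linux only):

    import os

    path = '/dev/null'                   # example: a char device
    parent_dir, file_name = '/dev', 'null'

    # 1. + 2.: list the keys, then get each value, without opening the device
    xattrs = {name: os.getxattr(path, name, follow_symlinks=False)
              for name in os.listxattr(path, follow_symlinks=False)}

    # race-resistant access: open the parent once, then resolve only the
    # last path component relative to it with *at style calls
    parent_fd = os.open(parent_dir, os.O_RDONLY)
    st = os.stat(file_name, dir_fd=parent_fd, follow_symlinks=False)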
wrap msgpack to avoid trouble from future upstream api changes and to
avoid globally spoiling our code with extra params.
make sure the packing is always with use_bin_type=False,
thus generating "old" msgpack format (as borg always did) from
bytes objects.
make sure the unpacking is always with raw=True,
thus generating bytes objects.
note:
safe unicode encoding/decoding for some kinds of data types is done in Item
class (see item.pyx), so it is enough if we care for bytes objects on the
msgpack level.
also wrap exception handling, so borg code can catch msgpack specific
exceptions even if the upstream msgpack code raises way too generic
exceptions typed Exception, TypeError or ValueError.
We use our own exception classes for this; the upstream classes are deprecated.
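the wrapper idea, roughly (exception class names here are illustrative):

    import msgpack

    class PackException(Exception): pass
    class UnpackException(Exception): pass

    def packb(obj, **kw):
        try:
            return msgpack.packb(obj, use_bin_type=False, **kw)
        except Exception as e:
            raise PackException(e)

    def unpackb(packed, **kw):
        try:
            return msgpack.unpackb(packed, raw=True, **kw)
        except Exception as e:
            raise UnpackException(e)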
the os level file handle is enough, the chunker will prefer it if
valid and won't use the file obj, so we can give None there.
this saves these unneeded syscalls:
fstat(5, {st_mode=S_IFREG|0664, st_size=227063, ...}) = 0
ioctl(5, TCGETS, 0x7ffd635635f0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(5, 0, SEEK_CUR) = 0
this optimization is only needed for linux; the bsd-like platforms
do not need an open file to run an ioctl against, but already have
bsdflags in the stat result.
on linux, this optimization saves 1 file open/close per input file.
when processing regular files, use a fd to query xattrs.
when the file was modified and we chunked it, we have it open anyway.
if not, we open the file once and then query the xattrs, in the hope
that this is more efficient than the path based calls.
it is presumably also less prone to race conditions.
only output messages if there is actually something to delete.
be more precise: show the count of orphaned / superseded objects.
(cherry picked from commit d671e9acf2)
reading the files cache can take a considerable amount of time (a user
reported 1h 42min for a 700 MB files cache for a repo with 8M files and
15 TB total), so we must init the checkpoint timer after that, or borg
will create the checkpoint too early.
creating a checkpoint means (among other things) saving the files cache,
which will also take a lot of time in such a case, one time too many.
doing this in a clean way required some refactoring:
- cache_mode is now given to the Cache initializer and stored in the instance
- the files cache is loaded early in _do_open (if needed)
warn about these problems detected while extracting:
- size inconsistencies
- file has all-zero replacement chunks
introduced a new BackupError exception. when raised while extracting
files, it gets handled by emitting a warning, setting rc=1 and
proceeding to the next file.
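the handling, sketched (archive_items / extract_item are hypothetical):

    import logging
    logger = logging.getLogger(__name__)

    class BackupError(Exception):
        # raised when the archived data for a file is inconsistent
        pass

    rc = 0
    for item in archive_items:
        try:
            extract_item(item)       # may raise BackupError
        except BackupError as e:
            logger.warning('%s: %s', item.path, e)
            rc = 1                   # warn, set rc=1, continue with next file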
the problem was that the upper layer code did not have enough information
about the file (whether it is known or not), and thus could not decide
correctly whether the status should be M)odified or A)dded.
now, the file_known_and_unchanged method returns an additional "known"
boolean to fix this.
also: add comment about files cache loading in cache_mode='r'
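a simplified sketch of how the caller can now decide (names approximate):

    def file_status(cache, path_hash, st):
        known, chunk_ids = cache.file_known_and_unchanged(path_hash, st)
        if not known:
            return 'A'          # not in the files cache: Added
        if chunk_ids is None:
            return 'M'          # known, but changed: Modified
        return 'U'              # known and unchanged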
the major problem was the ('path' in item) expression.
the dict has bytes-typed keys there, so the check never succeeded, as it
looked for a str key. this is a 1.1 regression; 1.0 was fine.
the dict -> StableDict change is just for being more specific,
the check triggered correctly as StableDict subclasses dict,
it was just a bit too general.
(cherry picked from commit e09892caec)
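the mismatch, in short:

    item = {b'path': b'/etc/hosts'}  # msgpack-decoded dicts have bytes keys
    'path' in item                   # False: str key never matches (the bug)
    b'path' in item                  # True: what the check must look for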
include item birthtime in archive, fixes #3272
* use `safe_ns` when reading birthtime into attributes
* proper order for `birthtime` in `ITEM_KEYS` list
* use `bigint` wrapper for consistency
* Add tests to verify that birthtime is normally preserved, but not preserved when `--nobirthtime` is passed to `borg create`.
do not read/archive bsdflags: borg create --nobsdflags ...
do not extract/set bsdflags: borg extract --nobsdflags ...
use cases:
- fs shows wrong / random bsdflags (bug in filesystem)
- fs does not support bsdflags anyway
- already archived bsdflags are wrong / unwanted
- borg shows any sort of unwanted effect due to get_flags, esp. on Linux
the nodump flag ("do not backup this file") is no longer honoured by
default, because this functionality (esp. if it happened by error or
unexpectedly) was rather confusing and hard to explain to users at first.
if you want that "do not backup NODUMP-flagged files" behaviour, use:
borg create --exclude-nodump ...
when doing in-file checkpointing, borg creates *.borg_part_N files.
complete_file = part_1 + part_2 + ... + part_N
the source item for recreate already has a precomputed (total) size
member, thus we must force recomputation from the (partial) chunks
list to correct the size to be the part's size only.
borg create avoided this problem by computing the size member after
writing all the parts; this is no longer required.
the bug is mostly cosmetic: borg check will complain, and borg extract
on a part file would also complain. but all the complaints only refer
to the wrong metadata of the part files; the part files' contents are
correct.
usually you will never extract or look at part files, but only deal
with the full file, which will be completely valid, both metadata and
content.
you can get rid of the archives with these cosmetic errors by running
borg recreate on them with a fixed borg version. the old part files
will get dropped (because they are usually ignored) and any new part
file created due to checkpointing will be correct.
You can now control the files cache mode using this option:
--files-cache={ctime,mtime,size,inode,rechunk,disabled}*
(only some combinations are supported)
Previously, only these modes were supported:
- mtime,size,inode (default of borg < 1.1.0rc4)
- mtime,size (by using --ignore-inode)
- disabled (by using --no-files-cache)
Now, you additionally get:
- ctime alternatively to mtime (safer), e.g.:
ctime,size,inode (this is the new default of borg >= 1.1.0rc4)
- rechunk (consider all files as changed, rechunk them)
Deprecated:
- --ignore-inode (use modes without "inode")
- --no-files-cache (use "disabled" mode)
The tests needed some changes:
- previously, we used os.utime() to set a file's mtime (atime) to specific
values, but that does not work for ctime.
- now we use time.sleep() to create the "latest file" that usually does
not end up in the files cache (see FAQ).
This factors out a lot of the logic in do_diff in archiver.py to Archive in
archive.py and a new class ItemDiff in item.pyx. The idea is to move methods
to the classes that are affected and to make it reusable, primarily for a new
option to fuse (#2475).
chunk_incref was called when dealing with part files without giving the
known chunk size in the size_ parameter.
adjusted LocalCache.chunk_incref to have the same signature.
lgtm:
Calling next() in a generator may cause unintended early termination of
an iteration.
It seems that lgtm did not detect the looser wrapping that we used
before.
This should allow us to make sure older borg versions can be cleanly
prevented from doing operations that are no longer safe because of
repository format evolution. This allows more fine-grained control than
just incrementing the manifest version. So, for example, a change that
still allows new archives to be created, but would corrupt the repository
when an old version tries to delete an archive or check the repository,
would add the new feature to the check and delete sets but leave it out
of the write set.
This is somewhat inspired by ext{2,3,4}, which use sets for
compat (everything except fsck), ro-compat (may only be accessed
read-only by older versions) and features (refuse all access).
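a sketch of the idea (names and feature strings are hypothetical):

    FEATURES = {
        'read':   frozenset(),
        'write':  frozenset(),
        'check':  frozenset({'some-new-format'}),
        'delete': frozenset({'some-new-format'}),
    }

    def assert_compatible(operation, supported):
        missing = FEATURES[operation] - supported
        if missing:
            raise SystemExit('repo requires unknown feature(s) %s for %s'
                             % (sorted(missing), operation))

an old borg knowing no features at all would pass for 'read' and 'write'
here, but refuse 'check' and 'delete'.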
the timestamps of the recreated archive (in the archive metadata and
also in the manifest) are now as they were for the original archive.
they are important metadata about the archive contents and should
therefore be kept "as is".
note: when using -v --stats, the timestamps shown there for recreate
are about the recreate start/end/duration.
just getting data from the repo can already raise IntegrityErrors
in LoggedIO, so we need to catch them also.
see also the code a few lines above where this is done in the same way.
This fixes the problem raised by issue #2314 by requiring that each root
subtree be fully traversed.
The problem occurs when a patterns file excludes a parent directory P later
in the file, but earlier in the file a subdirectory S of P is included.
Because a tree is processed recursively with a depth-first search, P is
processed before S is. Previously, if P was excluded, then S would not even
be considered. Now, it is possible to recurse into P nonetheless, while not
adding P (as a directory entry) to the archive.
With this commit, a `-` in a patterns-file will allow an excluded directory
to be searched for matching descendants. If the old behavior is desired, it
can be achieved by using a `!` in place of the `-`.
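for example, given a patterns file like this (paths made up):

    + home/user/photos/keep
    - home/user/photos

the `-` line no longer prevents recursing into home/user/photos, so
home/user/photos/keep is still found and archived, while home/user/photos
itself is not added; with `!` instead of `-`, the whole subtree is skipped
as before.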
The following is a list of specific changes made by this commit:
* renamed InclExclPattern named-tuple -> CmdTuple (with names 'val' and 'cmd'), since it is used more generally for commands, and not only for representing patterns.
* represent commands as IECommand enum types (RootPath, PatternStyle, Include, Exclude, ExcludeNoRecurse)
* archiver: Archiver.build_matcher() paths arg renamed -> include_paths to prevent confusion as to whether the list of paths are to be included or excluded.
* helpers: PatternMatcher has recurse_dir attribute that is used to communicate whether an excluded dir should be recursed (used by Archiver._process())
* archiver: Archiver.build_matcher() now only returns a PatternMatcher instance, and not an include_patterns list -- this list is now created and housed within the PatternMatcher instance, and can be accessed from there.
* moved operation of finding unmatched patterns from Archiver to PatternMatcher.get_unmatched_include_patterns()
* added / modified some documentation of code
* renamed _PATTERN_STYLES -> _PATTERN_CLASSES since "style" is ambiguous and this helps clarify that the set contains classes and not instances.
* have PatternBase subclass instances store whether excluded dirs are to be recursed. Because PatternBase objs are created corresponding to each +, -, ! command, it is necessary to differentiate - from ! within these objects.
* add test for '!' exclusion rule (which doesn't recurse)
Most code of the CM (context manager) is just moved 1:1 from the regular file block.
Use the CM for regular files, FIFOs and devices, but not for:
- directories (can not have hardlinks)
- symlinks (we can not support hardlinked symlinks)
- nlink > 1 for dirs does not mean hardlinking
(at least not everywhere; wondering how Apple does it)
- we can not archive hardlinked symlinks due to item.source dual-use,
see issue #2343.
likely nobody uses this anyway.
make_parent(path) helper to reduce code duplication.
also use it for directories although makedirs can also do it.
bugfix: also create parent dir for device files, if needed.
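the helper is roughly:

    import os

    def make_parent(path):
        # ensure the parent directory of path exists
        parent_dir = os.path.dirname(path)
        if not os.path.exists(parent_dir):
            os.makedirs(parent_dir)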
if a hardlink master is not in the to-be-extracted subset, the "x"
status was not displayed for it.
also, the matcher was called twice for matching items.
Before this changeset, async responses were:
- if not an error: ignored
- if an error: raised as response to the arbitrary/unrelated next command
Now, after sending async commands, the async_response command must be used
to process outstanding responses / exceptions.
We avoid piling up lots of outstanding stuff in cases of high latency: we do NOT
first wait until ALL responses have arrived, but can begin to process responses
right away.
Calls with wait=False will just return what we have already received.
Repeated calls with wait=True until None is returned will fetch all responses.
Async commands could now actually have non-exception non-None results, but
this is not used yet. None responses are still dropped.
The motivation for this is to have a clear separation between a request
blowing up because it (itself) failed and failures unrelated to that request /
to that line in the source code.
also: fix processing for async repo obj deletes
exception_ignored is a special object that is "not None" (as None is used to signal
"finished with processing async results") but is also not a potential async response result value.
Also:
added wait=True to chunk_decref() and add_chunk()
this makes async processing explicit: the default is synchronous, and you only
need to be careful and do extra steps for async processing if you explicitly
request async by calling with wait=False (usually for speed reasons).
to process async results, use async_response, see above.
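the calling pattern, simplified (repository / ids_to_delete stand in for
the real objects):

    for id in ids_to_delete:
        repository.delete(id, wait=False)      # send async delete
        repository.async_response(wait=False)  # process what already arrived
    # at the end, wait for and process all remaining responses:
    while repository.async_response(wait=True) is not None:
        pass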
the bug was compr_args.update(compr_spec) (helpers.py:2168): it mutated
the compression spec dict (and not just some local one, but the compression
spec dict parsed from the commandline args).
so a change that was intended for just 1 chunk changed the desired
compression level at the archive scope.
I refactored the code to use a namedtuple (which is immutable, so such
effects cannot happen again).
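the effect of the namedtuple, sketched (field names are illustrative):

    from collections import namedtuple

    CompressionSpec = namedtuple('CompressionSpec', 'name level')

    archive_spec = CompressionSpec('zlib', 6)
    chunk_spec = archive_spec._replace(name='none')  # builds a NEW tuple
    assert archive_spec.name == 'zlib'               # archive scope untouched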
if an item has a chunks list, pre-compute the total size and store it in the "size" metadata entry.
this speeds up access to the item size (e.g. for regular files) and could also be used to verify the validity of the chunks list.
note about hardlinks: size is only stored for hardlink masters (only they have their own chunks list)
See #1452
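a sketch, assuming 1.1-style chunks list entries of (id, size, csize):

    def precompute_size(item):    # hypothetical helper
        if 'chunks' in item:
            item['size'] = sum(size for _id, size, _csize in item['chunks'])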
This is 100 % accurate.
Also increases the maximum data size by ~41 bytes. Not 100 % side-effect free;
if you manage to land exactly in that area, then older Borg would not read
it. OTOH it gives us a nice round number there.
we do not trust the remote, so we are careful when unpacking its responses.
the remote could return manipulated msgpack data that announces e.g.
a huge array, map or string. the local side would then need to allocate
huge amounts of RAM in expectation of that data (no matter whether that
much data actually arrives or not).
by using limits in the Unpacker, a ValueError is raised if unexpectedly
large amounts of data are about to be unpacked. this avoids a memory DoS.
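for example (limit values here are made up, borg uses its own):

    import msgpack

    unpacker = msgpack.Unpacker(
        max_buffer_size=10 * 1024 * 1024,  # hard cap on buffered input
        max_array_len=10000,
        max_map_len=1000,
        max_str_len=10 * 1024 * 1024,
    )
    untrusted_bytes = b'\x91\xa3abc'       # e.g. data received from the remote
    unpacker.feed(untrusted_bytes)         # exceeding a limit raises ValueError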
Add --keep-exclude-tags option as alias to --keep-tag-files and
deprecate the latter. Also make tagging accept directories as tags,
allowing things like `--exclude-if-present .git`.
fixes #1999
This is some 15 times faster than @contextmanager, because no instance
creation is involved and no generator has to be maintained. The overall
difference is small, but still nice for a very simple change.
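a sketch of this kind of replacement (illustrative names, assuming the
reusable-instance pattern):

    class BackupOSError(Exception):          # stand-in for the real exception
        pass

    class _BackupIO:
        # one module-level instance is reused for every with block:
        # no generator and no per-use instance creation needed
        def __call__(self, op=''):
            self.op = op
            return self
        def __enter__(self):
            return self
        def __exit__(self, exc_type, exc_val, exc_tb):
            if exc_type is not None and issubclass(exc_type, OSError):
                raise BackupOSError(self.op, exc_val) from exc_val

    backup_io = _BackupIO()

    with backup_io('read'):
        ...                                  # I/O that may raise OSError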
we use "debug xxx" subcommands now. docs updated.
also makes "borg help" shorter as not all debug-xxx commands
show up, but just 1 main "debug" command.