this fixes a strange test failure that did not happen until now:
the test could not read the MAGIC bytes from a (quite new) segment file,
but just got an empty string back.
maybe its appearance is related to the removed I/O calls.
This saves some segment file random IO that was previously necessary
just to determine the size of the to-be-deleted data.
Keep the old one as NSIndex1 for compatibility with old borg.
Choose NSIndex or NSIndex1 based on the repo index layout read from the HashHeader.
For an old repo index, repo.get(key) returns (segment, offset, None, None).
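To illustrate the selection (a toy sketch; the layout values and class
internals are assumptions, not borg's actual code):

    class NSIndex1:
        # old layout: values store only (segment, offset);
        # repo.get() pads the missing fields with None, None
        layout = 1

    class NSIndex:
        # new layout: values store (segment, offset, ...)
        layout = 2

    def index_class_for(header_layout: int):
        # header_layout as read from the HashHeader of the repo index file
        return NSIndex1 if header_layout == NSIndex1.layout else NSIndex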
if a hardlink copy of a repo was made and a new repo config
is to be saved, do NOT fill in random garbage before deleting
the previous repo config, because that would damage the hardlink
copy.
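A hedged sketch of the idea, using st_nlink to detect hardlink copies
(secure_erase here is a simplified stand-in, not borg's helper):

    import os

    def secure_erase(path):
        # overwrite with random data, sync, then unlink (simplified)
        with open(path, 'r+b') as fd:
            fd.write(os.urandom(os.stat(path).st_size))
            fd.flush()
            os.fsync(fd.fileno())
        os.unlink(path)

    def save_config(path, new_data):
        if os.path.exists(path):
            if os.stat(path).st_nlink > 1:
                # hardlink copies exist: shredding the blocks would damage
                # them too, so just remove the directory entry
                os.unlink(path)
            else:
                secure_erase(path)
        with open(path, 'wb') as fd:
            fd.write(new_data)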
see the ticket and the borg.helpers.msgpack docstring.
this changeset implements the full migration to the
msgpack 2.0 spec (use_bin_type=True, raw=False).
compatibility with the past, where still needed, is done via the want_bytes decoder in borg.item.
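For reference, the spec difference in plain msgpack-python (independent
of borg's wrapper):

    import msgpack

    packed = msgpack.packb({'path': b'raw', 'comment': 'text'}, use_bin_type=True)
    # use_bin_type=True: bytes -> msgpack bin type, str -> msgpack (utf-8) str type
    assert msgpack.unpackb(packed, raw=False) == {'path': b'raw', 'comment': 'text'}
    # raw=False: str type decodes back to str, bin type stays bytes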
when migrating from repokey to keyfile, we just store an empty key into the repo config,
because we do not have a "delete key" RPC API. thus, an empty key means "there is no key".
here we fix load_key so that it does not behave differently for "no key" and "empty key":
in both cases, it just returns an empty value.
additionally, we strip the value we get from the config, so whitespace does not matter.
all callers now check that the repokey is not empty, otherwise RepoKeyNotFoundError
is raised.
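A minimal sketch of the fixed load_key behaviour (the config access is
illustrative, not borg's exact code):

    def load_key(config) -> str:
        # missing key and empty key are treated the same: both yield ''
        keydata = config.get('repository', 'key', fallback='')
        # strip, so surrounding whitespace in the config does not matter
        return keydata.strip()

    # callers then do:
    # if not load_key(config):
    #     raise RepoKeyNotFoundError(...)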
for now, this code shall only work on v2 repos (created by this code).
the code to read v1 repos is still present though, so for experiments,
it is possible to manually change the repo version in the repo config
from 1 to 2.
having version 2 in the repo config also prevents borg < 1.3 from being
used on such a repo, which would cause damage:
old borg would not recognize the PUT2-tagged segment entries, and
old borg check --repair would likely kill them all because of that.
also: keep the repo version in Repository.version.
note: this required a slight increase of MAX_OBJECT_SIZE so that MAX_DATA_SIZE
could stay the same as before.
For PUT2, compute the hash over the whole entry (header and content,
excluding the hash and crc32 fields, because the crc32 computation
includes the hash); see the sketch below.
Also: refactor the crc32 checks into a function, use f-strings, and
structure _read in a more logical, sequential order.
write_put: avoid creating a large temporary bytes object
why use xxh64?
- fast even without hw acceleration
- borg depends on it already anyway
- stronger than crc32 and strong enough for this purpose
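A hedged sketch of a complete PUT2 entry write, combining the points
above (field order and widths are illustrative; the xxhash package
stands in for borg's bundled xxh64):

    import struct, zlib
    import xxhash  # assumption: stand-in for borg's own xxh64 binding

    TAG_PUT2 = 3  # illustrative tag value

    def write_put2(fd, key: bytes, data: bytes) -> None:
        size = 4 + 4 + 1 + 8 + len(key) + len(data)  # crc32+size+tag+hash+key+data
        header = struct.pack('<IB', size, TAG_PUT2)
        h = xxhash.xxh64()                  # hash covers header + key + data,
        for piece in (header, key, data):   # excluding the crc32 and hash fields
            h.update(piece)
        digest = h.digest()
        crc = 0                             # crc32 covers header + hash + key + data
        for piece in (header, digest, key, data):
            crc = zlib.crc32(piece, crc)
        # write the pieces separately, avoiding one large temporary bytes object
        for piece in (struct.pack('<I', crc), header, digest, key, data):
            fd.write(piece)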
attic is borg's parent project, but it stalled in 2015 and has not been updated since.
we can assume that most attic users have meanwhile noticed this and already
converted their repos to borg.
if some did not yet, they are advised to use borg < 1.3 to do that ASAP.
note: borg can still DETECT an attic repo by recognizing its ATTIC_MAGIC value
and then gives exactly that advice.
Code gets simpler if we always use only the (shorter) header_fmt.
That format ALWAYS applies, to all tags borg writes.
If the tag unpacked from there indicates that there is also a chunkid
to read (as for PUT and DEL), we can decide that inside _read and
then read the chunkid from the fd.
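Sketched (format string and tag values are illustrative):

    import struct

    HEADER_FMT = '<IIB'    # crc32, size, tag - applies to ALL entries
    HEADER_SIZE = struct.calcsize(HEADER_FMT)
    TAG_PUT, TAG_DELETE, TAG_COMMIT = 0, 1, 2   # illustrative values

    def read_entry_header(fd):
        crc, size, tag = struct.unpack(HEADER_FMT, fd.read(HEADER_SIZE))
        # only now, based on the tag, decide whether a 32-byte chunkid follows
        key = fd.read(32) if tag in (TAG_PUT, TAG_DELETE) else None
        return crc, size, tag, key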
compact_segments produced separate 17-byte files for intermediate commits, although they were intended to be end-of-segment-file commits.
this is because, when the intermediate commit is triggered, we are already at an offset beyond the limit.
thus, we needed to add a no_new flag to indicate that we do not want a new segment file just for the commit if it is an intermediate commit.
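The gist of the flag, as a standalone sketch (function name illustrative;
no_new is the flag added here):

    def need_new_segment(offset: int, limit: int, no_new: bool) -> bool:
        # an intermediate commit must not trigger a segment rollover, even
        # though we are already past the size limit when it is written
        return offset > limit and not no_new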
storage_quota_use should reflect the current disk space usage (not considering some overhead, like for the index etc.).
if a chunk is deleted, but the segment file containing the chunk is not yet compacted, the chunk's disk space is still in use!
when compact_segments drops the unused chunks, that is the right time to reduce storage_quota_use.
storage_quota_use includes the PUT header overhead.
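A toy model of the accounting rule (the 41-byte overhead value is
illustrative: crc32 + size + tag + 32-byte chunkid):

    PUT_OVERHEAD = 41

    class Quota:
        def __init__(self):
            self.storage_quota_use = 0

        def put(self, data: bytes):
            self.storage_quota_use += len(data) + PUT_OVERHEAD

        def delete(self, data_len: int):
            pass  # NOT reduced here: the bytes still sit in the segment file

        def compact_drop(self, data_len: int):
            # the chunk's bytes actually leave the disk only at compaction
            self.storage_quota_use -= data_len + PUT_OVERHEAD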
This too should make the scan faster: assuming the data is
random, we can skip the CRC check for almost 94% of the incorrect
header locations based on the tag alone (only 16 of the 256 possible
tag byte values are reserved, so 240/256 = 93.75% of random candidates
are rejected without any CRC work); see the scan sketch below.
As a drawback, this limits the number of tags that can be
added without breaking backwards compatibility to 16, with
13 currently unused.
When an object is corrupted, the start position of the next object
will not be known, as the size field belonging to the corrupted
object may be corrupted as well. In order to find the next object
within the segment, the remainder is scanned byte-by-byte for the
next valid object. An object is considered valid if the CRC
checksum matches the content. However, the scan accepted
any object size that fit within the remainder of the segment. As a
result, in particular when the corruption occurred near the start
of a segment, CRC checksums were calculated for large objects,
often hundreds of megabytes in size, despite the size being limited
to 20 MiB. This change makes it so that the CRC calculation is skipped
when the object header indicates an impossible size, thereby
greatly reducing the number of CPU cycles spent on CRC calculations.
In my case, this brought the time for repair down from hours to mere
minutes.
This also has the additional benefit that there is some verification
in addition to the CRC checksum. A 4-byte checksum is rather
short considering the amount of data that might be in an archive.
Likely also fixes the hanging --repair in #5995.
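A sketch of the recovery scan with both cheap pre-checks, the tag test
and the size test described above (constants and header format are
illustrative):

    import struct, zlib

    HEADER_FMT = '<IIB'                   # crc32, size, tag
    HEADER_SIZE = struct.calcsize(HEADER_FMT)
    MAX_OBJECT_SIZE = 20 * 1024 * 1024    # the 20 MiB limit mentioned above
    MAX_TAG = 15                          # only 16 tag values are reserved

    def find_next_object(buf: bytes, start: int) -> int:
        """Scan byte-by-byte for the next valid object; return its offset or -1."""
        for pos in range(start, len(buf) - HEADER_SIZE + 1):
            crc, size, tag = struct.unpack_from(HEADER_FMT, buf, pos)
            # cheap rejects first: ~94% of random positions fail the tag test,
            # and impossible sizes are skipped without any CRC work
            if tag > MAX_TAG or not (HEADER_SIZE <= size <= MAX_OBJECT_SIZE):
                continue
            if pos + size > len(buf):
                continue
            if zlib.crc32(buf[pos + 4:pos + size]) == crc:
                return pos
        return -1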
A) the compaction code needs the shadow index only for this case:
segment A: PUT x, segment B: DEL x, with A < B (the DEL shadows the PUT).
B) for the following case, we have no shadowing DEL (or rather: it does not matter,
because there is a PUT right after the DEL) and x is in the repo index,
thus the shadow_index is not needed for the special case in the compaction code:
segment A: PUT x, segment B: DEL x, PUT x.
see also PR #5636.
reverts f079a83fed
and clarifies the code with more comments.
we keep the code deduplication of 5f32b5666a
and just add an update_shadow_index param so it does not look like
something was accidentally forgotten, which was the whole reason for the
reverted "fix".
The shadow_index should be in the same state after both of these sequences
(let's assume for simplicity that A is not in the repo yet, but it does not matter):
a) explicit delete: put(A), delete(A), put(A), resulting in repo contents: PUT A, DEL A, PUT A
b) implicit delete: put(A), put(A), resulting in repo contents: PUT A, DEL A, PUT A
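A toy model of the invariant, using plain dicts instead of borg's data
structures:

    shadow_index = {}   # chunk id -> segments holding shadowed (dead) PUTs
    index = {}          # chunk id -> segment of the current live PUT

    def put(key, segment):
        if key in index:
            # implicit delete: the old PUT becomes shadowed, exactly as if
            # an explicit delete had happened first
            shadow_index.setdefault(key, []).append(index[key])
        index[key] = segment

    def delete(key):
        shadow_index.setdefault(key, []).append(index.pop(key))

    # both sequences leave shadow_index in the same state:
    # a) put('A', 1); delete('A'); put('A', 2)
    # b) put('A', 1); put('A', 2)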
cleaner teardown of contexts:
close the mmap, close src_fd (reading), close dst_fd (and rename).
maybe it was not a real problem to rename a file that is still open for reading / mmapped,
but in any case it is cleaner this way.
We have long used the SaveFile context manager in other places.
By using it, the original segment file stays in place until its recovery
is completed (writing/syncing into *.tmp).
On successful completion, the .tmp file is renamed over the original, plus dir syncing.
If aborted by some exception, including Ctrl-C, the original file is unmodified.
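The usage pattern, roughly (SaveFile is in borg's platform module; exact
import path and signature may differ):

    from borg.platform import SaveFile  # assumption: import location

    segment_filename = 'data/0/5'       # illustrative path
    recovered_data = b'...'             # the recovered segment contents

    with SaveFile(segment_filename, binary=True) as fd:
        fd.write(recovered_data)        # written into segment_filename + '.tmp'
    # success: .tmp is synced, renamed over the original, and the dir synced
    # exception (incl. Ctrl-C): the original segment file stays unmodified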
in borg 1.1, compact_segments() was always run directly after some repo-writing
operation (in the same borg process). but now, only "borg compact" is used to
compact segments, and it is a separate borg invocation (a new process), so we
need to persist the shadow_index so we do not lose that information.
if the rebuilt index size matched the on-disk index size AND there
was a difference in e.g. one key, the old code only output the key/value
for one index, but not what is present in the other index.
we already had better code in the branch for differing index sizes,
so just use that for both cases.
additionally, we now also report when the index sizes match, since we
report when there is a mismatch.
at least it does not crash now when committing.
the question why the compact map points to a missing segment file
is not answered yet; there might be another problem...
if an old hints file got converted to the new format and it
had entries referring to non-existent segment files, a crash
occurred.
with this code, the crash is avoided and the erroneous hints
entry is removed.
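The fix, sketched (the hints structure and existence check are
illustrative):

    import os

    def prune_stale_hints(hints: dict, segment_path) -> None:
        # drop entries referring to segment files that no longer exist,
        # instead of crashing on them later
        for segment in list(hints):
            if not os.path.exists(segment_path(segment)):
                del hints[segment]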
support platforms with no os.link, fixes #4901
if we don't have os.link, we just extract another copy instead of making a hardlink.
for that to work, we need to have (and keep) the chunks list in hardlink_masters.
we create the hardlink to be able to securely erase the old config file.
if we can't do that because there is just a problem with hardlinks not
working, the old config will just be overwritten normally (not securely
erased). the user will get a warning in that case, but other than that,
the overall borg operation will succeed.
if there is a bigger problem (like a general lack of permissions or a
general issue with the underlying fs), subsequent operations will fail.
- Created a batch file to build borg on windows
- Adjusted setup.py to be runnable on windows and build the windows
extension
- Extracted the free space check to a function in the platform module
- Created the minimal needed (dummy) functions for the windows platform
module
if the repo config is not there, we definitely have an invalid repo.
for other problems (like permission issues), we'll just let it blow
up with a traceback, so the user can see what the precise problem is.
drop BORG_HOSTNAME_IS_UNIQUE (please use BORG_HOST_ID if needed).
borg now always assumes it has a unique host id - either derived
automatically from the fqdn plus uuid.getnode(), or overridden via BORG_HOST_ID.
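Per the description, the automatic host id amounts to something like
(the separator is illustrative):

    import os, socket, uuid

    host_id = os.environ.get('BORG_HOST_ID') or f'{socket.getfqdn()}@{uuid.getnode()}'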
before this, it over-eagerly compacted "small" segments ("small"
being < 100MB by default) even if there were only a few bytes to be freed.
also:
- improve debug logging
- as compaction is a separate borg command now, use the module logger
intended as a last-resort measure to export all segment file contents
in a relatively easy-to-use format.
for when you want to dig into a damaged repo (e.g. missing segment files,
missing commits) and you know what you are doing.
note: dump-repo-objs --ghost must not use repo.list(),
because that would need the repo index and call the get_transaction_id and
check_transaction methods, which can easily fail on a damaged repo.
thus we use the same low-level scan method that we use anyway to get
some encrypted piece of data to set up the decryption "key".
(cherry picked from commit 8738e85967)
wrap msgpack to avoid trouble from future upstream API changes,
and so that we do not have to globally spoil our code with extra params.
make sure packing always uses use_bin_type=False,
thus generating the "old" msgpack format (as borg always did) from
bytes objects.
make sure unpacking always uses raw=True,
thus generating bytes objects.
note:
safe unicode encoding/decoding for some kinds of data types is done in the
Item class (see item.pyx), so it is enough if we take care of bytes objects
at the msgpack level.
also wrap the exception handling, so borg code can catch msgpack-specific
exceptions even where the upstream msgpack code raises overly generic
exceptions typed Exception, TypeError or ValueError.
we use our own exception classes for this; the upstream classes are deprecated.
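Condensed, the wrapper idea looks roughly like this (the real code is in
borg.helpers.msgpack; the exception class name here is illustrative):

    import msgpack as mp

    class UnpackException(Exception):
        """borg's own exception type, so callers need not know upstream's."""

    def packb(obj, **kw):
        kw['use_bin_type'] = False      # always produce the "old" msgpack format
        return mp.packb(obj, **kw)

    def unpackb(data, **kw):
        kw['raw'] = True                # always produce bytes objects
        try:
            return mp.unpackb(data, **kw)
        except Exception as exc:        # upstream raises overly generic types
            raise UnpackException(str(exc)) from exc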
special-case deleting / writing the manifest: it goes into a separate, new
segment file, so that when we supersede and compact it later, less
segment data has to be shuffled around - compaction can then just
delete this segment file and that's all.
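Sketched (MANIFEST_ID is borg's all-zero manifest id; the io object and
method names are simplified assumptions):

    MANIFEST_ID = b'\0' * 32

    def put(io, id: bytes, data: bytes) -> None:
        if id == MANIFEST_ID:
            # close the current segment so the manifest gets a fresh one;
            # compaction can later simply delete that small file
            io.close_segment()
        io.write_put(id, data)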
C code and the repo index use the uint32 type for segment file offsets,
so when opening a repo and the configured max_segment_size is too big,
fail early.
Also disallow setting a too-big value via "borg config".
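The early check amounts to (the limit follows from the uint32 offsets;
the error type is illustrative):

    MAX_SEGMENT_SIZE_LIMIT = 2**32 - 1   # segment file offsets are uint32

    def check_max_segment_size(max_segment_size: int) -> None:
        if max_segment_size > MAX_SEGMENT_SIZE_LIMIT:
            raise ValueError(f'max_segment_size too big: {max_segment_size} '
                             f'> {MAX_SEGMENT_SIZE_LIMIT}')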
When opening a repository, always try to read the magic number of the
latest segment and compare it to the Attic segment magic (unless the
repository is opened for upgrading). If an Attic segment is detected,
raise a dedicated exception, telling the user to upgrade the repository
first.
Fixes #1933.
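The detection, sketched (ATTIC_MAGIC per attic's segment format; the
exception name is illustrative):

    ATTIC_MAGIC = b'ATTICSEG'   # attic segment file magic (8 bytes)

    class AtticRepository(Exception):
        """tells the user to run the repo upgrade first"""

    def check_not_attic(latest_segment_path: str) -> None:
        with open(latest_segment_path, 'rb') as fd:
            if fd.read(len(ATTIC_MAGIC)) == ATTIC_MAGIC:
                raise AtticRepository('attic repository detected, please '
                                      'upgrade the repository first')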