also add a test: recreate without --chunker-params shall not rechunk
before the fix, rechunking was triggered if an archive
was created with non-default chunker params.
but it should only rechunk if borg recreate is invoked with an explicitly given --chunker-params=....
test hashtable expansion/rebuild.
hashindex_lookup:
- return -2 for a compact / completely full hashtable
- return -1 and (via start_idx pointer) the deleted/tombstone bucket index.
fix size assertion (we add 1 element to trigger rebuild)
fix upper_limit check - since we'll be adding 1 to num_entries below,
the condition should be >=.
hashindex_compact: set min_empty/upper_limit
Co-authored-by: Dan Christensen <jdc+github@uwo.ca>
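A minimal Python sketch of the hashindex_lookup convention described above (the real implementation is C code in _hashindex.c; the sentinels and the list-based table here are illustrative stand-ins, not borg's actual bucket layout):

    # illustrative sketch only, not the real C implementation
    EMPTY, TOMBSTONE = object(), object()

    def hashindex_lookup(buckets, key, start_idx):
        first_tombstone = None
        idx = hash(key) % len(buckets)
        for _ in range(len(buckets)):
            entry = buckets[idx]
            if entry is EMPTY:
                if first_tombstone is not None:
                    start_idx[0] = first_tombstone  # caller may reuse the tombstone bucket
                return -1  # key not found
            if entry is TOMBSTONE:
                if first_tombstone is None:
                    first_tombstone = idx
            elif entry == key:
                return idx  # key found at this bucket
            idx = (idx + 1) % len(buckets)
        return -2  # probed all buckets: table is compact / completely full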
we now just treat that one .borg_part file we might have inside
checkpoint archives as a normal file.
people can recognize via the file name that it is a partial file.
nobody cares about statistics of checkpoint files, and the final
archive does not contain any partial files any more, thus there is
no need to maintain statistics about the count and size of part
files.
checkpoint archives might have a single, incomplete part file as the last item.
part files are always a prefix of the full file, growing in size from
checkpoint to checkpoint.
we now manage the archive items metadata stream in a special way:
- checkpoint archive A(n) might end with a partial item PI(n)
- checkpoint archive A(n+1) does not contain PI(n)
- checkpoint archive A(n+1) contains a new partial item PI(n+1)
- the final archive does not contain any partial items
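A rough sketch of that rule (hypothetical names, not borg's actual internals):

    def items_for_checkpoint(complete_items, partial_item):
        # A(n) = all completely processed items so far, plus at most one
        # partial item PI(n); A(n+1) is built from complete_items only
        # (PI(n) is dropped) plus a new PI(n+1).
        return complete_items + ([partial_item] if partial_item is not None else [])

    # the final archive is written with partial_item=None,
    # so it contains no partial items at all.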
without this, checkpoint archives ended up with orphaned item_ptrs chunks.
also:
- borg check: show id of orphaned chunks
- borg check: archive list with explicit consider_checkpoints=True (this is the default, but better to make sure).
check --archives: add --newer/--older/--newest/--oldest, fixes #7062
Options accept a timespan, like Nd for N days or Nm for N months.
Use these to do date-based matching on archives and only check some of them,
like: borg check --archives --newer=1m --newest=7d
Author: Michael Deyaso <mdeyaso@fusioniq.io>
Same change for .recreate_cmdline -> .recreate_command_line.
JSON output key "command_line":
borg 1.x: sys.argv [list of str]
borg 2: shlex.join(sys.argv) [str]
binary bytes:
- json_key = <key>_b64
- json_value == base64(value)
text (potentially with surrogate escapes):
- json_key1 = <key>
- json_value1 = value_text (s-e replaced by ?)
- json_key2 = <key>_b64
- json_value2 = base64(value_binary)
json_key2/_value2 is only present if value_text required
replacement of surrogate escapes (and thus does not represent
the original value, but just an approximation).
value_binary then gives the original bytes value (e.g. a
non-utf8 bytes sequence).
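A minimal sketch of this convention (hypothetical helper; borg's real JSON code may differ in details):

    import base64

    def text_to_json(key, value):
        # value: str, potentially containing surrogate escapes from
        # undecodable filesystem bytes
        res = {}
        try:
            value.encode("utf-8", errors="strict")
            res[key] = value  # pure text, no _b64 key needed
        except UnicodeEncodeError:
            # replace surrogate escapes with "?" for the text key ...
            res[key] = value.encode("utf-8", errors="replace").decode("utf-8")
            # ... and additionally provide the original bytes, base64 encoded
            value_binary = value.encode("utf-8", errors="surrogateescape")
            res[key + "_b64"] = base64.b64encode(value_binary).decode("ascii")
        return res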
using "differenthost" (== not the current hostname) makes
the process_alive check always return True (to play safe,
because in can not check for processes on other hosts).
python's io.BufferedWriter sizes its buffer based on st_blksize.
If the write fits in this buffer, then it's possible the data from
idx.write() has not been flushed through to the underlying filesystem,
and getsize(fileno()) sees a too-short (or even empty) file.
Also, getsize is only documented as accepting path-like objects;
passing a fileno seems to work only because the implementation
blindly forwards everything through to os.stat without checking.
Passing unopened_tempfile avoids all three problems:
- on windows, it doesn't rely on re-opening NamedTemporaryFile
(the issue which led to cc0ad321dc)
- we're following the documented API of getsize(path-like)
- the file is closed (thus flushed) inside idx.write, before getsize()
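In sketch form (unopened_tempfile is a borg test fixture yielding a not-yet-opened path; the pattern is what matters):

    import os

    def write_and_get_size(idx, unopened_tempfile):
        idx.write(unopened_tempfile)  # opens, writes, closes -> data is flushed
        return os.path.getsize(unopened_tempfile)  # documented API: getsize(path-like)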
One cannot "to not x", but one can "not to x".
Avoiding split infinitives gives the added bonus that machine
translation yields better results.
setup (n/adj) vs. set (v) up. We don't say "I setup it", we say "I set it up".
Likewise for login (n/adj) and log (v) in, backup (n/adj) and back (v) up.
\n is automatically converted on write to the platform-dependent os.linesep.
Using os.linesep instead of \n means that on Windows, the line ending becomes "\r\r\n".
Also switches mentions of {LF} to {NL} in code and docs.
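A quick illustration of the double translation (io.StringIO with newline="\r\n" emulates Windows text-mode newline handling):

    import io

    buf = io.StringIO(newline="\r\n")  # emulate Windows text-mode translation
    buf.write("line" + "\r\n")         # writing os.linesep explicitly ...
    assert buf.getvalue() == "line\r\r\n"  # ... the "\n" gets translated again

    buf = io.StringIO(newline="\r\n")
    buf.write("line\n")                  # writing "\n" ...
    assert buf.getvalue() == "line\r\n"  # ... yields the correct line ending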
On Windows, the ":" character cannot be used in a filename.
Python does not error on this because on NTFS the ":" character introduces an alternate data stream.
See https://stackoverflow.com/a/54508979
this option has not changed behaviour for a long time,
we only kept it for API compatibility.
as a borg2 repo server won't have old clients talking to it,
we can safely remove this everywhere now.
we want to be able to use an archive name as a directory name,
e.g. for the FUSE fs built by borg mount.
thus we cannot allow "/" in an archive name on linux.
on windows, the rules are more restrictive, disallowing
quite a few more characters (':<>"|*?' and some more).
we do not have FUSE fs / borg mount on windows yet, but
we better avoid any issues.
we can not avoid ":" though, as our {now} placeholder
generates ISO-8601 timestamps, including ":" chars.
also, we do not want leading/trailing blanks in
archive names, nor surrogate escapes.
control chars are also disallowed, including chr(0).
we have a python str here, thus chr(0) is not expected in there
(it is not used to terminate a string, like it is in C).
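A hedged sketch of these rules (the exact checks and their place in borg may differ):

    import re

    def is_valid_archive_name(name: str) -> bool:
        if name != name.strip():                 # no leading/trailing blanks
            return False
        if re.search(r"[\x00-\x1f\x7f]", name):  # no control chars, incl. chr(0)
            return False
        # "/" (archive name as fuse dir name) plus windows-unsafe chars;
        # ":" must stay allowed, {now} timestamps contain it
        if any(c in '/<>"|*?' for c in name):
            return False
        try:
            name.encode("utf-8", errors="strict")  # no surrogate escapes
        except UnicodeEncodeError:
            return False
        return True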
looks like that chmod should only be done IF we are root (and on linux?).
taking away write permissions on windows/cygwin (and when running as a normal
user) makes create_regular_file fail when it tries to create dir2/file3.
- file status A/M/E counters
- chunking time
- hashing time
- rx_bytes / tx_bytes
Note: the sleep() in the test is needed due to timestamp granularity on linux being much coarser than expected (it uses the system timer, 100 Hz or 250 Hz).
support reading new, improved hashindex header format, fixes #6960
Bit of a pain to work with that code:
- C code
- still needs to be able to read the old hashindex file format,
- while also supporting the new file format.
- the hash computed while reading the file causes additional problems because
it expects every place in the file to be read exactly once and in sequential order.
I solved this by separately opening the file in the python part of the code and
checking for the magic.
BORG_IDX means the legacy file format and legacy layout of the hashtable,
BORG2IDX means the new file format and the new layout of the hashtable.
Done:
- added a version int32 directly after the magic and set it to 2 (like borg 2).
the old header had no version info, but could be denoted as version 1 in case
we ever need it (currently the code decides based on the magic).
- added num_empty as indicated by a TODO in count_empty, so it does not need a
full hashtable scan to determine the number of empty buckets.
- to keep it simpler, I just filled the HashHeader struct with a
`char reserved[1024 - 32];`
1024 being the desired overall header size and 32 being the currently used size.
this alignment might be useful in case we mmap() the hashindex file one day.
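A plausible Python-side sketch, consistent with the sizes given above (32 bytes used, padded to 1024); the authoritative definition is the C HashHeader struct, so treat the field order as an assumption:

    import struct

    HEADER_SIZE = 1024  # desired overall header size, rest is reserved padding

    def read_index_header(fd):
        magic = fd.read(8)
        if magic == b"BORG2IDX":  # new format: fixed 1024 byte header
            version, num_entries, num_buckets, num_empty, key_size, value_size = \
                struct.unpack("<6i", fd.read(24))
            fd.read(HEADER_SIZE - 32)  # skip reserved bytes
            return dict(version=version, num_entries=num_entries,
                        num_buckets=num_buckets, num_empty=num_empty,
                        key_size=key_size, value_size=value_size)
        elif magic == b"BORG_IDX":  # legacy format: short header, no version field
            raise NotImplementedError("handled by the legacy reader")
        raise ValueError("not a hashindex file")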
reads all chunks in on-disk order and recompresses them if they are not already using
the desired compression type and level (and obfuscation level).
supports SIGINT/ctrl-c and --checkpoint-interval (default: 1800s).
this is a borg command that compacts when committing (without this, it would have
a huge space usage). it commits/compacts every checkpoint interval or when
pressing ctrl-c / receiving SIGINT.
we should modify the meta dict given by the caller, so the caller can know
about e.g. the compression/obfuscation that was done (this is useful for rcompress).
when using .scan(limit, marker), we used to use the last chunkid from
the previously returned scan result to remember how far we got and
from where we need to continue.
as this approach used the repo index to look up the respective segment/offset,
it was problematic if the code using scan was re-writing the chunk to
a new segment/offset, updating the repo index (e.g. when recompressing a chunk)
and thus destroying the information about where we need to continue
scanning.
thus, directly returning (segment, offset) as marker is easier and solves this issue.
otherwise, if we scan+get+put (e.g. if we read/modify/write chunks to
recompress them), it would scan past the last commit and run into the
newly written chunks (and potentially never terminate).
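In sketch form, how a caller uses the marker (the exact API shape here is an assumption):

    def scan_all(repo, limit=1000):
        # marker is now an opaque (segment, offset) position in the segment
        # files, so a get+put rewriting a scanned chunk (which updates the
        # repo index) can no longer invalidate it, and scanning stops at
        # the last commit instead of running into newly written chunks.
        marker = None
        while True:
            chunk_ids, marker = repo.scan(limit=limit, marker=marker)
            if not chunk_ids:
                break
            yield from chunk_ids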
the intention of this test is testing whether borg check
returns an error when checking a corrupted repository.
the removed assertions were rather testing the test logging
configuration, which seems flaky:
- when running all tests, assertions failed
- when running only this one test, assertions succeeded
- assertions also succeeded when running all the tests before
they were refactored to separate test modules, although the
test code was not changed, just moved.
legacy: add/remove ctype/clevel bytes prefix of compressed data
new: use a separate metadata dict
compressors: use an int as ID, not a length-1 bytestring
borg < 2:
obj = encrypted(compressed(data))
borg 2:
obj = enc_meta_len32 + encrypted(msgpacked(meta)) + encrypted(compressed(data))
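A minimal sketch of that layout (key.encrypt/decrypt stand in for borg's real crypto API; helper names are hypothetical):

    import struct
    import msgpack

    def format_object(key, meta: dict, compressed_data: bytes) -> bytes:
        enc_meta = key.encrypt(msgpack.packb(meta))
        enc_data = key.encrypt(compressed_data)
        return struct.pack("<I", len(enc_meta)) + enc_meta + enc_data

    def parse_object(key, obj: bytes):
        meta_len = struct.unpack_from("<I", obj)[0]
        meta = msgpack.unpackb(key.decrypt(obj[4:4 + meta_len]))
        data = key.decrypt(obj[4 + meta_len:])  # still compressed at this point
        return meta, data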
handle compr / decompr in repoobj
move the assert_id call from decrypt to RepoObj.parse
also:
- for AEADKeyBase, add a dummy assert_id (not needed here)
- only test assert_id for other if not AEADKeyBase instance
- remove test_getting_wrong_chunk. assert_id is called elsewhere
and is not needed any more anyway with the new AEAD crypto.
- only give manifest (includes key, repo, repo_objs)
- only return manifest from Manifest.load (includes key, repo, repo_objs)
- timezone aware timestamps
- str representation with +HHMM or +HH:MM
- get rid of to_localtime
- fix with_timestamp
- have archive start/end time always in local time with tz or as given
- idea: do not lose tz information
then we know when a backup was made and even from
which timezone it was made. if we want to compute
utc, we can do that using this info.
this makes a quite nice archives list, with timestamps
as expected (in local time with timezone info).
at some places we just enforce utc, like for the
repo manifest timestamp or for the transaction log,
these are usually not looked at by the user.
since python 3.7, .isoformat() is usable IF timespec != "auto"
is given ("auto" [default] would be as evil as before, sometimes
formatting with, sometimes without microseconds).
also since python 3.7, there is now .fromisoformat().
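For illustration:

    from datetime import datetime, timezone

    ts = datetime(2022, 6, 19, 10, 0, 0, tzinfo=timezone.utc)
    s = ts.isoformat(timespec="microseconds")  # deterministic: always microseconds
    assert s == "2022-06-19T10:00:00.000000+00:00"
    assert datetime.fromisoformat(s) == ts     # round-trips, timezone-aware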
implemented by introducing one level of indirection; the limit is now
very high, so it is not practically relevant any more.
we always use the indirection (storing the metadata stream chunk ids list not
directly into the archive item, but into some repo objects referenced by the new
ArchiveItem.item_ptrs list).
thus, the code behaves the same for all archive sizes.
borg2's new repo format does not need computing crc32 over big amounts of
(content) data any more (we now use xxh64 for that).
thus, having a quick crc32 implementation via libdeflate is not important
enough any more to justify having libdeflate as a requirement.
These are legacy crypto modes based on AES-CTR mode:
(repokey|keyfile)[-blake2]
New crypto modes with session keys and AEAD ciphers:
(repokey|keyfile)[-blake2]-(aes-ocb|chacha20-poly1305)
Tests needed some changes:
- most used repokey/keyfile, changed to new modes
- some nonce tests removed, the new crypto code does not generate
the repo side nonces any more (they were only used for AES-CTR)
This saves some segment file random IO that was previously necessary
just to determine the size of the to-be-deleted data.
Keep old one as NSIndex1 for old borg compatibility.
Choose NSIndex or NSIndex1 based on repo index layout from HashHeader.
for an old repo index, repo.get(key) returns (segment, offset, None, None)
see ticket and borg.helpers.msgpack docstring.
this changeset implements the full migration to
msgpack 2.0 spec (use_bin_type=True, raw=False).
still-needed compatibility with the past is done via the want_bytes decoder in borg.item.
borg now has the chunks list in every item with content.
due to the symmetric way borg now deals with hardlinks using
item.hlid, processing gets much simpler.
but some places where borg deals with other "sources" of hardlinks
still need to do some hardlink management:
borg uses the HardLinkManager there now (which is not much more
than a dict, but keeps documentation in one place and avoids some
code duplication we had before).
item.hlid is computed via the hardlink_id function.
support hardlinked symlinks, fixes #2379
as we use item.hlid now to group hardlinks together,
there is no conflict with the item.source usage for
symlink targets any more.
2nd+ hardlinks now add to the files count as did the 1st one.
for borg, now all hardlinks are created equal.
so any hardlink item with chunks now adds to the "file" count.
ItemFormatter: support {hlid} instead of {source} for hardlinks
the id must now always be given correctly because
the AEAD crypto modes authenticate the chunk id.
the special case when id == MANIFEST_ID is now handled
inside assert_id, so we never need to give a None id.
it potentially will ask for the passphrase for the key of OTHERREPO.
for the newly created repo, it will use the same passphrase.
it will copy: enc_key, enc_hmac_key, id_key, chunker_seed.
keeping the id_key (and id algorithm) and the chunker seed (and chunker
algorithm and parameters) is desirable for deduplication.
the id algorithm is usually either HMAC-SHA256 or BLAKE2b.
keeping the enc_key / enc_hmac_key must be implemented carefully:
A) AES-CTR -> AES-CTR is INSECURE due to nonce reuse, thus not allowed.
B) AES-CTR -> AEAD with session keys is secure.
C) AEAD with session keys -> AEAD with session keys is secure.
AEAD modes with session keys: AES-OCB and CHACHA20-POLY1305.
selftest imports testsuite.crypto
I did not realise this and imported pytest from testsuite.crypto.
This broke the selftest.
Solution: move the tests that depend on pytest to testsuite.key.
All three affected tests are tests for the Key classes, so
this is probably a better place for them anyway.
note: this required a slight increase of MAX_OBJECT_SIZE so that MAX_DATA_SIZE
could stay the same as before.
For PUT2, compute the hash over the whole entry (header and content, excluding
hash and crc32 fields, because the crc32 computation includes the hash).
Also: refactor crc32 checks into function, use f-strings, structure _read in
a more logical sequential order.
write_put: avoid creating a large temporary bytes object
why use xxh64?
- fast even without hw acceleration
- borg depends on it already anyway
- stronger than crc32 and strong enough for this purpose
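In sketch form, using the third-party xxhash package for illustration (borg ships its own xxh64 binding); which header bytes get excluded follows the description above, and the caller is assumed to slice them out:

    import xxhash

    def put2_entry_hash(header_wo_hash_crc: bytes, content: bytes) -> bytes:
        # hash over the whole entry, excluding the hash and crc32 fields
        h = xxhash.xxh64()
        h.update(header_wo_hash_crc)
        h.update(content)
        return h.digest()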
Argon2 the second part: implement encryption/decryption of argon2 keys
borg init --key-algorithm=argon2 (new default, older pbkdf2 also still available)
borg key change-passphrase: keep key algorithm the same
borg key change-location: keep key algorithm the same
use env var BORG_TESTONLY_WEAKEN_KDF=1 to limit the resource usage (cpu, memory, ...) of the kdf when running the automated tests.
added a negative lookahead/lookbehind to make sure an ipv6 addr
(enclosed in square brackets) does not get badly matched by the
regex part intended for hostnames and ipv4 addrs only.
the other part of that regex which is actually intended to match
ipv6 addrs only matches if they are enclosed in square brackets.
also added tests for ssh and scp style repo URLs with ipv6 addrs
in brackets.
also: made regex more readable, putting these 2 cases on separate lines.
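A toy version of the lookaround idea (the real borg URL regex is more involved):

    import re

    # host part: hostname or IPv4 addr, must not be a bracketed IPv6 literal
    HOST = r"(?!\[)[^/:]+(?<!\])"
    # IPv6 addrs only match when enclosed in square brackets
    IPV6 = r"\[[0-9a-fA-F:.]+\]"
    host_re = re.compile(rf"^(?:{IPV6}|{HOST})$")

    assert host_re.match("example.org")
    assert host_re.match("192.0.2.1")
    assert host_re.match("[2001:db8::1]")
    assert not host_re.match("2001:db8::1")  # unbracketed IPv6 must not match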
export-tar: just msgpack and b64encode all item metadata and
put that into a BORG-specific PAX header.
this is *in addition* to the standard tar metadata.
import-tar: when detecting the BORG-specific PAX header, just get
all metadata from there (and ignore the standard tar
metadata).
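Sketchwise (the PAX key name here is an assumption, not necessarily the one borg uses):

    import base64
    import msgpack

    def item_to_pax_headers(item_meta: dict) -> dict:
        # carried in addition to the standard tar metadata
        packed = msgpack.packb(item_meta)
        return {"BORG.item.meta_b64": base64.b64encode(packed).decode("ascii")}

    def pax_headers_to_item(pax_headers: dict) -> dict:
        packed = base64.b64decode(pax_headers["BORG.item.meta_b64"])
        return msgpack.unpackb(packed)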