we now just treat the one .borg_part file we might have inside
checkpoint archives as a normal file.
people can recognize from the file name that it is a partial file.
nobody cares about statistics of checkpoint files, and the final
archive no longer contains any partial files, thus there is
no need to maintain statistics about the count and size of part
files.
support reading new, improved hashindex header format, fixes #6960
Bit of a pain to work with that code:
- C code
- needs to still be able to read the old hashindex file format,
- while also supporting the new file format.
- the hash computed while reading the file causes additional problems because
it expects every part of the file to be read exactly once and in sequential order.
I solved this by separately opening the file in the python part of the code and
checking for the magic.
BORG_IDX means the legacy file format and legacy layout of the hashtable,
BORG2IDX means the new file format and the new layout of the hashtable.
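A minimal sketch of the magic check on the python side, assuming a separate
file object is used so the hashing reader is not disturbed. The function name
and the 1/2 return values are made up for illustration, only the two magic
strings come from the notes above.

```python
import io

MAGIC_LEGACY = b"BORG_IDX"  # legacy file format and legacy hashtable layout
MAGIC_V2 = b"BORG2IDX"      # new file format and new hashtable layout


def detect_index_layout(fd):
    """Return 1 for a legacy index and 2 for a new one, judging by the magic only.

    Hypothetical helper: fd is a binary file object opened separately from the
    one that feeds the integrity hash, so reading here does not break the
    "read everything exactly once, in order" expectation of that reader.
    """
    magic = fd.read(len(MAGIC_V2))  # both magics are 8 bytes long
    if magic == MAGIC_LEGACY:
        return 1
    if magic == MAGIC_V2:
        return 2
    raise ValueError(f"unknown hashindex magic: {magic!r}")
```

Usage would be something like `with open(path, "rb") as fd: layout = detect_index_layout(fd)`
before handing the path to the C reader.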
Done:
- added a version int32 directly after the magic and set it to 2 (like borg 2).
the old header had no version info, but could be denoted as version 1 in case
we ever need it (currently it decides based on the magic).
- added num_empty as indicated by a TODO in count_empty, so it does not need a
full hashtable scan to determine the number of empty buckets.
- to keep it simpler, I just filled the HashHeader struct with a
`char reserved[1024 - 32];`
1024 being the desired overall header size and 32 being the currently used size.
this alignment might be useful in case we mmap() the hashindex file one day.
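For illustration, the new header can be sketched with python's struct module.
Assumptions not stated above: little-endian int32 fields, and the order of the
fields after magic and version (num_entries, num_buckets, num_empty, key_size,
value_size). The sizes add up to the numbers given: 8 + 6 * 4 = 32 used bytes,
plus 992 reserved bytes = 1024.

```python
import struct

# Assumed layout: 8-byte magic, six little-endian int32 fields, 992 reserved bytes.
HEADER_V2 = struct.Struct("<8s6i992s")
FIELDS = ("magic", "version", "num_entries", "num_buckets",
          "num_empty", "key_size", "value_size")


def pack_header(num_entries, num_buckets, num_empty, key_size, value_size):
    """Build a 1024-byte header; the reserved area is zero-filled by struct."""
    return HEADER_V2.pack(b"BORG2IDX", 2, num_entries, num_buckets,
                          num_empty, key_size, value_size, b"")


def unpack_header(data):
    """Parse the first 1024 bytes into a dict, dropping the reserved padding."""
    values = HEADER_V2.unpack(data[:HEADER_V2.size])
    return dict(zip(FIELDS, values))  # zip stops before the reserved field
```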
some new stuff is not supported for NSIndex1,
but we can avoid crashing due to function signature mismatches or
missing methods and raise clearer exceptions instead.
This saves some random segment file IO that was previously necessary
just to determine the size of the data to be deleted.
Keep old one as NSIndex1 for old borg compatibility.
Choose NSIndex or NSIndex1 based on repo index layout from HashHeader.
for an old repo index, repo.get(key) returns segment, offset, None, None.
the .get()-like behaviour (== returning the value) was missing.
it's still not 100% like dict.setdefault, because the default value is
required and does not fall back to None. but None doesn't make sense here,
because we usually need an N-tuple matching the hash table's value format.
note: this "bug" (or unusual implementation) was without consequences,
because hashindex.setdefault is not used anywhere in borg, so
it was also not used in a wrong way anywhere.
https://docs.python.org/3/library/stdtypes.html#dict.setdefault
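The intended semantics, sketched in plain python with a dict standing in for
the hash index (the real method lives in the hashindex code; this free function
is only an illustration). The value argument is mandatory, unlike
dict.setdefault.

```python
def setdefault(table, key, value):
    """Return table[key]; if key is missing, store value first and return it.

    value has no default: callers must pass a tuple that matches the
    table's value format.
    """
    if key not in table:
        table[key] = value
    return table[key]
```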
This is a (relatively) simple state machine running in the
data callbacks invoked by the msgpack unpacking stack machine
(the same machine is used in msgpack-c and msgpack-python,
changes are minor and cosmetic, e.g. removal of msgpack_unpack_object,
removal of the C++ template thus porting to C and so on).
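To illustrate the idea only (this is not the C code): a tiny state machine
driven by unpacker callbacks. The callback names, the states and the
(key, size) entry format are hypothetical. The point is that each callback
checks the current state, consumes one item and moves on, so no intermediate
objects need to be built.

```python
from enum import Enum, auto


class State(Enum):
    EXPECT_ENTRY = auto()  # waiting for the start of a 2-element array
    EXPECT_KEY = auto()    # waiting for the key bytes
    EXPECT_SIZE = auto()   # waiting for the size integer


class EntrySync:
    """Hypothetical consumer fed directly by the unpacker's data callbacks."""

    def __init__(self):
        self.state = State.EXPECT_ENTRY
        self.refcounts = {}
        self.sizes = {}
        self._key = None

    def _require(self, expected):
        if self.state is not expected:
            raise ValueError(f"unexpected item in state {self.state.name}")

    def on_array(self, length):
        self._require(State.EXPECT_ENTRY)
        if length != 2:
            raise ValueError("entry must have exactly 2 elements")
        self.state = State.EXPECT_KEY

    def on_bytes(self, data):
        self._require(State.EXPECT_KEY)
        self._key = bytes(data)
        self.state = State.EXPECT_SIZE

    def on_uint(self, value):
        self._require(State.EXPECT_SIZE)
        # Entry complete: bump the refcount and remember the size.
        self.refcounts[self._key] = self.refcounts.get(self._key, 0) + 1
        self.sizes[self._key] = value
        self.state = State.EXPECT_ENTRY
```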
Compared to the previous solution this has multiple advantages
- msgpack-c dependency is removed
- this approach is faster and requires fewer and smaller
memory allocations
Testability of the two solutions does not differ in my
professional opinion(tm).
Two other changes were rolled up; _hashindex.c can be compiled
without Python.h again (handy for fuzzing and testing);
a "small" bug in the cache sync was fixed which allocated too
large archive indices, leading to excessive archive.chunks.d
disk usage (that actually gave me an idea).
also: add some missing assertion messages
severity:
- no issue on little-endian platforms (== most, including x86/x64)
- harmless even on big-endian as long as refcount is below 0xfffbffff,
which is very likely always the case in practice anyway.
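Where a threshold like this can come from: assuming the refcount limit is
0xfffffbff in host byte order (an assumption, the value is not stated above),
using it without a byte swap on a big-endian machine turns it into 0xfffbffff.
A quick check of that arithmetic:

```python
import struct

MAX_VALUE = 0xFFFFFBFF  # assumed refcount limit, not taken from the notes above


def swap32(value):
    """Reverse the byte order of an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack("<I", value))[0]


swapped = swap32(MAX_VALUE)  # what a big-endian host sees without conversion
```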