support reading new, improved hashindex header format, fixes#6960
Bit of a pain to work with that code:
- C code
- needs to still be able to read the old hashindex file format,
- while also supporting the new file format.
- the hash computed while reading the file causes additional problems because
it expects all places in the file get read exactly once and in sequential order.
I solved this by separately opening the file in the python part of the code and
checking for the magic.
BORG_IDX means the legacy file format and legacy layout of the hashtable,
BORG2IDX means the new file format and the new layout of the hashtable.
Done:
- added a version int32 directly after the magic and set it to 2 (like borg 2).
the old header had no version info, but could be denoted as version 1 in case
we ever need it (currently it decides based on the magic).
- added num_empty as indicated by a TODO in count_empty, so it does not need a
full hashtable scan to determine the amount of empty buckets.
- to keep it simpler, I just filled the HashHeader struct with a
`char reserved[1024 - 32];`
1024 being the desired overall header size and 32 being the currently used size.
this alignment might be useful in case we mmap() the hashindex file one day.
warning: src/borg/item.pyx:199:10: cpdef variables will not be supported in Cython 3; currently they are no different from cdef variables
warning: src/borg/item.pyx:200:10: cpdef variables will not be supported in Cython 3; currently they are no different from cdef variables
warning: src/borg/item.pyx:202:10: cpdef variables will not be supported in Cython 3; currently they are no different from cdef variables
this turns all python level classes into extension type classes.
additionally it turns the indirect properties into direct descriptors.
test_propdict_attributes runs about 30% faster.
base memory usage as reported by sys.getsizeof(Item()):
before: 48 bytes, after this PR: 40 bytes
Author: @RonnyPfannschmidt in PR #5763
reads all chunks in on-disk order and recompresses them if they are not already using
the desired compression type and level (and obfuscation level).
supports SIGINT/ctrl-c and --checkpoint-interval (default: 1800s).
this is a borg command that compacts when committing (without this, it would have
a huge space usage). it commits/compacts every checkpoint interval or when
pressing ctrl-c / receiving SIGINT.
we should modify the meta dict given by the caller, so the caller can know
about e.g. the compression/obfuscation that was done (this is useful for rcompress).
some new stuff is not supported for NSIndex1,
but we can avoid crashing due to function signature mismatches or
missing methods and rather have more clear exceptions.
when using .scan(limit, marker), we used to use the last chunkid from
the previously returned scan result to remember how far we got and
from where we need to continue.
as this approach used the repo index to look up the respective segment/offset,
it was problematic if the code using scan was re-writing the chunk to
a new segment/offset, updating the repo index (e.g. when recompressing a chunk)
and basically destroying the memory about from where we need to continue
scanning.
thus, directly returning (segment, offset) as marker is easier and solves this issue.
otherwise, if we scan+get+put (e.g. if we read/modify/write chunks to
recompress them), it would scan past the last commit and run into the
newly written chunks (and potentially never terminate).