test hashtable expansion/rebuild.
hashindex_lookup:
- return -2 for a compact / completely full hashtable
- return -1 and (via start_idx pointer) the deleted/tombstone bucket index.
fix size assertion (we add 1 element to trigger rebuild)
fix upper_limit check - since we'll be adding 1 to num_entries below,
the condition should be >=:
hashindex_compact: set min_empty/upper_limit
Co-authored-by: Dan Christensen <jdc+github@uwo.ca>
hashindex_index returns the perfect hashtable index, but does not
check what's in the bucket there, so we had these loops afterwards
to search for an empty or deleted bucket.
problem: if the HT were completely filled with no empty and no deleted
buckets, that loop would never end. due to our HT resizing, it can
never happen, but still not pretty.
when using hashindex_lookup (as also used some lines above), the code
is easier to understand, because (after we resized the HT), we freshly
create the same situation as after the first call of that function:
- return value < 0, because we (still) can not find the key
- start_idx will point to an empty bucket
Thus, we do not need the problematic loops we had there.
Modified the checks to make sure we really have an empty or deleted
bucket before overwriting it with data.
Added some additional asserts to make sure the code behaves.
support reading new, improved hashindex header format, fixes#6960
Bit of a pain to work with that code:
- C code
- needs to still be able to read the old hashindex file format,
- while also supporting the new file format.
- the hash computed while reading the file causes additional problems because
it expects all places in the file get read exactly once and in sequential order.
I solved this by separately opening the file in the python part of the code and
checking for the magic.
BORG_IDX means the legacy file format and legacy layout of the hashtable,
BORG2IDX means the new file format and the new layout of the hashtable.
Done:
- added a version int32 directly after the magic and set it to 2 (like borg 2).
the old header had no version info, but could be denoted as version 1 in case
we ever need it (currently it decides based on the magic).
- added num_empty as indicated by a TODO in count_empty, so it does not need a
full hashtable scan to determine the amount of empty buckets.
- to keep it simpler, I just filled the HashHeader struct with a
`char reserved[1024 - 32];`
1024 being the desired overall header size and 32 being the currently used size.
this alignment might be useful in case we mmap() the hashindex file one day.
The value argument of hashindex_set is causing harmless pointer type
mismatches. This resolves the issue by changing the type to void*
which silences these types of warnings.
the problem was that after a resize that was triggered by too few
empty buckets, the rebuilt new hash table had entries at different
positions than before, but the idx where to SET the entry was not
recomputed afterwards.
this is needed for correctness because the preprocessor is just
doing text replacement.
This is the correct way:
#define MUL(x, y) ((x) * (y))
MUL(1+2, 3-4) -> ((1+2) * (3-4)) # not: (1+2 * 3-4)
I didn't put parens around all arg usages for readability.
Some stuff (like index) is not expected to be an expression.
Also, when a arg is only used in another macro or function call,
no parens are needed either.
I reviewed the code: no harm was done (yet) due to this fault.
Thanks to @rciorba who found this.
Integer division is slow, and this improves the speed of all operations on the hashmap.
Benchmarked this patch on the rciorba/master-bench branch:
9e5d61e03c/results.html
Py_XDECREF and friends are explicitly written to use op
only once in CPython (and other code relies on this,
Py_XDECREF(something()) is fairly common), but other
implementations don't guarantee this.
So, let's make a rule: don't pass side effects into macros, full stop.
This is a (relatively) simple state machine running in the
data callbacks invoked by the msgpack unpacking stack machine
(the same machine is used in msgpack-c and msgpack-python,
changes are minor and cosmetic, e.g. removal of msgpack_unpack_object,
removal of the C++ template thus porting to C and so on).
Compared to the previous solution this has multiple advantages
- msgpack-c dependency is removed
- this approach is faster and requires fewer and smaller
memory allocations
Testability of the two solutions does not differ in my
professional opinion(tm).
Two other changes were rolled up; _hashindex.c can be compiled
without Python.h again (handy for fuzzing and testing);
a "small" bug in the cache sync was fixed which allocated too
large archive indices, leading to excessive archive.chunks.d
disk usage (that actually gave me an idea).
if there are too many deleted buckets (tombstones), hashtable performance goes down the drain.
in the worst case of 0 empty buckets and lots of tombstones, this results in full table scans for
new / unknown keys.
thus we make sure we always have a good amount of empty buckets.
hashindex_lookup would always hint at skipping whatever it's probe
length had been with no regard for tombstones it had encountered. This
meant new keys would not overwrite first tombstones, but would always
land on empty buckets.
The regression was introduced in #1748