Commit Graph

48 Commits

Author SHA1 Message Date
Dan Christensen 8eff0b2f02 _hashindex.c: fix comment 2023-02-12 08:24:27 -05:00
Dan Christensen 36aa395e49 _hashindex.c: set min_empty and num_empty even when permit_compact=True 2023-02-11 20:11:09 -05:00
Dan Christensen 60bea46eb7 _hashindex.c: rewrite hashindex_compact 2023-02-11 20:10:02 -05:00
Thomas Waldmann 3d57dc0590
hashindex: add tests, misc. minor fixes/changes
test hashtable expansion/rebuild.

hashindex_lookup:
- return -2 for a compact / completely full hashtable
- return -1 and (via start_idx pointer) the deleted/tombstone bucket index.

fix size assertion (we add 1 element to trigger rebuild)

fix upper_limit check - since we'll be adding 1 to num_entries below,
the condition should be >=:

hashindex_compact: set min_empty/upper_limit

Co-authored-by: Dan Christensen <jdc+github@uwo.ca>
2023-02-09 22:14:07 +01:00
Thomas Waldmann 316bb7c937
add num_entries assertion 2023-02-08 22:45:11 +01:00
Thomas Waldmann b79766a933
hashindex: simplify size_idx function
Thanks to @jdchristensen for the code.
2023-02-08 22:11:49 +01:00
Thomas Waldmann a13d53ec1e
Simplify full HT scan assertion 2023-02-08 22:11:47 +01:00
Thomas Waldmann 4fc7815f11
hashindex: always have at least 1 empty bucket
avoid rounding / integer conversion issues bringing this down to 0.
2023-02-08 22:11:44 +01:00
Thomas Waldmann 4f8c4aea19
implement ht idx wrap around less strangely, add comment 2023-02-08 22:11:43 +01:00
Thomas Waldmann 907da00931
if HT is full with entries and tombstones: give up/fail early 2023-02-08 22:11:41 +01:00
Thomas Waldmann 0098ac9e63
more comments for hashindex_lookup 2023-02-08 22:11:34 +01:00
Thomas Waldmann 9bf352d00c
bugfix: do not resize hashindex with wrong num_empty
otherwise we would lose the decrement operation on num_empty.
2023-02-08 22:10:50 +01:00
Thomas Waldmann 83e6b4269e
hashindex: simplify assert 2023-02-08 16:25:15 +01:00
Thomas Waldmann c9573c04ac
_hashindex: easier to understand code, dubious loops removed, asserts
hashindex_index returns the perfect hashtable index, but does not
check what's in the bucket there, so we had these loops afterwards
to search for an empty or deleted bucket.

problem: if the HT were completely filled with no empty and no deleted
buckets, that loop would never end. due to our HT resizing, it can
never happen, but still not pretty.

when using hashindex_lookup (as also used some lines above), the code
is easier to understand, because (after we resized the HT), we freshly
create the same situation as after the first call of that function:
- return value < 0, because we (still) can not find the key
- start_idx will point to an empty bucket

Thus, we do not need the problematic loops we had there.

Modified the checks to make sure we really have an empty or deleted
bucket before overwriting it with data.

Added some additional asserts to make sure the code behaves.
2023-02-08 14:51:54 +01:00
TW c29d4a096b
Hashindex header work, fixes #6960 (#7064)
support reading new, improved hashindex header format, fixes #6960

Bit of a pain to work with that code:
- C code
- needs to still be able to read the old hashindex file format,
- while also supporting the new file format.
- the hash computed while reading the file causes additional problems because
  it expects all places in the file get read exactly once and in sequential order.
  I solved this by separately opening the file in the python part of the code and
  checking for the magic.
  BORG_IDX means the legacy file format and legacy layout of the hashtable,
  BORG2IDX means the new file format and the new layout of the hashtable.

Done:
- added a version int32 directly after the magic and set it to 2 (like borg 2).
  the old header had no version info, but could be denoted as version 1 in case
  we ever need it (currently it decides based on the magic).
- added num_empty as indicated by a TODO in count_empty, so it does not need a
  full hashtable scan to determine the amount of empty buckets.
- to keep it simpler, I just filled the HashHeader struct with a
  `char reserved[1024 - 32];`
  1024 being the desired overall header size and 32 being the currently used size.
  this alignment might be useful in case we mmap() the hashindex file one day.
2022-10-02 14:35:21 +02:00
Thomas Waldmann 350393c9fd remove unused imports 2022-07-05 00:05:07 +02:00
Thomas Waldmann 2157a35212 hashindex_compact: fix eval order (check idx before use), fixes #5899
also: fix "off by one" comment
2022-06-29 20:12:29 +02:00
James Buren 5f3d61e2f0 src/borg/_hashindex.c: fix compiler warnings
The value argument of hashindex_set is causing harmless pointer type
mismatches. This resolves the issue by changing the type to void*
which silences these types of warnings.
2022-02-26 14:02:29 -06:00
Thomas Waldmann fa63150e14 fix bug in hashindex_set on resize, fixes #4829
the problem was that after a resize that was triggered by too few
empty buckets, the rebuilt new hash table had entries at different
positions than before, but the idx where to SET the entry was not
recomputed afterwards.
2020-03-01 17:47:12 +01:00
Thomas Waldmann 7e7ea69e92 add comment about hashtable sizes, fixes #2830 2019-03-20 14:35:05 +01:00
motwok 78a3f2475e Basic MSC Compatibility (#4147)
MSC Compatibility (Native Compile on Windows)
2018-11-03 18:52:54 +01:00
Emmo Emminghaus b2c3bc4d9e removing accidental change 2018-11-02 14:10:23 +01:00
Emmo Emminghaus 718abfd6f8 cleanup void * in _hashindex.c (2nd) 2018-10-27 17:45:10 +02:00
Emmo Emminghaus 9d1276af04 cleanup void * in _hashindex.c 2018-10-27 17:34:25 +02:00
Thomas Waldmann ce4d248e0b add debug_print macro for hashindex debugging, see #3807 2018-05-19 22:26:56 +02:00
Thomas Waldmann a3cecf599f add parens for C preprocessor macro argument usages
this is needed for correctness because the preprocessor is just
doing text replacement.

This is the correct way:

#define MUL(x, y) ((x) * (y))

MUL(1+2, 3-4) -> ((1+2) * (3-4))    # not: (1+2 * 3-4)

I didn't put parens around all arg usages for readability.

Some stuff (like index) is not expected to be an expression.

Also, when a arg is only used in another macro or function call,
no parens are needed either.

I reviewed the code: no harm was done (yet) due to this fault.

Thanks to @rciorba who found this.
2017-12-13 04:01:59 +01:00
Radu Ciorba 12e0f55991 replace modulo with if to check for wraparound in hashmap
Integer division is slow, and this improves the speed of all operations on the hashmap.

Benchmarked this patch on the rciorba/master-bench branch:
9e5d61e03c/results.html
2017-07-20 13:22:34 +03:00
Marian Beermann 019a258709 create _endian.h 2017-07-11 19:12:19 +02:00
Marian Beermann 9a856533ba fuse: versions view, linear numbering by archive time 2017-07-03 12:38:10 +02:00
Marian Beermann 9827578df5 hashindex: don't pass side effect into macro
Py_XDECREF and friends are explicitly written to use op
only once in CPython (and other code relies on this,
Py_XDECREF(something()) is fairly common), but other
implementations don't guarantee this.

So, let's make a rule: don't pass side effects into macros, full stop.
2017-07-02 15:20:52 +02:00
Thomas Waldmann 72ef24cbc0 hashindex: implement KeyError 2017-06-17 23:25:32 +02:00
Marian Beermann 783a5926d6 cache sync: introduce BORG_NO_PYTHON
textshell edition
2017-06-11 20:23:17 +02:00
enkore 13f396d5ad Merge pull request #2638 from enkore/f/fastcachesync-minify
Compact chunks.archive.d
2017-06-10 17:13:25 +02:00
enkore d06ee5648c Merge pull request #2572 from enkore/f/fastcachesync
Improve cache sync speed
2017-06-10 17:12:51 +02:00
Marian Beermann 1d5d50463c hashindex_compact: use memmove for possibly overlapping copy 2017-06-09 12:23:27 +02:00
Marian Beermann 6e011b9354 cache: compact hashindex before writing to chunks.archive.d 2017-06-09 12:23:26 +02:00
Marian Beermann b75c214af5 hashindex: Cython defines PY_SSIZE_T_CLEAN 2017-06-08 15:16:52 +02:00
Marian Beermann c786a5941e CacheSynchronizer: redo as quasi FSM on top of unpack.h
This is a (relatively) simple state machine running in the
data callbacks invoked by the msgpack unpacking stack machine
(the same machine is used in msgpack-c and msgpack-python,
changes are minor and cosmetic, e.g. removal of msgpack_unpack_object,
removal of the C++ template thus porting to C and so on).

Compared to the previous solution this has multiple advantages
- msgpack-c dependency is removed
- this approach is faster and requires fewer and smaller
  memory allocations

Testability of the two solutions does not differ in my
professional opinion(tm).

Two other changes were rolled up; _hashindex.c can be compiled
without Python.h again (handy for fuzzing and testing);
a "small" bug in the cache sync was fixed which allocated too
large archive indices, leading to excessive archive.chunks.d
disk usage (that actually gave me an idea).
2017-06-02 17:43:15 +02:00
Marian Beermann d463dd89aa hashindex: read/write: use hash_part for HashHeader 2017-05-25 17:44:01 +02:00
enkore 6cd7d415ca hashindex: Use Python I/O (#2496)
- Preparation for #1688 / #1101
- Support hash indices >2 GB
- Better error reporting
2017-05-09 21:30:14 +02:00
Thomas Waldmann fd0649767a hashindex: rebuild hashtable if we have too little empty buckets, fixes #2246
if there are too many deleted buckets (tombstones), hashtable performance goes down the drain.
in the worst case of 0 empty buckets and lots of tombstones, this results in full table scans for
new / unknown keys.
thus we make sure we always have a good amount of empty buckets.
2017-03-04 03:09:18 +01:00
TW c6ea34be96 Merge pull request #2111 from ThomasWaldmann/merge-1.0-maint
Merge 1.0-maint
2017-02-01 12:13:37 +01:00
Radu Ciorba a85cf75465 fix wrong skip_hint on hashindex_set when encountering tombstones
hashindex_lookup would always hint at skipping whatever it's probe
length had been with no regard for tombstones it had encountered. This
meant new keys would not overwrite first tombstones, but would always
land on empty buckets.

The regression was introduced in #1748
2017-01-30 23:29:08 +02:00
Thomas Waldmann c0dc644ef6 Merge branch '1.0-maint' into merge-1.0-maint
# Conflicts:
#	MANIFEST.in
#	Vagrantfile
#	docs/changes.rst
#	docs/usage/mount.rst.inc
#	src/borg/archiver.py
#	src/borg/fuse.py
#	src/borg/repository.py
2017-01-29 05:49:53 +01:00
Radu Ciorba 766a43afc3 shortcut hashindex_set by having hashindex_lookup hint about address 2016-10-22 12:33:15 +03:00
Marian Beermann e9a73b808f Check for sufficient free space before committing 2016-07-30 00:04:27 +02:00
Thomas Waldmann 3baa8a3728 Merge branch '1.0-maint'
# Conflicts:
#	docs/changes.rst
#	docs/usage/mount.rst.inc
#	src/borg/archive.py
#	src/borg/archiver.py
#	src/borg/fuse.py
#	src/borg/testsuite/archiver.py
2016-07-11 01:23:27 +02:00
Thomas Waldmann d1ea925a5b move borg package to src/ 2016-05-05 20:19:50 +02:00
Renamed from borg/_hashindex.c (Browse further)