The worst (but frequent) case here is that all or most of the chunks
in the repo need to be recompressed, so storing all chunk ids
in a Python list would need a significant amount of memory for
large repositories.
We already have all chunk ids stored in cache.chunks, so we now just
flag the ones needing re-compression by setting the F_COMPRESS flag
(which does not need any additional memory).
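The idea can be sketched like this (a minimal illustration, not borg's actual data structures; the F_COMPRESS bit value and entry shape are assumptions):

```python
F_COMPRESS = 1 << 1  # hypothetical flag bit meaning "needs re-compression"

class ChunkIndexEntry:
    __slots__ = ("flags", "size")
    def __init__(self, flags=0, size=0):
        self.flags = flags
        self.size = size

# cache.chunks already maps every chunk id to an entry ...
chunks = {b"id1": ChunkIndexEntry(), b"id2": ChunkIndexEntry()}

# ... so flagging a chunk just flips a bit on the existing entry,
# instead of appending its 256-bit id to a separate Python list:
chunks[b"id1"].flags |= F_COMPRESS

needs_recompress = [cid for cid, e in chunks.items() if e.flags & F_COMPRESS]
```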
- ChunkIndex: implement system flags
- ChunkIndex: F_NEW flag as 1st system flag for newly added chunks
- incrementally write only NEW chunks to repo/cache/chunks.*
- merge all chunks.* when loading the ChunkIndex from the repo
Also: the cached ChunkIndex only has the chunk IDs. All values are just dummies.
The ChunkIndexEntry value can be used to set flags and track size, but we
intentionally do not persist flags and size to the cache.
The size information gets set when borg loads the files cache and "compresses"
the chunks lists in the files cache entries. After that, all chunks referenced
by the files cache will have a valid size as long as the ChunkIndex is in memory.
This is needed so that "uncompress" can work.
- doesn't need a separate file for the hash
- we can later write multiple partial chunkindexes to the cache
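A minimal sketch of the incremental persistence idea, with made-up helper names (the F_NEW bit value and the list-of-parts stand-in for chunks.* files are assumptions):

```python
F_NEW = 1 << 0  # hypothetical: first "system" flag, set for newly added chunks

def write_new_part(chunks, parts):
    """Append the ids of F_NEW chunks as a new part, then clear the flag."""
    part = [cid for cid, flags in chunks.items() if flags & F_NEW]
    parts.append(part)  # stands in for writing repo/cache/chunks.<N>
    for cid in part:
        chunks[cid] &= ~F_NEW

def load_merged(parts):
    """Merging all chunks.* parts yields the full set of known chunk ids."""
    merged = {}
    for part in parts:
        for cid in part:
            merged[cid] = 0  # values are dummies, only the chunk ids matter
    return merged
```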
also:
add upgrade code that renames the cache from previous borg versions.
Consider soft-deleted archives/ directory entries, but only create a new
archives/ directory entry if:
- there is no entry for that archive ID
- there is no soft-deleted entry for that archive ID either
Support running with or without --repair.
Without --repair, it can be used to detect such inconsistencies and return with rc != 0.
--repository-only contradicts --find-lost-archives.
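The decision logic could look roughly like this (function and helper names are hypothetical, not borg's actual code):

```python
def make_entry(archive_id):
    return {"id": archive_id}  # placeholder for a real archives/ entry

def find_lost_archives(archive_ids, live_entries, soft_deleted, repair=False):
    """archive_ids: archive metadata object ids found in the repo;
    live_entries / soft_deleted: existing archives/ directory entries."""
    lost = []
    for archive_id in archive_ids:
        # only create a new entry if there is neither a live entry nor a
        # soft-deleted entry for that archive id
        if archive_id in live_entries or archive_id in soft_deleted:
            continue
        lost.append(archive_id)
        if repair:
            live_entries[archive_id] = make_entry(archive_id)
    rc = 0 if (repair or not lost) else 1  # without --repair: rc != 0 signals inconsistency
    return lost, rc
```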
We are only interested in archive metadata objects here, thus for most repo objects
it is enough to read the repoobj's metadata and determine the object's type.
Only if it is the right type of object do we need to read the full object
(metadata and data).
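A rough illustration of the metadata-first scan, with a plain dict standing in for the repository (all names here are assumptions, not borg's actual API):

```python
ARCHIVE = "archive"  # illustrative object type tag

def iter_archives(objects):
    """objects: mapping object id -> (metadata dict, data bytes)."""
    for obj_id, (meta, _data) in objects.items():
        # reading the metadata alone is enough to determine the type ...
        if meta.get("type") != ARCHIVE:
            continue
        # ... only for archive objects do we read the full object
        meta, data = objects[obj_id]
        yield obj_id, meta, data
```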
This reverts commit d3f3082bf4.
Comment by jdchristensen:
I agree that "wipe clean" is correct grammar, but it doesn't match the situation in "unmount cleanly".
The change in this patch is definitely wrong.
Putting it another way, one would never say that we "clean unmount a filesystem".
We say that we "cleanly unmount a filesystem", or in other words, that it "unmounts cleanly".
But the original text is slightly awkward, so I would propose: "When running in the foreground,
^C/SIGINT cleanly unmounts the filesystem, but other signals or crashes do not."
(Not that this guarantees anything, but I'm a native speaker.)
We gave up refcounting quite a while ago and are only interested
in whether a chunk is used (referenced) or not (orphan).
So, let's keep that uint32_t value, but use it for bit flags, so that
we can also use it to efficiently remember other chunk-related information.
If we have an entry for a chunk id in the ChunkIndex,
it means that this chunk exists in the repository.
The code was a bit over-complicated and used entry.refcount
only to detect whether .get(id, default) actually got something
from the ChunkIndex or used the provided default value.
The code does the same now, but in a simpler way.
Additionally, it checks for size consistency if a size is
provided by the caller and a size is already present in
the entry.
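A toy sketch of the flags-plus-size-check behavior (the flag value and class shape are assumptions, not borg's actual ChunkIndex):

```python
F_USED = 1 << 0  # hypothetical flag: chunk is referenced (not an orphan)

class ChunkIndex(dict):
    """Maps chunk id -> (flags, size); the mere presence of an entry
    means the chunk exists in the repository."""

    def add(self, chunk_id, size=0):
        flags, known_size = self.get(chunk_id, (0, 0))
        # check for size consistency if the caller provides a size and
        # the entry already has one
        if size and known_size and size != known_size:
            raise ValueError(f"size mismatch for {chunk_id!r}: {known_size} != {size}")
        self[chunk_id] = (flags | F_USED, known_size or size)
```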
- refactor packing/unpacking of fc entries into separate functions
- instead of a chunks list entry being a tuple of a 256bit id [bytes] and a 32bit size [int],
  only store a stable 32bit index into the kv array of the ChunkIndex (where we also have the id and
  size [and refcount]).
- only done in memory, the on-disk format has (id, size) tuples.
memory consumption (N = entry.chunks list element count, X = overhead for rest of entry):
- previously:
- packed = packb(dict(..., chunks=[(id1, size1), (id2, size2), ...]))
- packed size ~= X + N * (1 + (34 + 5)) Bytes
- now:
- packed = packb(dict(..., chunks=[ix1, ix2, ...]))
- packed size ~= X + N * 5 Bytes
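The in-memory "compress"/"uncompress" idea can be sketched like this (illustrative Python, not borg's implementation; the kv array and index mapping are assumptions):

```python
class ChunkIndex:
    def __init__(self):
        self.kv = []   # stable array of [id, size] pairs
        self.idx = {}  # chunk id -> index into kv

    def add(self, chunk_id, size):
        if chunk_id not in self.idx:
            self.idx[chunk_id] = len(self.kv)
            self.kv.append([chunk_id, size])
        return self.idx[chunk_id]

def compress_chunklist(ci, chunks):
    # (id, size) tuples -> stable 32bit indexes; this also records the
    # sizes in the ChunkIndex so uncompress can recover them later
    return [ci.add(cid, size) for cid, size in chunks]

def uncompress_chunklist(ci, indexes):
    return [tuple(ci.kv[ix]) for ix in indexes]
```

Persisting the (id, size) tuples on disk while only holding 5-byte indexes in packed in-memory entries is what yields the ~X + N * 5 byte figure above.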