
Merge pull request #2530 from enkore/f/compact-revisit@2

Repository compaction docs
enkore 2017-05-22 13:55:04 +02:00 committed by GitHub
commit 58791583d9
4 changed files with 51 additions and 8 deletions

(Two binary files changed, not shown; one is a new 757 KiB image, the compaction figure referenced by the docs below.)

docs/internals/data-structures.rst

@@ -122,11 +122,49 @@ such obsolete entries is called sparse, while a segment containing no such entri
 Since writing a ``DELETE`` tag does not actually delete any data and
 thus does not free disk space any log-based data store will need a
-compaction strategy.
+compaction strategy (somewhat analogous to a garbage collector).
 
-Borg tracks which segments are sparse and does a forward compaction
-when a commit is issued (unless the :ref:`append_only_mode` is
-active).
+Borg uses a simple forward compacting algorithm,
+which avoids modifying existing segments.
+Compaction runs when a commit is issued (unless the :ref:`append_only_mode` is active).
+One client transaction can manifest as multiple physical transactions,
+since compaction is transacted, too, and Borg does not distinguish between the two::
+
+    Perspective| Time -->
+    -----------+--------------
+    Client     | Begin transaction - Modify Data - Commit | <client waits for repository> (done)
+    Repository | Begin transaction - Modify Data - Commit | Compact segments - Commit | (done)
+
+The compaction algorithm requires two inputs in addition to the segments themselves:
+
+(i) Which segments are sparse, to avoid scanning all segments (impractical).
+    Further, Borg uses a conditional compaction strategy: Only those
+    segments that exceed a threshold sparsity are compacted.
+
+    To implement the threshold condition efficiently, the sparsity has
+    to be stored as well. Therefore, Borg stores a mapping
+    ``(segment id,) -> (number of sparse bytes,)``.
+
+    The 1.0.x series used a simpler non-conditional algorithm,
+    which only required the list of sparse segments. Thus,
+    it only stored a list, not the mapping described above.
+(ii) Each segment's reference count, which indicates how many live objects are in a segment.
+     This is not strictly required to perform the algorithm. Rather, it is used to validate
+     that a segment is unused before deleting it. If the algorithm is incorrect, or the reference
+     count was not accounted correctly, then an assertion failure occurs.
+
+These two pieces of information are stored in the hints file (`hints.N`)
+next to the index (`index.N`).
+
+When loading a hints file, Borg checks the version contained in the file.
+The 1.0.x series writes version 1 of the format (with the segments list instead
+of the mapping, mentioned above). Since Borg 1.0.4, version 2 is read as well.
+The 1.1.x series writes version 2 of the format and reads either version.
+When reading a version 1 hints file, Borg 1.1.x will
+read all sparse segments to determine their sparsity.
+
+This process may take some time if a repository is kept in the append-only mode,
+which causes the number of sparse segments to grow. Repositories not in append-only
+mode have no sparse segments in 1.0.x, since compaction is unconditional.
 
 Compaction processes sparse segments from oldest to newest; sparse segments
 which don't contain enough deleted data to justify compaction are skipped. This
@@ -135,8 +173,14 @@ a couple kB were deleted in a segment.
 Segments that are compacted are read in entirety. Current entries are written to
 a new segment, while superseded entries are omitted. After each segment an intermediary
-commit is written to the new segment, data is synced and the old segment is deleted --
-freeing disk space.
+commit is written to the new segment. Then, the old segment is deleted
+(asserting that the reference count diminished to zero), freeing disk space.
+
+A simplified example (excluding conditional compaction and with simpler
+commit logic) showing the principal operation of compaction:
+
+.. figure:: compaction.png
 
 (The actual algorithm is more complex to avoid various consistency issues, refer to
 the ``borg.repository`` module for more comments and documentation on these issues.)
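
To make the principle concrete, here is a deliberately simplified, in-memory Python sketch of the conditional forward compaction described above. Everything in it (the ``Store`` class, the 10% threshold, list-backed segments) is a hypothetical illustration, far simpler than the transactional, file-based implementation in ``borg.repository``::

    from dataclasses import dataclass, field

    SPARSITY_THRESHOLD = 0.10   # hypothetical; Borg's real heuristics live in borg.repository

    @dataclass
    class Store:
        segments: dict = field(default_factory=dict)       # segment id -> [key, data, superseded] entries
        sparse_bytes: dict = field(default_factory=dict)   # input (i): segment id -> sparse bytes
        refcounts: dict = field(default_factory=dict)      # input (ii): segment id -> live objects
        current: int = 0                                   # segment currently being written

        def put(self, key, data):
            self.segments.setdefault(self.current, []).append([key, data, False])
            self.refcounts[self.current] = self.refcounts.get(self.current, 0) + 1

        def delete(self, segment, key):
            for entry in self.segments[segment]:
                if entry[0] == key and not entry[2]:
                    entry[2] = True   # a DELETE tag supersedes the entry; its bytes stay on disk
                    self.refcounts[segment] -= 1
                    self.sparse_bytes[segment] = self.sparse_bytes.get(segment, 0) + len(entry[1])
                    return

        def compact(self):
            self.current = max(self.segments, default=-1) + 1   # forward: never modify old segments
            for segment in sorted(self.sparse_bytes):           # oldest to newest
                entries = self.segments[segment]
                total = sum(len(data) for _, data, _ in entries) or 1
                if self.sparse_bytes[segment] / total < SPARSITY_THRESHOLD:
                    continue          # too little deleted data to justify compacting this one
                for key, data, superseded in entries:           # read the segment in entirety
                    if not superseded:
                        self.put(key, data)                     # carry current entries forward
                        self.refcounts[segment] -= 1
                # (Borg writes an intermediary commit and syncs before deleting.)
                assert self.refcounts[segment] == 0             # validate the segment is unused
                del self.segments[segment], self.refcounts[segment], self.sparse_bytes[segment]

    # Example: segment 0 becomes 50% sparse; compact() carries 'b' forward into
    # segment 1, then deletes segment 0 once its reference count reaches zero.
    s = Store()
    s.put('a', b'x' * 100)
    s.put('b', b'y' * 100)
    s.delete(0, 'a')
    s.compact()
    assert 0 not in s.segments and s.refcounts == {1: 1}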

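In the same spirit, a hedged sketch of the hints-file version handling described above. The msgpack layout and the key names (``version``, ``segments``, ``compact``) are assumptions made for this illustration, not Borg's documented on-disk schema::

    import msgpack   # the hints file is msgpack-encoded (assumed here)

    def sparse_bytes_in(segment):
        # Hypothetical stand-in: the real code reads the whole segment file
        # and sums the sizes of superseded entries.
        raise NotImplementedError

    def load_compact_hints(path):
        """Return the {segment id: sparse bytes} mapping from a hints.N file."""
        with open(path, 'rb') as fd:
            hints = msgpack.unpack(fd)
        if hints['version'] == 1:
            # 1.0.x stored only the list of sparse segments, so each one has
            # to be read to determine its sparsity -- the potentially slow
            # upgrade path described above for append-only repositories.
            return {segment: sparse_bytes_in(segment) for segment in hints['segments']}
        if hints['version'] == 2:
            # 1.1.x stores the mapping directly.
            return hints['compact']
        raise ValueError('unexpected hints file version: %r' % hints['version'])
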
src/borg/repository.py

@@ -31,8 +31,7 @@
 # the header, and the total size was set to 20 MiB).
 MAX_DATA_SIZE = 20971479
-# A few hundred files per directory to go easy on filesystems which don't like too many files per dir (NTFS)
-DEFAULT_SEGMENTS_PER_DIR = 500
+DEFAULT_SEGMENTS_PER_DIR = 2000
 CHUNK_MIN_EXP = 19  # 2**19 == 512kiB
 CHUNK_MAX_EXP = 23  # 2**23 == 8MiB
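
A side note on the constant changed here: segment files are spread over numbered subdirectories of the repository's ``data/`` directory, so ``DEFAULT_SEGMENTS_PER_DIR`` caps how many files end up in a single directory (the removed comment cited filesystems such as NTFS that dislike huge directories). A sketch of the kind of path computation involved, assuming a simple integer-division layout; the authoritative code is in ``borg.repository``::

    import os

    def segment_filename(repo_path, segment, segments_per_dir=2000):
        # Integer division groups consecutive segment ids, so each data/<n>/
        # subdirectory holds at most `segments_per_dir` segment files.
        return os.path.join(repo_path, 'data', str(segment // segments_per_dir), str(segment))

    print(segment_filename('/backups/repo', 4711))   # /backups/repo/data/2/4711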