reorder to be more logical and more gentle

2025-01-24 08:19:03 +00:00 · 2014-12-16 10:35:48 -05:00
commit 688ba109ef
parent ddca3b856b
1 changed files with 55 additions and 49 deletions
--- a/docs/internals.rst
+++ b/docs/internals.rst
@@ -33,43 +33,25 @@ index.%d
  cache of the file indexes. those files can be regenerated with
  ``check --repair``

-Indexes and memory usage
------------------------
+Repository config file
+----------------------

-Repository index
-  40 bytes x N ~ 200MB (If a remote repository is
-  used this will be allocated on the remote side)
+Each repository has a ``config`` file, which is an ``INI``-formatted
+file that looks like this::

-Chunk lookup index
-  44 bytes x N ~ 220MB
+  [repository]
+  version = 1
+  segments_per_dir = 10000
+  max_segment_size = 5242880
+  id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6

-File chunk cache
-  probably 80-100 bytes x N ~ 400MB
+This is where the ``repository.id`` is stored. It is a unique
+identifier for the repository and does not change if you move the
+repository around, so you can make a local transfer and later decide
+to move the repository to another (even remote) location.

-The chunk lookup index (chunk hash -> reference count, size, ciphered
-size ; in file cache/chunk) and the repository index (chunk hash ->
-segment, offset ; in file ``repo/index.%d``) are stored in a sort of hash
-table, directly mapped in memory from the file content, with only one
-slot per bucket, but that spreads the collisions to the following
-buckets. As a consequence the hash is just a start position for a linear
-search, and if the element is not in the table the index is linearly
-crossed until an empty bucket is found. When the table is full at 90%
-its size is doubled, when it's empty at 25% its size is halfed. So
-operations on it have a variable complexity between constant and linear
-with low factor, and memory overhead varies between 10% and 300%.
-
-The file chunk cache (file path hash -> age, inode number, size,
-mtime_ns, chunks hashes ; in file cache/files) is stored as a python
-associative array storing python objects, which generate a lot of
-overhead. This takes around 240 bytes per file without the chunk
-list, to be compared to at most 64 bytes of real data (depending on data
-alignment), and around 80 bytes per chunk hash (vs 32), with a minimum
-of ~250 bytes even if only one chunck hash. The inode number is stored
-to make sure we distinguish between different files, as a single path
-may not be unique accross different archives in different setups.
-
-The ``index.%d`` files are random access but those files can be
-recreated if damaged or lost using ``check --repair``.
+|project_name| takes a POSIX read lock on that file while operating
+on the repository.

 Repository structure
 --------------------
@@ -115,7 +97,7 @@ beyond the ``MAX_OBJECT_SIZE`` barrier of 20MB.
 A chunk is an object as well, of course, and its id is the hash of its
 (unencrypted and uncompressed) content.

-Hints are stored in a file (repo/hints) and contain: version, list of
+Hints are stored in a file (``repo/hints``) and contain: version, list of
 segments, compact.

 Chunks
@@ -128,25 +110,49 @@ average. All these parameters are fixed. The buzhash table is altered
 by XORing it with a seed randomly generated once for the archive, and
 stored encrypted in the keyfile.

-Repository config file
----------------------
+Indexes
+-------

-Each repository has a ``config`` file which which is a ``INI``
-formatted file which looks like this:
+The chunk lookup index (chunk hash -> reference count, size, ciphered
+size; in file ``cache/chunk``) and the repository index (chunk hash ->
+segment, offset; in file ``repo/index.%d``) are stored in a sort of hash
+table, directly mapped in memory from the file content, with only one
+slot per bucket; collisions spill over into the following buckets.
+As a consequence the hash is just a start position for a linear
+search, and if the element is not in the table the index is scanned
+linearly until an empty bucket is found. When the table is 90% full
+its size is doubled; when it is only 25% full its size is halved. So
+operations on it have a variable complexity between constant and linear
+with low factor, and memory overhead varies between 10% and 300%.

-  [repository]
-  version = 1
-  segments_per_dir = 10000
-  max_segment_size = 5242880
-  id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
+The file chunk cache (file path hash -> age, inode number, size,
+mtime_ns, chunk hashes; in file ``cache/files``) is stored as a Python
+associative array storing Python objects, which generates a lot of
+overhead. This takes around 240 bytes per file without the chunk
+list, to be compared to at most 64 bytes of real data (depending on data
+alignment), and around 80 bytes per chunk hash (vs 32), with a minimum
+of ~250 bytes even if there is only one chunk hash. The inode number is
+stored to make sure we distinguish between different files, as a single
+path may not be unique across different archives in different setups.

-This is where the ``repository.id`` is stored. It is a unique
-identifier for repositories. It will not change if you move the
-repository around so you can make a local transfer then decide to move
-the repository in another (even remote) location at a later time.
+The ``index.%d`` files are random access but those files can be
+recreated if damaged or lost using ``check --repair``.

-|project_name| will do a POSIX read lock on that file when operating
-on the repository.
+Indexes memory usage
+--------------------
+
+Here is the estimated memory usage of |project_name| when using those
+indexes:
+
+Repository index
+  40 bytes x N ~ 200MB (If a remote repository is
+  used this will be allocated on the remote side)
+
+Chunk lookup index
+  44 bytes x N ~ 220MB
+
+File chunk cache
+  probably 80-100 bytes x N ~ 400MB

 Encryption
 ----------