restic/doc/Design.md

379 lines
17 KiB
Markdown
Raw Normal View History

2015-04-07 18:09:25 +00:00
This document gives a high-level overview of the design and repository layout
of the restic backup program.
2015-01-02 17:43:59 +00:00
2015-01-02 21:45:02 +00:00
Repository Format
=================
2015-01-02 17:43:59 +00:00
2015-04-07 17:57:06 +00:00
All data is stored in a restic repository. A repository is able to store data
in blobs of several different types, which can later be requested based on an
ID. The ID is the hash (SHA-256) of the content of a blob. All blobs in a
repository are only written once and never modified afterwards. This allows
accessing and even writing to the repository with multiple clients in parallel.
Only the delete operation changes data in the repository.
2015-01-03 14:24:47 +00:00
At the time of writing, the only implemented repository type is based on
directories and files. Such repositories can be accessed locally on the same
system or via the integrated SFTP client. The directory layout is the same for
both access methods. This repository type is described in the following.
Repositories consists of several directories and a file called `version`. This
file contains the version number of the repository. At the moment, this file
is expected to hold the string `1`, with an optional newline character.
2015-04-07 17:57:06 +00:00
Additionally there is a file named `id` which contains 32 random bytes, encoded
in hexadecimal. This uniquely identifies the repository, regardless if it is
accessed via SFTP or locally.
2015-01-03 14:24:47 +00:00
For all other blobs stored in the repository, the name for the file is the
lower case hexadecimal representation of the SHA-256 hash of the file's
contents. This allows easily checking all files for accidental modifications
like disk read errors by simply running the program `sha256sum` and comparing
its output to the file name. If the prefix of a filename is unique amongst all
the other files in the same directory, the prefix may be used instead of the
complete filename.
2015-04-07 17:57:06 +00:00
Apart from the files `version`, `id` and the files stored below the `keys`
directory, all files are encrypted with AES-256 in counter mode (CTR). The
integrity of the encrypted data is secured by an Poly1305-AES signature.
2015-01-03 14:24:47 +00:00
In the first 16 bytes of each encrypted file the initialisation vector (IV) is
stored. It is followed by the encrypted data and completed by the 16 byte MAC
signature. The format is: `IV || CIPHERTEXT || MAC`. The complete encryption
2015-03-22 21:46:38 +00:00
overhead is 32 byte. For each file, a new random IV is selected.
2015-01-02 17:43:59 +00:00
The basic layout of a sample restic repository is shown below:
/tmp/restic-repo
├── data
│ ├── 59
│ │ └── 59fe4bcde59bd6222eba87795e35a90d82cd2f138a27b6835032b7b58173a426
│ ├── 73
│ │ └── 73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c
│ [...]
2015-04-07 17:57:06 +00:00
├── id
2015-04-08 19:42:13 +00:00
├── index
│ └── c38f5fb68307c6a3e3aa945d556e325dc38f5fb68307c6a3e3aa945d556e325d
2015-01-02 17:43:59 +00:00
├── keys
│ └── b02de829beeb3c01a63e6b25cbd421a98fef144f03b9a02e46eff9e2ca3f0bd7
├── locks
├── snapshots
│ └── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec
├── tmp
├── trees
│ ├── 21
│ │ └── 2159dd48f8a24f33c307b750592773f8b71ff8d11452132a7b2e2a6a01611be1
│ ├── 32
│ │ └── 32ea976bc30771cebad8285cd99120ac8786f9ffd42141d452458089985043a5
│ ├── 95
│ │ └── 95f75feb05a7cc73e328b2efa668b1ea68f65fece55a93bc65aff6cd0bcfeefc
2015-01-17 15:57:16 +00:00
│ ├── b8
│ │ └── b8138ab08a4722596ac89c917827358da4672eac68e3c03a8115b88dbf4bfb59
│ ├── e0
│ │ └── e01150928f7ad24befd6ec15b087de1b9e0f92edabd8e5cabb3317f8b20ad044
│ [...]
2015-01-02 17:43:59 +00:00
└── version
2015-01-02 21:45:02 +00:00
A repository can be initialized with the `restic init` command, e.g.:
$ restic -r /tmp/restic-repo init
2015-01-02 21:45:02 +00:00
2015-04-08 19:42:13 +00:00
Blob Format
-----------
All blobs except key, tree and data blobs just contain raw data, stored as `IV
|| Ciphertext || MAC`. Tree and Data blobs may contain several chunks of data.
The format is described in the following.
The blob starts with a nonce and a header, the header describes the content and
is encrypted and signed. The blob's structure is as follows:
NONCE || Header_Length ||
IV_Header || Ciphertext_Header || MAC_Header ||
IV_Chunk_1 || Ciphertext_Chunk_1 || MAC_Chunk_1 ||
[...]
IV_Chunk_n || Ciphertext_Chunk_n || MAC_Chunk_n ||
MAC
`NONCE` consists of 16 bytes and `Header_Length` is a four byte integer in
little-endian encoding.
All the parts (`Ciphertext_Header`, `Ciphertext_Chunk1` etc.) are signed and
encrypted independently. In addition, the complete blob is signed again using
`NONCE`. This enables repository reorganisation without having to touch the
encrypted chunks. In addition it also allows efficient indexing, for only the
header needs to be read in order to find out which chunks are contained in the
blob. Since the header is signed, authenticity of the header can be checked
without having to read the complete blob.
After decryption, a blob's header consists of the following elements:
Length(IV_Chunk1+Ciphertext_Chunk1+MAC_Chunk1) || Hash(Plaintext_Chunk1) ||
[...]
Length(IV_Chunk_n+Ciphertext_Chunk_n+MAC_Chunk_n) || Hash(Plaintext_Chunk_n) ||
This is enough to calculate the offsets for all the chunks in the blob. Length
is the length of the chunk as a four byte integer in little-endian format.
Indexing
--------
Index blobs pack together information about data and tree blobs and stores this
information in the repository. When the local cached index is not accessible
any more, the index files can be downloaded and used to reconstruct the index.
The index blobs are encrypted and signed like data and tree blobs, so the outer
structure is `IV || Ciphertext || MAC` again. The plaintext consists of a JSON
document like the following:
2015-04-09 17:59:16 +00:00
{
"73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c":
[
"3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce",
"9ccb846e60d90d4eb915848add7aa7ea1e4bbabfc60e573db9f7bfb2789afbae",
"d3dc577b4ffd38cc4b32122cabf8655a0223ed22edfd93b353dc0c3f2b0fdf66"
]
}
2015-04-08 19:42:13 +00:00
This JSON document lists all the blobs with the contents. In this example, the
blob `73d04e61` contains three chunks, the plaintext hashes are listed afterwards.
Keys, Encryption and MAC
------------------------
2015-01-02 21:45:02 +00:00
2015-03-22 20:26:14 +00:00
All data stored by restic in the repository is encrypted with AES-256 in
counter mode and signed with Poly1305-AES. For encrypting new data first 16
bytes are read from a cryptographically secure pseudorandom number generator as
a random nonce. This is used both as the IV for counter mode and the nonce for
Poly1305. This operation needs three keys: A 32 byte for AES-256 for
encryption, a 16 byte AES key and a 16 byte key for Poly1305. For details see
2015-03-22 21:46:38 +00:00
the original paper [The Poly1305-AES message-authentication
2015-03-22 20:29:19 +00:00
code](http://cr.yp.to/mac/poly1305-20050329.pdf) by Dan Bernstein.
The data is then encrypted with AES-256 and afterwards the MAC is computed over
the ciphertext, everything is then stored as IV || CIPHERTEXT || MAC.
2015-03-22 20:26:14 +00:00
2015-01-03 14:24:47 +00:00
The directory `keys` contains key files. These are simple JSON documents which
contain all data that is needed to derive the repository's master signing and
encryption keys from a user's password. The JSON document from the repository
can be pretty-printed for example by using the Python module `json` (shortened
to increase readability):
2015-01-02 17:43:59 +00:00
$ python -mjson.tool /tmp/restic-repo/keys/b02de82*
{
"hostname": "kasimir",
"username": "fd0"
"kdf": "scrypt",
"N": 65536,
"r": 8,
"p": 1,
"created": "2015-01-02T18:10:13.48307196+01:00",
"data": "tGwYeKoM0C4j4/9DFrVEmMGAldvEn/+iKC3te/QE/6ox/V4qz58FUOgMa0Bb1cIJ6asrypCx/Ti/pRXCPHLDkIJbNYd2ybC+fLhFIJVLCvkMS+trdywsUkglUbTbi+7+Ldsul5jpAj9vTZ25ajDc+4FKtWEcCWL5ICAOoTAxnPgT+Lh8ByGQBH6KbdWabqamLzTRWxePFoYuxa7yXgmj9A==",
"salt": "uW4fEI1+IOzj7ED9mVor+yTSJFd68DGlGOeLgJELYsTU5ikhG/83/+jGd4KKAaQdSrsfzrdOhAMftTSih5Ux6w==",
}
2015-01-03 14:24:47 +00:00
When the repository is opened by restic, the user is prompted for the
repository password. This is then used with `scrypt`, a key derivation function
(KDF), and the supplied parameters (`N`, `r`, `p` and `salt`) to derive 64 key
bytes. The first 32 bytes are used as the encryption key (for AES-256) and the
last 32 bytes are used as the signing key (for Poly1305-AES). These last 32
bytes are divided into a 16 byte AES key `k` followed by 16 bytes of secret key
2015-03-22 20:26:14 +00:00
`r`. They key `r` is then masked for use with Poly1305 (see the paper for
details).
2015-01-02 21:45:02 +00:00
This signing key is used to compute a MAC over the bytes contained in the
2015-01-03 14:24:47 +00:00
JSON field `data` (after removing the Base64 encoding and not including the
last 32 byte). If the password is incorrect or the key file has been tampered
with, the computed MAC will not match the last 16 bytes of the data, and
2015-01-03 14:24:47 +00:00
restic exits with an error. Otherwise, the data is decrypted with the
encryption key derived from `scrypt`. This yields a JSON document which
2015-04-07 18:09:25 +00:00
contains the master signing and encryption keys for this repository (encoded in
Base64) and the polynomial that is used for CDC. The command `restic cat
masterkey` can be used as follows to decrypt and pretty-print the master key:
$ restic -r /tmp/restic-repo cat masterkey
{
"sign": {
"k": "evFWd9wWlndL9jc501268g==",
"r": "E9eEDnSJZgqwTOkDtOp+Dw=="
},
2015-04-07 18:09:25 +00:00
"encrypt": "UQCqa0lKZ94PygPxMRqkePTZnHRYh1k1pX2k2lM2v3Q=",
"chunker_polynomial": "2f0797d9c2363f"
}
All data in the repository is encrypted and signed with these master keys with
AES-256 in Counter mode and signed with Poly1305-AES as described above.
2015-01-02 21:45:02 +00:00
A repository can have several different passwords, with a key file for each.
This way, the password can be changed without having to re-encrypt all data.
Snapshots
---------
2015-01-02 17:43:59 +00:00
2015-01-03 14:24:47 +00:00
A snapshots represents a directory with all files and sub-directories at a
given point in time. For each backup that is made, a new snapshot is created. A
2015-03-02 08:56:56 +00:00
snapshot is a JSON document that is stored in an encrypted file below the
directory `snapshots` in the repository. The filename is the SHA-256 hash of
the (encrypted) contents. This string is unique and used within restic to
uniquely identify a snapshot.
2015-01-02 21:45:02 +00:00
2015-01-03 14:24:47 +00:00
The command `restic cat snapshot` can be used as follows to decrypt and
pretty-print the contents of a snapshot file:
2015-01-02 21:45:02 +00:00
$ restic -r /tmp/restic-repo cat snapshot 22a5af1b
Enter Password for Repository:
{
"time": "2015-01-02T18:10:50.895208559+01:00",
2015-04-08 19:42:13 +00:00
"tree": "2da81727b6585232894cfbb8f8bdab8d1eccd3d8f7c92bc934d62e62e618ffdf",
2015-01-02 21:45:02 +00:00
"dir": "/tmp/testdata",
"hostname": "kasimir",
"username": "fd0",
"uid": 1000,
"gid": 100
}
2015-01-03 14:24:47 +00:00
Here it can be seen that this snapshot represents the contents of the directory
2015-01-17 15:57:16 +00:00
`/tmp/testdata`. The most important field is `tree`.
2015-01-02 21:45:02 +00:00
2015-01-03 14:24:47 +00:00
All content within a restic repository is referenced according to its SHA-256
hash. Before saving, each file is split into variable sized chunks of data. The
SHA-256 hashes of all chunks are saved in an ordered list which then represents
2015-01-17 15:57:16 +00:00
the content of the file.
2015-01-02 21:45:02 +00:00
2015-01-17 15:57:16 +00:00
In order to relate these plain text hashes to the actual encrypted storage
2015-04-08 19:42:13 +00:00
hashes (which vary due to random IVs), an index is used. If the index is not
available, the header of all data blobs can be read.
2015-01-02 21:45:02 +00:00
Trees and Data
--------------
2015-01-17 15:57:16 +00:00
A snapshot references a tree by the SHA-256 hash of the JSON string
representation of its contents. Trees are saved in a subdirectory of the
directory `trees`. The sub directory's name is the first two characters of the
filename the tree object is stored in.
2015-01-02 21:45:02 +00:00
The command `restic cat tree` can be used to inspect the tree referenced above:
2015-01-17 15:57:16 +00:00
$ restic -r /tmp/restic-repo cat tree b8138ab08a4722596ac89c917827358da4672eac68e3c03a8115b88dbf4bfb59
2015-01-02 21:45:02 +00:00
Enter Password for Repository:
2015-01-17 15:57:16 +00:00
{
"nodes": [
{
"name": "testdata",
"type": "dir",
"mode": 493,
"mtime": "2014-12-22T14:47:59.912418701+01:00",
"atime": "2014-12-06T17:49:21.748468803+01:00",
"ctime": "2014-12-22T14:47:59.912418701+01:00",
"uid": 1000,
"gid": 100,
"user": "fd0",
"inode": 409704562,
"content": null,
"subtree": "b26e315b0988ddcd1cee64c351d13a100fedbc9fdbb144a67d1b765ab280b4dc"
}
]
}
A tree contains a list of entries (in the field `nodes`) which contain meta
data like a name and timestamps. When the entry references a directory, the
2015-04-08 19:42:13 +00:00
field `subtree` contains the plain text ID of another tree object.
2015-01-17 15:57:16 +00:00
When the command `restic cat tree` is used, the storage hash is needed to print
a tree. The tree referenced above can be dumped as follows:
$ restic -r /tmp/restic-repo cat tree 8b238c8811cc362693e91a857460c78d3acf7d9edb2f111048691976803cf16e
2015-01-03 14:24:47 +00:00
Enter Password for Repository:
2015-01-17 15:57:16 +00:00
{
"nodes": [
{
"name": "testfile",
"type": "file",
"mode": 420,
"mtime": "2014-12-06T17:50:23.34513538+01:00",
"atime": "2014-12-06T17:50:23.338468713+01:00",
"ctime": "2014-12-06T17:50:23.34513538+01:00",
"uid": 1000,
"gid": 100,
"user": "fd0",
"inode": 416863351,
"size": 1234,
"links": 1,
"content": [
"50f77b3b4291e8411a027b9f9b9e64658181cc676ce6ba9958b95f268cb1109d"
]
},
[...]
]
}
2015-01-02 21:45:02 +00:00
2015-01-17 15:57:16 +00:00
This tree contains a file entry. This time, the `subtree` field is not present
2015-04-08 19:42:13 +00:00
and the `content` field contains a list with one plain text SHA-256 hash.
2015-01-02 21:45:02 +00:00
2015-01-17 15:57:16 +00:00
The command `restic cat data` can be used to extract and decrypt data given a
storage hash, e.g. for the data mentioned above:
2015-01-02 21:45:02 +00:00
2015-01-17 15:57:16 +00:00
$ restic -r /tmp/restic-repo cat blob 00634c46e5f7c055c341acd1201cf8289cabe769f991d6e350f8cd8ce2a52ac3 | sha256sum
2015-01-03 14:24:47 +00:00
Enter Password for Repository:
2015-01-02 21:45:02 +00:00
50f77b3b4291e8411a027b9f9b9e64658181cc676ce6ba9958b95f268cb1109d -
2015-01-17 15:57:16 +00:00
As can be seen from the output of the program `sha256sum`, the hash matches the
plaintext hash from the map included in the tree above, so the correct data has
been returned.
2015-01-02 17:43:59 +00:00
Backups and Deduplication
=========================
2015-01-03 14:24:47 +00:00
For creating a backup, restic scans the target directory for all files,
sub-directories and other entries. The data from each file is split into
variable length chunks cut at offsets defined by a sliding window of 64 byte.
The implementation uses Rabin Fingerprints for implementing this Content
2015-04-07 18:09:25 +00:00
Defined Chunking (CDC). An irreducible polynomial is selected at random when a
repository is initialized.
2015-01-03 14:24:47 +00:00
Files smaller than 512 KiB are not split, chunks are of 512 KiB to 8 MiB in
size. The implementation aims for 1 MiB chunk size on average.
For modified files, only modified chunks have to be saved in a subsequent
backup. This even works if bytes are inserted or removed at arbitrary positions
within the file.
2015-01-02 17:43:59 +00:00
2015-01-03 14:28:33 +00:00
Threat Model
============
The design goals for restic include being able to securely store backups in a
location that is not completely trusted, e.g. a shared system where others can
potentially access the files or (in the case of the system administrator) even
modify or delete them.
General assumptions:
* The host system a backup is created on is trusted. This is the most basic
requirement, and essential for creating trustworthy backups.
The restic backup program guarantees the following:
2015-01-03 16:05:37 +00:00
* Accessing the unencrypted content of stored files and meta data should not
be possible without a password for the repository. Everything except the
2015-04-07 17:57:06 +00:00
`version` and `id` files and the meta data included for informational
purposes in the key files is encrypted and then signed.
2015-01-03 14:28:33 +00:00
* Modifications (intentional or unintentional) can be detected automatically
on several layers:
1. For all accesses of data stored in the repository it is checked whether
the cryptographic hash of the contents matches the storage ID (the
file's name). This way, modifications (bad RAM, broken harddisk) can be
detected easily.
2. Before decrypting any data, the MAC signature on the encrypted data is
2015-01-03 14:28:33 +00:00
checked. If there has been a modification, the signature check will
fail. This step happens even before the data is decrypted, so data that
has been tampered with is not decrypted at all.
However, the restic backup program is not designed to protect against attackers
deleting files at the storage location. There is nothing that can be done about
this. If this needs to be guaranteed, get a secure location without any access
from third parties. If you assume that attackers have write access to your
files at the storage location, attackers are able to figure out (e.g. based on
the timestamps of the stored files) which files belong to what snapshot. When
only these files are deleted, the particular snapshot vanished and all
snapshots depending on data that has been added in the snapshot cannot be
restored completely. Restic is not designed to detect this attack.