borg/docs/security.rst


.. somewhat surprisingly the "bash" highlighter gives nice results with
   the pseudo-code notation used in the "Encryption" section.

.. highlight:: bash

========
Security
========

Cryptography in Borg
====================

Attack model
------------

The attack model of Borg is that the environment of the client process
(e.g. ``borg create``) is trusted and the repository (server) is not. The
attacker has any and all access to the repository, including interactive
manipulation (man-in-the-middle) for remote repositories.

Furthermore the client environment is assumed to be persistent across
attacks (practically this means that the security database cannot be
deleted between attacks).

Under these circumstances Borg guarantees that the attacker cannot

1. modify the data of any archive without the client detecting the change
2. rename, remove or add an archive without the client detecting the change
3. recover plain-text data
4. recover definite (heuristics based on access patterns are possible)
   structural information such as the object graph (which archives
   refer to what chunks)

The attacker can always impose a denial of service per definition (he could
forbid connections to the repository, or delete it entirely).

Structural Authentication
-------------------------

Borg is fundamentally based on an object graph structure (see :ref:`internals`),
where the root object is called the manifest.

Borg follows the `Horton principle`_, which states that
not only the message must be authenticated, but also its meaning (often
expressed through context), because every object used is referenced by a
parent object through its object ID up to the manifest. The object ID in
Borg is a MAC of the object's plaintext, therefore this ensures that
an attacker cannot change the context of an object without forging the MAC.

In other words, the object ID itself only authenticates the plaintext of the
object and not its context or meaning. The latter is established by a different
object referring to an object ID, thereby assigning a particular meaning to
an object. For example, an archive item contains a list of object IDs that
represent packed file metadata. On their own it's not clear that these objects
would represent what they do, but by the archive item referring to them
in a particular part of its own data structure assigns this meaning.

This results in a directed acyclic graph of authentication from the manifest
to the data chunks of individual files.

.. rubric:: Authenticating the manifest

Since the manifest has a fixed ID (000...000) the aforementioned authentication
does not apply to it, indeed, cannot apply to it; it is impossible to authenticate
the root node of a DAG through its edges, since the root node has no incoming edges.

With the scheme as described so far an attacker could easily replace the manifest,
therefore Borg includes a tertiary authentication mechanism (TAM) that is applied
to the manifest since version 1.0.9 (see :ref:`tam_vuln`).

TAM works by deriving a separate key through HKDF_ from the other encryption and
authentication keys and calculating the HMAC of the metadata to authenticate [#]_::

    # RANDOM(n) returns n random bytes
    salt = RANDOM(64)

    ikm = id_key || enc_key || enc_hmac_key
    # *context* depends on the operation, for manifest authentication it is
    # the ASCII string "borg-metadata-authentication-manifest".
    tam_key = HKDF-SHA-512(ikm, salt, context)

    # *data* is a dict-like structure
    data[hmac] = zeroes
    packed = pack(data)
    data[hmac] = HMAC(tam_key, packed)
    packed_authenticated = pack(data)

Since an attacker cannot gain access to this key and also cannot make the
client authenticate arbitrary data using this mechanism, the attacker is unable
to forge the authentication.

This effectively 'anchors' the manifest to the key, which is controlled by the
client, thereby anchoring the entire DAG, making it impossible for an attacker
to add, remove or modify any part of the DAG without Borg being able to detect
the tampering.

Note that when using BORG_PASSPHRASE the attacker cannot swap the *entire*
repository against a new repository with e.g. repokey mode and no passphrase,
because Borg will abort access when BORG_PASSPRHASE is incorrect.

However, interactively a user might not notice this kind of attack
immediately, if she assumes that the reason for the absent passphrase
prompt is a set BORG_PASSPHRASE. See issue :issue:`2169` for details.

.. [#] The reason why the authentication tag is stored in the packed
       data itself is that older Borg versions can still read the
       manifest this way, while a changed layout would have broken
       compatibility.

Encryption
----------

Encryption is currently based on the Encrypt-then-MAC construction,
which is generally seen as the most robust way to create an authenticated
encryption scheme from encryption and message authentication primitives.

Every operation (encryption, MAC / authentication, chunk ID derivation)
uses independent, random keys generated by `os.urandom`_ [#]_.

Borg does not support unauthenticated encryption -- only authenticated encryption
schemes are supported. No unauthenticated encryption schemes will be added
in the future.

Depending on the chosen mode (see :ref:`borg_init`) different primitives are used:

- The actual encryption is currently always AES-256 in CTR mode. The
  counter is added in plaintext, since it is needed for decryption,
  and is also tracked locally on the client to avoid counter reuse.

- The authentication primitive is either HMAC-SHA-256 or BLAKE2b-256
  in a keyed mode. HMAC-SHA-256 uses 256 bit keys, while BLAKE2b-256
  uses 512 bit keys.

  The latter is secure not only because BLAKE2b itself is not
  susceptible to `length extension`_, but also since it truncates the
  hash output from 512 bits to 256 bits, which would make the
  construction safe even if BLAKE2b were broken regarding length
  extension or similar attacks.

- The primitive used for authentication is always the same primitive
  that is used for deriving the chunk ID, but they are always
  used with independent keys.

Encryption::

    id = AUTHENTICATOR(id_key, data)
    compressed = compress(data)

    iv = reserve_iv()
    encrypted = AES-256-CTR(enc_key, 8-null-bytes || iv, compressed)
    authenticated = type-byte || AUTHENTICATOR(enc_hmac_key, encrypted) || iv || encrypted


Decryption::

    # Given: input *authenticated* data, possibly a *chunk-id* to assert
    type-byte, mac, iv, encrypted = SPLIT(authenticated)

    ASSERT(type-byte is correct)
    ASSERT( CONSTANT-TIME-COMPARISON( mac, AUTHENTICATOR(enc_hmac_key, encrypted) ) )

    decrypted = AES-256-CTR(enc_key, 8-null-bytes || iv, encrypted)
    decompressed = decompress(decrypted)

    ASSERT( CONSTANT-TIME-COMPARISON( chunk-id, AUTHENTICATOR(id_key, decompressed) ) )

.. [#] Using the :ref:`borg key migrate-to-repokey <borg_key_migrate-to-repokey>`
       command a user can convert repositories created using Attic in "passphrase"
       mode to "repokey" mode. In this case the keys were directly derived from
       the user's passphrase at some point using PBKDF2.

       Borg does not support "passphrase" mode otherwise any more.

Offline key security
--------------------

Borg cannot secure the key material while it is running, because the keys
are needed in plain to decrypt/encrypt repository objects.

For offline storage of the encryption keys they are encrypted with a
user-chosen passphrase.

A 256 bit key encryption key (KEK) is derived from the passphrase
using PBKDF2-HMAC-SHA256 with a random 256 bit salt which is then used
to Encrypt-then-MAC a packed representation of the keys with
AES-256-CTR with a constant initialization vector of 0 (this is the
same construction used for Encryption_ with HMAC-SHA-256).

The resulting MAC is stored alongside the ciphertext, which is
converted to base64 in its entirety.

This base64 blob (commonly referred to as *keyblob*) is then stored in
the key file or in the repository config (keyfile and repokey modes
respectively).

This scheme, and specifically the use of a constant IV with the CTR
mode, is secure because an identical passphrase will result in a
different derived KEK for every encryption due to the salt.

Implementations used
--------------------

We do not implement cryptographic primitives ourselves, but rely
on widely used libraries providing them:

- AES-CTR and HMAC-SHA-256 from OpenSSL 1.0 / 1.1 are used,
  which is also linked into the static binaries we provide.
  We think this is not an additional risk, since we don't ever
  use OpenSSL's networking, TLS or X.509 code, but only their
  primitives implemented in libcrypto.
- SHA-256 and SHA-512 from Python's hashlib_ standard library module are used
- HMAC, PBKDF2 and a constant-time comparison from Python's hmac_ standard
  library module is used.
- BLAKE2b is either provided by the system's libb2, an official implementation,
  or a bundled copy of the BLAKE2 reference implementation (written in C).

Implemented cryptographic constructions are:

- Encrypt-then-MAC based on AES-256-CTR and either HMAC-SHA-256
  or keyed BLAKE2b256 as described above under Encryption_.
- HKDF_-SHA-512

.. _Horton principle: https://en.wikipedia.org/wiki/Horton_Principle
.. _HKDF: https://tools.ietf.org/html/rfc5869
.. _length extension: https://en.wikipedia.org/wiki/Length_extension_attack
.. _hashlib: https://docs.python.org/3/library/hashlib.html
.. _hmac: https://docs.python.org/3/library/hmac.html
.. _os.urandom: https://docs.python.org/3/library/os.html#os.urandom

Remote RPC protocol security
============================

.. note:: This section could be further expanded / detailed.

The RPC protocol is fundamentally based on msgpack'd messages exchanged
over an encrypted SSH channel (the system's SSH client is used for this
by piping data from/to it).

This means that the authorization and transport security properties
are inherited from SSH and the configuration of the SSH client
and the SSH server. Therefore the remainder of this section
will focus on the security of the RPC protocol within Borg.

The assumed worst-case a server can inflict to a client is a
denial of repository service.

The situation were a server can create a general DoS on the client
should be avoided, but might be possible by e.g. forcing the client to
allocate large amounts of memory to decode large messages (or messages
that merely indicate a large amount of data follows). See issue
:issue:`2139` for details.

We believe that other kinds of attacks, especially critical vulnerabilities
like remote code execution are inhibited by the design of the protocol:

1. The server cannot send requests to the client on its own accord,
   it only can send responses. This avoids "unexpected inversion of control"
   issues.
2. msgpack serialization does not allow embedding or referencing code that
   is automatically executed. Incoming messages are unpacked by the msgpack
   unpacker into native Python data structures (like tuples and dictionaries),
   which are then passed to the rest of the program.

   Additional verification of the correct form of the responses could be implemented.
3. Remote errors are presented in two forms:

   1. A simple plain-text *stderr* channel. A prefix string indicates the kind of message
      (e.g. WARNING, INFO, ERROR), which is used to suppress it according to the
      log level selected in the client.

      A server can send arbitrary log messages, which may confuse a user. However,
      log messages are only processed when server requests are in progress, therefore
      the server cannot interfere / confuse with security critical dialogue like
      the password prompt.
   2. Server-side exceptions passed over the main data channel. These follow the
      general pattern of server-sent responses and are sent instead of response data
      for a request.
docs: add edited "Cryptography in Borg" and "Remote RPC protocol security" sections The former section is a bit older (Nov 2016) and has been the piece responsible for finding CVE-2016-10099, since while writing it I wondered how the manifest was authenticated to actually be the manifest. Well. There it is ;) It has been edited to final form only recently and should now be ready for review. The latter section is new. 2017-02-14 20:40:20 +00:00
			`.. somewhat surprisingly the "bash" highlighter gives nice results with`
			`the pseudo-code notation used in the "Encryption" section.`

			`.. highlight:: bash`

			`========`
			`Security`
			`========`

			`Cryptography in Borg`
			`====================`

			`Attack model`
			`------------`

			`The attack model of Borg is that the environment of the client process`
			(e.g. ``borg create``) is trusted and the repository (server) is not. The
			`attacker has any and all access to the repository, including interactive`
			`manipulation (man-in-the-middle) for remote repositories.`

			`Furthermore the client environment is assumed to be persistent across`
			`attacks (practically this means that the security database cannot be`
			`deleted between attacks).`

			`Under these circumstances Borg guarantees that the attacker cannot`

			`1. modify the data of any archive without the client detecting the change`
			`2. rename, remove or add an archive without the client detecting the change`
			`3. recover plain-text data`
			`4. recover definite (heuristics based on access patterns are possible)`
			`structural information such as the object graph (which archives`
			`refer to what chunks)`

			`The attacker can always impose a denial of service per definition (he could`
			`forbid connections to the repository, or delete it entirely).`

			`Structural Authentication`
			`-------------------------`

			Borg is fundamentally based on an object graph structure (see :ref:`internals`),
			`where the root object is called the manifest.`

			Borg follows the `Horton principle`_, which states that
			`not only the message must be authenticated, but also its meaning (often`
			`expressed through context), because every object used is referenced by a`
			`parent object through its object ID up to the manifest. The object ID in`
			`Borg is a MAC of the object's plaintext, therefore this ensures that`
			`an attacker cannot change the context of an object without forging the MAC.`

			`In other words, the object ID itself only authenticates the plaintext of the`
			`object and not its context or meaning. The latter is established by a different`
			`object referring to an object ID, thereby assigning a particular meaning to`
			`an object. For example, an archive item contains a list of object IDs that`
			`represent packed file metadata. On their own it's not clear that these objects`
			`would represent what they do, but by the archive item referring to them`
			`in a particular part of its own data structure assigns this meaning.`

			`This results in a directed acyclic graph of authentication from the manifest`
			`to the data chunks of individual files.`

			`.. rubric:: Authenticating the manifest`

			`Since the manifest has a fixed ID (000...000) the aforementioned authentication`
			`does not apply to it, indeed, cannot apply to it; it is impossible to authenticate`
			`the root node of a DAG through its edges, since the root node has no incoming edges.`

			`With the scheme as described so far an attacker could easily replace the manifest,`
			`therefore Borg includes a tertiary authentication mechanism (TAM) that is applied`
			to the manifest since version 1.0.9 (see :ref:`tam_vuln`).

			`TAM works by deriving a separate key through HKDF_ from the other encryption and`
			`authentication keys and calculating the HMAC of the metadata to authenticate [#]_::`

			`# RANDOM(n) returns n random bytes`
			`salt = RANDOM(64)`

			`ikm = id_key \|\| enc_key \|\| enc_hmac_key`
			`# context depends on the operation, for manifest authentication it is`
			`# the ASCII string "borg-metadata-authentication-manifest".`
			`tam_key = HKDF-SHA-512(ikm, salt, context)`

			`# data is a dict-like structure`
			`data[hmac] = zeroes`
			`packed = pack(data)`
			`data[hmac] = HMAC(tam_key, packed)`
			`packed_authenticated = pack(data)`

			`Since an attacker cannot gain access to this key and also cannot make the`
			`client authenticate arbitrary data using this mechanism, the attacker is unable`
			`to forge the authentication.`

			`This effectively 'anchors' the manifest to the key, which is controlled by the`
			`client, thereby anchoring the entire DAG, making it impossible for an attacker`
			`to add, remove or modify any part of the DAG without Borg being able to detect`
			`the tampering.`

			`Note that when using BORG_PASSPHRASE the attacker cannot swap the entire`
			`repository against a new repository with e.g. repokey mode and no passphrase,`
			`because Borg will abort access when BORG_PASSPRHASE is incorrect.`

			`However, interactively a user might not notice this kind of attack`
			`immediately, if she assumes that the reason for the absent passphrase`
			prompt is a set BORG_PASSPHRASE. See issue :issue:`2169` for details.

			`.. [#] The reason why the authentication tag is stored in the packed`
			`data itself is that older Borg versions can still read the`
			`manifest this way, while a changed layout would have broken`
			`compatibility.`

			`Encryption`
			`----------`

			`Encryption is currently based on the Encrypt-then-MAC construction,`
			`which is generally seen as the most robust way to create an authenticated`
			`encryption scheme from encryption and message authentication primitives.`

			`Every operation (encryption, MAC / authentication, chunk ID derivation)`
			uses independent, random keys generated by `os.urandom`_ [#]_.

			`Borg does not support unauthenticated encryption -- only authenticated encryption`
			`schemes are supported. No unauthenticated encryption schemes will be added`
			`in the future.`

			Depending on the chosen mode (see :ref:`borg_init`) different primitives are used:

			`- The actual encryption is currently always AES-256 in CTR mode. The`
			`counter is added in plaintext, since it is needed for decryption,`
			`and is also tracked locally on the client to avoid counter reuse.`

			`- The authentication primitive is either HMAC-SHA-256 or BLAKE2b-256`
			`in a keyed mode. HMAC-SHA-256 uses 256 bit keys, while BLAKE2b-256`
			`uses 512 bit keys.`

			`The latter is secure not only because BLAKE2b itself is not`
			susceptible to `length extension`_, but also since it truncates the
			`hash output from 512 bits to 256 bits, which would make the`
			`construction safe even if BLAKE2b were broken regarding length`
			`extension or similar attacks.`

			`- The primitive used for authentication is always the same primitive`
			`that is used for deriving the chunk ID, but they are always`
			`used with independent keys.`

			`Encryption::`

			`id = AUTHENTICATOR(id_key, data)`
			`compressed = compress(data)`

			`iv = reserve_iv()`
			`encrypted = AES-256-CTR(enc_key, 8-null-bytes \|\| iv, compressed)`
			`authenticated = type-byte \|\| AUTHENTICATOR(enc_hmac_key, encrypted) \|\| iv \|\| encrypted`


			`Decryption::`

			`# Given: input authenticated data, possibly a chunk-id to assert`
			`type-byte, mac, iv, encrypted = SPLIT(authenticated)`

			`ASSERT(type-byte is correct)`
			`ASSERT( CONSTANT-TIME-COMPARISON( mac, AUTHENTICATOR(enc_hmac_key, encrypted) ) )`

			`decrypted = AES-256-CTR(enc_key, 8-null-bytes \|\| iv, encrypted)`
			`decompressed = decompress(decrypted)`

			`ASSERT( CONSTANT-TIME-COMPARISON( chunk-id, AUTHENTICATOR(id_key, decompressed) ) )`

			.. [#] Using the :ref:`borg key migrate-to-repokey <borg_key_migrate-to-repokey>`
			`command a user can convert repositories created using Attic in "passphrase"`
			`mode to "repokey" mode. In this case the keys were directly derived from`
			`the user's passphrase at some point using PBKDF2.`

			`Borg does not support "passphrase" mode otherwise any more.`

			`Offline key security`
			`--------------------`

			`Borg cannot secure the key material while it is running, because the keys`
			`are needed in plain to decrypt/encrypt repository objects.`

			`For offline storage of the encryption keys they are encrypted with a`
			`user-chosen passphrase.`

			`A 256 bit key encryption key (KEK) is derived from the passphrase`
			`using PBKDF2-HMAC-SHA256 with a random 256 bit salt which is then used`
			`to Encrypt-then-MAC a packed representation of the keys with`
			`AES-256-CTR with a constant initialization vector of 0 (this is the`
			`same construction used for Encryption_ with HMAC-SHA-256).`

			`The resulting MAC is stored alongside the ciphertext, which is`
			`converted to base64 in its entirety.`

			`This base64 blob (commonly referred to as keyblob) is then stored in`
			`the key file or in the repository config (keyfile and repokey modes`
			`respectively).`

			`This scheme, and specifically the use of a constant IV with the CTR`
			`mode, is secure because an identical passphrase will result in a`
			`different derived KEK for every encryption due to the salt.`

			`Implementations used`
			`--------------------`

			`We do not implement cryptographic primitives ourselves, but rely`
			`on widely used libraries providing them:`

			`- AES-CTR and HMAC-SHA-256 from OpenSSL 1.0 / 1.1 are used,`
			`which is also linked into the static binaries we provide.`
			`We think this is not an additional risk, since we don't ever`
			`use OpenSSL's networking, TLS or X.509 code, but only their`
			`primitives implemented in libcrypto.`
			`- SHA-256 and SHA-512 from Python's hashlib_ standard library module are used`
			`- HMAC, PBKDF2 and a constant-time comparison from Python's hmac_ standard`
			`library module is used.`
			`- BLAKE2b is either provided by the system's libb2, an official implementation,`
			`or a bundled copy of the BLAKE2 reference implementation (written in C).`

			`Implemented cryptographic constructions are:`

			`- Encrypt-then-MAC based on AES-256-CTR and either HMAC-SHA-256`
			`or keyed BLAKE2b256 as described above under Encryption_.`
			`- HKDF_-SHA-512`

			`.. _Horton principle: https://en.wikipedia.org/wiki/Horton_Principle`
			`.. _HKDF: https://tools.ietf.org/html/rfc5869`
			`.. _length extension: https://en.wikipedia.org/wiki/Length_extension_attack`
			`.. _hashlib: https://docs.python.org/3/library/hashlib.html`
			`.. _hmac: https://docs.python.org/3/library/hmac.html`
			`.. _os.urandom: https://docs.python.org/3/library/os.html#os.urandom`

			`Remote RPC protocol security`
			`============================`

			`.. note:: This section could be further expanded / detailed.`

			`The RPC protocol is fundamentally based on msgpack'd messages exchanged`
			`over an encrypted SSH channel (the system's SSH client is used for this`
			`by piping data from/to it).`

			`This means that the authorization and transport security properties`
			`are inherited from SSH and the configuration of the SSH client`
			`and the SSH server. Therefore the remainder of this section`
			`will focus on the security of the RPC protocol within Borg.`

			`The assumed worst-case a server can inflict to a client is a`
			`denial of repository service.`

			`The situation were a server can create a general DoS on the client`
			`should be avoided, but might be possible by e.g. forcing the client to`
			`allocate large amounts of memory to decode large messages (or messages`
			`that merely indicate a large amount of data follows). See issue`
			:issue:`2139` for details.

			`We believe that other kinds of attacks, especially critical vulnerabilities`
			`like remote code execution are inhibited by the design of the protocol:`

			`1. The server cannot send requests to the client on its own accord,`
			`it only can send responses. This avoids "unexpected inversion of control"`
			`issues.`
			`2. msgpack serialization does not allow embedding or referencing code that`
			`is automatically executed. Incoming messages are unpacked by the msgpack`
			`unpacker into native Python data structures (like tuples and dictionaries),`
			`which are then passed to the rest of the program.`

			`Additional verification of the correct form of the responses could be implemented.`
			`3. Remote errors are presented in two forms:`

			`1. A simple plain-text stderr channel. A prefix string indicates the kind of message`
			`(e.g. WARNING, INFO, ERROR), which is used to suppress it according to the`
			`log level selected in the client.`

			`A server can send arbitrary log messages, which may confuse a user. However,`
			`log messages are only processed when server requests are in progress, therefore`
			`the server cannot interfere / confuse with security critical dialogue like`
			`the password prompt.`
			`2. Server-side exceptions passed over the main data channel. These follow the`
			`general pattern of server-sent responses and are sent instead of response data`
			`for a request.`