Merge pull request #2529 from enkore/faq/errors

faq: I get an IntegrityError or similar - what now?
2025-03-10 14:15:43 +00:00 · 2017-05-22 20:34:59 +02:00 · 2017-05-22 20:34:59 +02:00 · decee5389b
commit decee5389b
parent a430fd5d1f 0bcb8c2a39
2 changed files with 85 additions and 0 deletions
--- a/docs/changes.rst
+++ b/docs/changes.rst
@@ -1,3 +1,6 @@
+
+.. _important_notes:
+
 Important notes
 ===============

--- a/docs/faq.rst
+++ b/docs/faq.rst
@@ -108,6 +108,9 @@ Are there other known limitations?
  An easy workaround is to create multiple archives with fewer items each.
  See also the :ref:`archive_limitation` and :issue:`1452`.

+  :ref:`borg_info` shows how large (relative to the maximum size) existing
+  archives are.
+
 Why is my backup bigger than with attic?
 ----------------------------------------

@@ -186,6 +189,85 @@ Yes, if you want to detect accidental data damage (like bit rot), use the
 If you want to be able to detect malicious tampering also, use an encrypted
 repo. It will then be able to check using CRCs and HMACs.

+.. _faq-integrityerror:
+
+I get an IntegrityError or similar - what now?
+----------------------------------------------
+
+A single error does not necessarily indicate bad hardware or a Borg
+bug. All hardware exhibits a bit error rate (BER). Hard drives are typically
+specified as exhibiting less than one error every 12 to 120 TB
+(one bit error in 10^14 to 10^15 bits). The specification is often called
+*unrecoverable read error rate* (URE rate).
+
+Apart from these very rare errors there are two main causes of errors:
+
+(i) Defective hardware: described below.
+(ii) Bugs in software (Borg, operating system, libraries):
+     Ensure software is up to date.
+     Check whether the issue is caused by any fixed bugs described in :ref:`important_notes`.
+
+
+.. rubric:: Finding defective hardware
+
+.. note::
+
+   Hardware diagnostics are operating system dependent and do not
+   apply universally. The commands shown apply to popular Unix-like
+   systems. Refer to your operating system's manual.
+
+Checking hard drives
+  Find the drive containing the repository and use *findmnt*, *mount* or *lsblk*
+  to learn the device path (typically */dev/...*) of the drive.
+  Then, smartmontools can retrieve self-diagnostics of the drive in question::
+
+      # smartctl -a /dev/sdSomething
+
+  The *Offline_Uncorrectable*, *Current_Pending_Sector* and *Reported_Uncorrect*
+  attributes indicate data corruption. A high *UDMA_CRC_Error_Count* usually
+  indicates a bad cable.
+
+  I/O errors logged by the system (refer to the system journal or
+  dmesg) can point to issues as well. I/O errors that only affect the
+  file system easily go unnoticed, since they are not reported to
+  applications (e.g. Borg), yet they can still corrupt data.
+
+  Drives can corrupt some sectors in one event, while remaining
+  reliable otherwise. Conversely, drives can fail completely with no
+  advance warning. If in doubt, copy all data from the drive in
+  question to another drive -- just in case it fails completely.
+
+  If any of these indicators (SMART attributes or logged I/O errors) look suspicious, a long self-test is recommended::
+
+      # smartctl -t long /dev/sdSomething
+
+  Running ``fsck``, if not done already, might yield further insights.
+
+Checking memory
+  Intermittent issues, such as ``borg check`` finding errors
+  inconsistently between runs, are frequently caused by bad memory.
+
+  Run memtest86+ (or an equivalent memory tester) to verify that
+  the memory subsystem is operating correctly.
+
+Checking processors
+  Processors rarely cause errors. If they do, they are usually overclocked
+  or otherwise operated outside their specifications. We do not recommend
+  operating hardware outside its specifications for production use.
+
+  Tools to verify correct processor operation include Prime95 (mprime), linpack,
+  and the `Intel Processor Diagnostic Tool
+  <https://downloadcenter.intel.com/download/19792/Intel-Processor-Diagnostic-Tool>`_
+  (applies only to Intel processors).
+
+.. rubric:: Repairing a damaged repository
+
+Once any defective hardware has been found and replaced, the damage done to the
+repository needs to be assessed and repaired.
+
+:ref:`borg_check` provides diagnostics and a ``--repair`` option for repositories with
+issues. We recommend first running it without ``--repair`` to assess the situation.
+If the issues found and the proposed repairs seem right, re-run ``borg check`` with ``--repair`` enabled.

 Security
 ########