faq: rewrote IntegrityError answer

2024-12-25 09:19:31 +00:00 · 2017-05-19 18:40:59 +02:00 · 2017-05-19 18:40:59 +02:00 · 0bcb8c2a39
commit 0bcb8c2a39
parent be40e2fcfa
2 changed files with 68 additions and 26 deletions
--- a/docs/changes.rst
+++ b/docs/changes.rst
@@ -1,3 +1,6 @@
 .. _important_notes:
 Important notes
 ===============
--- a/docs/faq.rst
+++ b/docs/faq.rst
@@ -194,23 +194,48 @@ repo. It will then be able to check using CRCs and HMACs.
 I get an IntegrityError or similar - what now?
 ----------------------------------------------
-The first step should be to check whether it's a problem with the disk drive,
+A single error does not necessarily indicate bad hardware or a Borg
-IntegrityErrors can be a sign of drive failure or other hardware issues.
+bug. All hardware exhibits a bit error rate (BER). Hard drives are typically
 specified as exhibiting less than one error every 12 to 120 TB
 (one bit error in 10e14 to 10e15 bits). The specification is often called
 *unrecoverable read error rate* (URE rate).
-Using the smartmontools one can retrieve self-diagnostics of the drive in question
+Apart from these very rare errors there are two main causes of errors:
-(where the repository is located, use *findmnt*, *mount* or *lsblk* to find the
+
-*/dev/...* path of the drive)::
+(i) Defective hardware: described below.
 (ii) Bugs in software (Borg, operating system, libraries):
     Ensure software is up to date.
     Check whether the issue is caused by any fixed bugs described in :ref:`important_notes`.
 .. rubric:: Finding defective hardware
 .. note::
   Hardware diagnostics are operating system dependent and do not
   apply universally. The commands shown apply for popular Unix-like
   systems. Refer to your operating system's manual.
 Checking hard drives
  Find the drive containing the repository and use *findmnt*, *mount* or *lsblk*
  to learn the device path (typically */dev/...*) of the drive.
  Then, smartmontools can retrieve self-diagnostics of the drive in question::
      # smartctl -a /dev/sdSomething
-Attributes that are a typical cause of data corruption are *Offline_Uncorrectable*,
+  The *Offline_Uncorrectable*, *Current_Pending_Sector* and *Reported_Uncorrect*
-*Current_Pending_Sector*, *Reported_Uncorrect*. A high *UDMA_CRC_Error_Count* usually
+  attributes indicate data corruption. A high *UDMA_CRC_Error_Count* usually
-indicates a bad cable. If the *entire drive* is failing, then all data should be copied
+  indicates a bad cable.
 off it as soon as possible.
-Some drives log IO errors, which are also logged by the system (refer to the journal/dmesg).
+  I/O errors logged by the system (refer to the system journal or
-IO errors that impact only the filesystem can go unnoticed, since they are not reported
+  dmesg) can point to issues as well. I/O errors only affecting the
-to applications (e.g. Borg), but can still corrupt data.
+  file system easily go unnoticed, since they are not reported to
  applications (e.g. Borg), while these errors can still corrupt data.
  Drives can corrupt some sectors in one event, while remaining
  reliable otherwise. Conversely, drives can fail completely with no
  advance warning. If in doubt, copy all data from the drive in
  question to another drive -- just in case it fails completely.
  If any of these are suspicious, a self-test is recommended::
@@ -218,17 +243,31 @@ If any of these are suspicious, a self-test is recommended::
  Running ``fsck`` if not done already might yield further insights.
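
The attribute check described under "Checking hard drives" could be scripted roughly as follows. This is a minimal sketch, not part of the commit or of Borg: the function name ``suspicious_attributes`` is hypothetical, the row layout (attribute name in the second column, raw value in the tenth) is an assumption about smartctl's ATA attribute table, and the sample rows are invented for illustration.

```python
# Attributes the FAQ text names as indicators of corruption or cabling issues.
WATCHED = (
    "Offline_Uncorrectable",
    "Current_Pending_Sector",
    "Reported_Uncorrect",
    "UDMA_CRC_Error_Count",
)


def suspicious_attributes(smartctl_output: str) -> dict:
    """Return the watched attributes whose raw value is non-zero."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Assumed row layout:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or fields[1] not in WATCHED:
            continue
        try:
            raw = int(fields[9])
        except ValueError:
            continue  # raw value in a format this sketch does not handle
        if raw != 0:
            found[fields[1]] = raw
    return found


# Invented sample rows for illustration only; not output from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
"""

if __name__ == "__main__":
    print(suspicious_attributes(SAMPLE))
```

The sketch flags any non-zero raw value; deciding what counts as a "high" *UDMA_CRC_Error_Count* is left to the reader. In practice the input text would come from ``smartctl -a`` run against the device path found earlier.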
 Checking memory
  Intermittent issues, such as ``borg check`` finding errors
  inconsistently between runs, are frequently caused by bad memory.
  Run memtest86+ (or an equivalent memory tester) to verify that
  the memory subsystem is operating correctly.
 Checking processors
  Processors rarely cause errors. If they do, they are usually overclocked
  or otherwise operated outside their specifications. We do not recommend to
  operate hardware outside its specifications for productive use.
  Tools to verify correct processor operation include Prime95 (mprime), linpack,
  and the `Intel Processor Diagnostic Tool
  <https://downloadcenter.intel.com/download/19792/Intel-Processor-Diagnostic-Tool>`_
  (applies only to Intel processors).
 .. rubric:: Repairing a damaged repository
 With any defective hardware found and replaced, the damage done to the repository
 needs to be ascertained and fixed.
 :ref:`borg_check` provides diagnostics and ``--repair`` options for repositories with
-issues. We recommend to first run without ``--repair`` to assess the situation and
+issues. We recommend to first run without ``--repair`` to assess the situation.
-if the found issues / proposed repairs sound right re-run it with ``--repair`` enabled.
+If the found issues and proposed repairs seem right, re-run "check" with ``--repair`` enabled.
 When errors are intermittent the cause might be bad memory, running memtest86+ or a similar
 test is recommended.
 A single error does not indicate bad hardware or a Borg bug -- all hardware has a certain
 bit error rate (BER), for hard drives this is typically specified as less than one error
 every 12 to 120 TB (one bit error in 10e14 to 10e15 bits) and often called
 *unrecoverable read error rate* (URE rate).
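
The "12 to 120 TB" figures follow from simple unit conversion, which the sketch below works through. It assumes that "10e14" and "10e15" in the text mean 10^14 and 10^15 bits, and that a terabyte is 10^12 bytes; the function names are hypothetical and only serve the illustration.

```python
# Convert a URE specification ("one bit error per N bits read") into the
# amount of data read, on average, per expected error.
BITS_PER_BYTE = 8
BYTES_PER_TB = 10**12  # decimal terabyte (assumption, see lead-in)


def tb_per_error(bits_per_error: int) -> float:
    """Return terabytes read per expected unrecoverable bit error."""
    return bits_per_error / BITS_PER_BYTE / BYTES_PER_TB


def expected_errors(tb_read: float, bits_per_error: int) -> float:
    """Return the expected number of bit errors when reading tb_read TB."""
    return tb_read * BYTES_PER_TB * BITS_PER_BYTE / bits_per_error


if __name__ == "__main__":
    for exponent in (14, 15):
        print(f"1 error per 10^{exponent} bits = 1 error per "
              f"{tb_per_error(10**exponent):g} TB")
    print(f"Reading 4 TB at 1 in 10^14: "
          f"{expected_errors(4, 10**14):.2f} expected errors")
```

Under these assumptions the specification range works out to 12.5 to 125 TB per error, which the text rounds to "12 to 120 TB". Reading a 4 TB repository once at the worse end of that range gives an expected 0.32 errors, so an isolated error is plausible even on healthy hardware.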
 Security
 ########