From be40e2fcfab5518f7ed4b18e1ab0eae2f59f9e54 Mon Sep 17 00:00:00 2001
From: Marian Beermann
Date: Thu, 2 Mar 2017 13:54:38 +0100
Subject: [PATCH 1/2] faq: I get an IntegrityError or similar - what now?

---
 docs/faq.rst | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/docs/faq.rst b/docs/faq.rst
index a56c87c34..ae93039a4 100644
--- a/docs/faq.rst
+++ b/docs/faq.rst
@@ -108,6 +108,9 @@ Are there other known limitations?
   An easy workaround is to create multiple archives with less items each.
   See also the :ref:`archive_limitation` and :issue:`1452`.
 
+  :ref:`borg_info` shows how large (relative to the maximum size) existing
+  archives are.
+
 Why is my backup bigger than with attic?
 ----------------------------------------
 
@@ -186,6 +189,46 @@ Yes, if you want to detect accidental data damage (like bit rot), use the
 If you want to be able to detect malicious tampering also, use an encrypted
 repo. It will then be able to check using CRCs and HMACs.
 
+.. _faq-integrityerror:
+
+I get an IntegrityError or similar - what now?
+----------------------------------------------
+
+The first step should be to check whether it's a problem with the disk drive,
+IntegrityErrors can be a sign of drive failure or other hardware issues.
+
+Using the smartmontools one can retrieve self-diagnostics of the drive in question
+(where the repository is located, use *findmnt*, *mount* or *lsblk* to find the
+*/dev/...* path of the drive)::
+
+    # smartctl -a /dev/sdSomething
+
+Attributes that are a typical cause of data corruption are *Offline_Uncorrectable*,
+*Current_Pending_Sector*, *Reported_Uncorrect*. A high *UDMA_CRC_Error_Count* usually
+indicates a bad cable. If the *entire drive* is failing, then all data should be copied
+off it as soon as possible.
+
+Some drives log IO errors, which are also logged by the system (refer to the journal/dmesg).
+IO errors that impact only the filesystem can go unnoticed, since they are not reported
+to applications (e.g. Borg), but can still corrupt data.
+
+If any of these are suspicious, a self-test is recommended::
+
+    # smartctl -t long /dev/sdSomething
+
+Running ``fsck`` if not done already might yield further insights.
+
+:ref:`borg_check` provides diagnostics and ``--repair`` options for repositories with
+issues. We recommend to first run without ``--repair`` to assess the situation and
+if the found issues / proposed repairs sound right re-run it with ``--repair`` enabled.
+
+When errors are intermittent the cause might be bad memory, running memtest86+ or a similar
+test is recommended.
+
+A single error does not indicate bad hardware or a Borg bug -- all hardware has a certain
+bit error rate (BER), for hard drives this is typically specified as less than one error
+every 12 to 120 TB (one bit error in 10e14 to 10e15 bits) and often called
+*unrecoverable read error rate* (URE rate).
 
 Security
 ########

From 0bcb8c2a3908029d255701a647b523ba29d4e155 Mon Sep 17 00:00:00 2001
From: Marian Beermann
Date: Fri, 19 May 2017 18:40:59 +0200
Subject: [PATCH 2/2] faq: rewrote IntegrityError answer

---
 docs/changes.rst |  3 ++
 docs/faq.rst     | 91 ++++++++++++++++++++++++++++++++++--------------
 2 files changed, 68 insertions(+), 26 deletions(-)

diff --git a/docs/changes.rst b/docs/changes.rst
index d8d9a3519..e52b23e5a 100644
--- a/docs/changes.rst
+++ b/docs/changes.rst
@@ -1,3 +1,6 @@
+
+.. _important_notes:
+
 Important notes
 ===============
 
diff --git a/docs/faq.rst b/docs/faq.rst
index ae93039a4..8424d91f5 100644
--- a/docs/faq.rst
+++ b/docs/faq.rst
@@ -194,41 +194,80 @@ repo. It will then be able to check using CRCs and HMACs.
 I get an IntegrityError or similar - what now?
 ----------------------------------------------
 
-The first step should be to check whether it's a problem with the disk drive,
-IntegrityErrors can be a sign of drive failure or other hardware issues.
+A single error does not necessarily indicate bad hardware or a Borg
+bug. All hardware exhibits a bit error rate (BER). Hard drives are typically
+specified as exhibiting less than one error every 12 to 120 TB
+(one bit error in 10e14 to 10e15 bits). The specification is often called
+*unrecoverable read error rate* (URE rate).
 
-Using the smartmontools one can retrieve self-diagnostics of the drive in question
-(where the repository is located, use *findmnt*, *mount* or *lsblk* to find the
-*/dev/...* path of the drive)::
+Apart from these very rare errors there are two main causes of errors:
 
-    # smartctl -a /dev/sdSomething
+(i) Defective hardware: described below.
+(ii) Bugs in software (Borg, operating system, libraries):
+     Ensure software is up to date.
+     Check whether the issue is caused by any fixed bugs described in :ref:`important_notes`.
 
-Attributes that are a typical cause of data corruption are *Offline_Uncorrectable*,
-*Current_Pending_Sector*, *Reported_Uncorrect*. A high *UDMA_CRC_Error_Count* usually
-indicates a bad cable. If the *entire drive* is failing, then all data should be copied
-off it as soon as possible.
 
-Some drives log IO errors, which are also logged by the system (refer to the journal/dmesg).
-IO errors that impact only the filesystem can go unnoticed, since they are not reported
-to applications (e.g. Borg), but can still corrupt data.
+.. rubric:: Finding defective hardware
 
-If any of these are suspicious, a self-test is recommended::
+.. note::
 
-    # smartctl -t long /dev/sdSomething
+    Hardware diagnostics are operating system dependent and do not
+    apply universally. The commands shown apply for popular Unix-like
+    systems. Refer to your operating system's manual.
 
-Running ``fsck`` if not done already might yield further insights.
+Checking hard drives
+  Find the drive containing the repository and use *findmnt*, *mount* or *lsblk*
+  to learn the device path (typically */dev/...*) of the drive.
+  Then, smartmontools can retrieve self-diagnostics of the drive in question::
+
+      # smartctl -a /dev/sdSomething
+
+  The *Offline_Uncorrectable*, *Current_Pending_Sector* and *Reported_Uncorrect*
+  attributes indicate data corruption. A high *UDMA_CRC_Error_Count* usually
+  indicates a bad cable.
+
+  I/O errors logged by the system (refer to the system journal or
+  dmesg) can point to issues as well. I/O errors only affecting the
+  file system easily go unnoticed, since they are not reported to
+  applications (e.g. Borg), while these errors can still corrupt data.
+
+  Drives can corrupt some sectors in one event, while remaining
+  reliable otherwise. Conversely, drives can fail completely with no
+  advance warning. If in doubt, copy all data from the drive in
+  question to another drive -- just in case it fails completely.
+
+  If any of these are suspicious, a self-test is recommended::
+
+      # smartctl -t long /dev/sdSomething
+
+  Running ``fsck`` if not done already might yield further insights.
+
+Checking memory
+  Intermittent issues, such as ``borg check`` finding errors
+  inconsistently between runs, are frequently caused by bad memory.
+
+  Run memtest86+ (or an equivalent memory tester) to verify that
+  the memory subsystem is operating correctly.
+
+Checking processors
+  Processors rarely cause errors. If they do, they are usually overclocked
+  or otherwise operated outside their specifications. We do not recommend to
+  operate hardware outside its specifications for productive use.
+
+  Tools to verify correct processor operation include Prime95 (mprime), linpack,
+  and the `Intel Processor Diagnostic Tool
+  `_
+  (applies only to Intel processors).
+
+.. rubric:: Repairing a damaged repository
+
+With any defective hardware found and replaced, the damage done to the repository
+needs to be ascertained and fixed.
 
 :ref:`borg_check` provides diagnostics and ``--repair`` options for repositories with
-issues. We recommend to first run without ``--repair`` to assess the situation and
-if the found issues / proposed repairs sound right re-run it with ``--repair`` enabled.
-
-When errors are intermittent the cause might be bad memory, running memtest86+ or a similar
-test is recommended.
-
-A single error does not indicate bad hardware or a Borg bug -- all hardware has a certain
-bit error rate (BER), for hard drives this is typically specified as less than one error
-every 12 to 120 TB (one bit error in 10e14 to 10e15 bits) and often called
-*unrecoverable read error rate* (URE rate).
+issues. We recommend to first run without ``--repair`` to assess the situation.
+If the found issues and proposed repairs seem right, re-run "check" with ``--repair`` enabled.
 
 Security
 ########