mirror of
https://github.com/borgbackup/borg.git
synced 2024-12-25 01:06:50 +00:00
faq: rewrote IntegrityError answer
parent
be40e2fcfa
commit
0bcb8c2a39
2 changed files with 68 additions and 26 deletions
|
@@ -1,3 +1,6 @@
|
||||||
|
|
||||||
|
.. _important_notes:
|
||||||
|
|
||||||
Important notes
|
Important notes
|
||||||
===============
|
===============
|
||||||
|
|
||||||
|
|
87
docs/faq.rst
|
@@ -194,41 +194,80 @@ repo. It will then be able to check using CRCs and HMACs.
|
||||||
I get an IntegrityError or similar - what now?
|
I get an IntegrityError or similar - what now?
|
||||||
----------------------------------------------
|
----------------------------------------------
|
||||||
|
|
||||||
The first step should be to check whether it's a problem with the disk drive,
|
A single error does not necessarily indicate bad hardware or a Borg
|
||||||
IntegrityErrors can be a sign of drive failure or other hardware issues.
|
bug. All hardware exhibits a bit error rate (BER). Hard drives are typically
|
||||||
|
specified as exhibiting less than one error every 12 to 120 TB
|
||||||
|
(one bit error in 10^14 to 10^15 bits). The specification is often called
|
||||||
|
*unrecoverable read error rate* (URE rate).
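
As a rough sanity check of the figures above, the conversion from bits to terabytes can be sketched in shell. The helper name ``bits_to_tb`` is made up for this illustration and is not part of Borg; decimal terabytes (10^12 bytes) are assumed.

```shell
# Convert a number of bits to decimal terabytes (1 TB = 10^12 bytes).
# bits_to_tb is a hypothetical helper used only for this illustration.
bits_to_tb() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v bits="$1" 'BEGIN { printf "%.1f\n", bits / 8 / 1e12 }'
}

bits_to_tb 100000000000000     # 10^14 bits
bits_to_tb 1000000000000000    # 10^15 bits
```

The two calls print 12.5 and 125.0, which matches the rounded "12 to 120 TB" range quoted in the text.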
|
||||||
|
|
||||||
Using the smartmontools one can retrieve self-diagnostics of the drive in question
|
Apart from these very rare errors there are two main causes of errors:
|
||||||
(where the repository is located, use *findmnt*, *mount* or *lsblk* to find the
|
|
||||||
*/dev/...* path of the drive)::
|
(i) Defective hardware: described below.
|
||||||
|
(ii) Bugs in software (Borg, operating system, libraries):
|
||||||
|
Ensure software is up to date.
|
||||||
|
Check whether the issue is caused by any fixed bugs described in :ref:`important_notes`.
|
||||||
|
|
||||||
|
|
||||||
|
.. rubric:: Finding defective hardware
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Hardware diagnostics are operating system dependent and do not
|
||||||
|
apply universally. The commands shown apply to popular Unix-like
|
||||||
|
systems. Refer to your operating system's manual.
|
||||||
|
|
||||||
|
Checking hard drives
|
||||||
|
Find the drive containing the repository and use *findmnt*, *mount* or *lsblk*
|
||||||
|
to learn the device path (typically */dev/...*) of the drive.
|
||||||
|
Then, smartmontools can retrieve self-diagnostics of the drive in question::
|
||||||
|
|
||||||
# smartctl -a /dev/sdSomething
|
# smartctl -a /dev/sdSomething
|
||||||
|
|
||||||
Attributes that are a typical cause of data corruption are *Offline_Uncorrectable*,
|
The *Offline_Uncorrectable*, *Current_Pending_Sector* and *Reported_Uncorrect*
|
||||||
*Current_Pending_Sector*, *Reported_Uncorrect*. A high *UDMA_CRC_Error_Count* usually
|
attributes indicate data corruption. A high *UDMA_CRC_Error_Count* usually
|
||||||
indicates a bad cable. If the *entire drive* is failing, then all data should be copied
|
indicates a bad cable.
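
The attributes named above can be picked out of the attribute table with a small filter. This is only a sketch: ``flag_smart_attrs`` is a hypothetical helper, and it assumes the usual ATA table layout printed by ``smartctl -A``, with the attribute name in the second column and the raw value starting in the tenth. Drives that report health data in another format (NVMe, for example) will not match.

```shell
# flag_smart_attrs is a hypothetical helper, not part of smartmontools.
# It reads `smartctl -A` output on stdin and prints the watched attributes
# whose raw value is non-zero.
# Assumption: attribute name in column 2, raw value starting in column 10.
flag_smart_attrs() {
    awk '
        $2 ~ /^(Offline_Uncorrectable|Current_Pending_Sector|Reported_Uncorrect|UDMA_CRC_Error_Count)$/ && $10 + 0 > 0 {
            print $2 ": " $10
        }
    '
}

# Usage (as root; /dev/sdSomething is the placeholder used in the text):
#   smartctl -A /dev/sdSomething | flag_smart_attrs
```

An empty result only means these counters are zero. It does not prove the drive is healthy.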
|
||||||
off it as soon as possible.
|
|
||||||
|
|
||||||
Some drives log IO errors, which are also logged by the system (refer to the journal/dmesg).
|
I/O errors logged by the system (refer to the system journal or
|
||||||
IO errors that impact only the filesystem can go unnoticed, since they are not reported
|
dmesg) can point to issues as well. I/O errors only affecting the
|
||||||
to applications (e.g. Borg), but can still corrupt data.
|
file system easily go unnoticed, since they are not reported to
|
||||||
|
applications (e.g. Borg), while these errors can still corrupt data.
|
||||||
|
|
||||||
If any of these are suspicious, a self-test is recommended::
|
Drives can corrupt some sectors in one event, while remaining
|
||||||
|
reliable otherwise. Conversely, drives can fail completely with no
|
||||||
|
advance warning. If in doubt, copy all data from the drive in
|
||||||
|
question to another drive -- just in case it fails completely.
|
||||||
|
|
||||||
|
If any of these are suspicious, a self-test is recommended::
|
||||||
|
|
||||||
# smartctl -t long /dev/sdSomething
|
# smartctl -t long /dev/sdSomething
|
||||||
|
|
||||||
Running ``fsck`` if not done already might yield further insights.
|
Running ``fsck`` if not done already might yield further insights.
|
||||||
|
|
||||||
|
Checking memory
|
||||||
|
Intermittent issues, such as ``borg check`` finding errors
|
||||||
|
inconsistently between runs, are frequently caused by bad memory.
|
||||||
|
|
||||||
|
Run memtest86+ (or an equivalent memory tester) to verify that
|
||||||
|
the memory subsystem is operating correctly.
|
||||||
|
|
||||||
|
Checking processors
|
||||||
|
Processors rarely cause errors. If they do, they are usually overclocked
|
||||||
|
or otherwise operated outside their specifications. We do not recommend
|
||||||
|
operating hardware outside its specifications for production use.
|
||||||
|
|
||||||
|
Tools to verify correct processor operation include Prime95 (mprime), linpack,
|
||||||
|
and the `Intel Processor Diagnostic Tool
|
||||||
|
<https://downloadcenter.intel.com/download/19792/Intel-Processor-Diagnostic-Tool>`_
|
||||||
|
(applies only to Intel processors).
|
||||||
|
|
||||||
|
.. rubric:: Repairing a damaged repository
|
||||||
|
|
||||||
|
With any defective hardware found and replaced, the damage done to the repository
|
||||||
|
needs to be ascertained and fixed.
|
||||||
|
|
||||||
:ref:`borg_check` provides diagnostics and ``--repair`` options for repositories with
|
:ref:`borg_check` provides diagnostics and ``--repair`` options for repositories with
|
||||||
issues. We recommend to first run without ``--repair`` to assess the situation and
|
issues. We recommend first running without ``--repair`` to assess the situation.
|
||||||
if the found issues / proposed repairs sound right re-run it with ``--repair`` enabled.
|
If the found issues and proposed repairs seem right, re-run ``borg check`` with ``--repair`` enabled.
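
The assess-first approach described above could be wrapped as follows. This is a minimal sketch: ``assess_repo`` is a hypothetical wrapper rather than a Borg command, and it assumes that ``borg check`` exits with a non-zero status when it finds problems. It deliberately never runs ``--repair`` itself.

```shell
# assess_repo is a hypothetical wrapper, not a Borg command. It runs a
# read-only check and leaves the decision to repair to the operator.
# Assumption: `borg check` exits non-zero when it finds problems.
assess_repo() {
    repo="$1"
    if borg check "$repo"; then
        echo "no problems reported for $repo"
    else
        echo "borg check reported problems for $repo" >&2
        echo "review the output, then consider: borg check --repair $repo" >&2
        return 1
    fi
}

# Usage (the path is a placeholder):
#   assess_repo /path/to/repo
```

Keeping the repair step manual gives you a chance to read the diagnostics, and to copy the repository elsewhere, before anything is modified.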
|
||||||
|
|
||||||
When errors are intermittent the cause might be bad memory, running memtest86+ or a similar
|
|
||||||
test is recommended.
|
|
||||||
|
|
||||||
A single error does not indicate bad hardware or a Borg bug -- all hardware has a certain
|
|
||||||
bit error rate (BER), for hard drives this is typically specified as less than one error
|
|
||||||
every 12 to 120 TB (one bit error in 10^14 to 10^15 bits) and often called
|
|
||||||
*unrecoverable read error rate* (URE rate).
|
|
||||||
|
|
||||||
Security
|
Security
|
||||||
########
|
########
|
||||||
|
|