mirror of https://github.com/borgbackup/borg.git synced 2025-03-10 14:15:43 +00:00

Merge pull request #2529 from enkore/faq/errors

faq: I get an IntegrityError or similar - what now?
enkore 2017-05-22 20:34:59 +02:00 committed by GitHub
commit decee5389b
2 changed files with 85 additions and 0 deletions


@ -1,3 +1,6 @@
.. _important_notes:
Important notes
===============


@ -108,6 +108,9 @@ Are there other known limitations?
An easy workaround is to create multiple archives with fewer items each.
See also the :ref:`archive_limitation` and :issue:`1452`.
:ref:`borg_info` shows how large (relative to the maximum size) existing
archives are.
Why is my backup bigger than with attic?
----------------------------------------
@ -186,6 +189,85 @@ Yes, if you want to detect accidental data damage (like bit rot), use the
If you want to be able to detect malicious tampering also, use an encrypted
repo. It will then be able to check using CRCs and HMACs.
.. _faq-integrityerror:
I get an IntegrityError or similar - what now?
----------------------------------------------
A single error does not necessarily indicate bad hardware or a Borg
bug. All hardware exhibits a bit error rate (BER). Hard drives are typically
specified as exhibiting fewer than one error every 12 to 120 TB
(one bit error in 10^14 to 10^15 bits). The specification is often called
*unrecoverable read error rate* (URE rate).
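As a rough check of these figures, the following Python sketch converts a URE specification of one error per N bits into the equivalent amount of data read. It assumes decimal terabytes (1 TB = 10^12 bytes); the function name is illustrative only and not part of Borg.

```python
def bits_to_terabytes(bits: int) -> float:
    """Convert a number of bits into decimal terabytes (1 TB = 10**12 bytes)."""
    return bits / 8 / 10**12


# One unrecoverable read error per 10**14 bits and per 10**15 bits.
for exponent in (14, 15):
    terabytes = bits_to_terabytes(10**exponent)
    print(f"1 error per 10^{exponent} bits = 1 error per {terabytes} TB read")
```

This gives 12.5 TB and 125 TB, which matches the rounded 12 to 120 TB range quoted above.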
Apart from these very rare errors there are two main causes of errors:
(i) Defective hardware: described below.
(ii) Bugs in software (Borg, operating system, libraries):
Ensure software is up to date.
Check whether the issue is caused by any fixed bugs described in :ref:`important_notes`.
.. rubric:: Finding defective hardware
.. note::
Hardware diagnostics are operating system dependent and do not
apply universally. The commands shown apply to popular Unix-like
systems. Refer to your operating system's manual.
Checking hard drives
Find the drive containing the repository and use *findmnt*, *mount* or *lsblk*
to learn the device path (typically */dev/...*) of the drive.
Then, smartmontools can retrieve self-diagnostics of the drive in question::

    # smartctl -a /dev/sdSomething

The *Offline_Uncorrectable*, *Current_Pending_Sector* and *Reported_Uncorrect*
attributes indicate data corruption. A high *UDMA_CRC_Error_Count* usually
indicates a bad cable.
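The attribute check can be automated along the following lines. This is a minimal sketch, assuming the common ATA attribute table layout printed by ``smartctl -a`` (attribute name in the second column, raw value in the tenth). The function name and the sample text are made up for illustration, and what counts as a "high" *UDMA_CRC_Error_Count* is left to the reader: the sketch simply reports any non-zero raw value.

```python
import re

# Attributes named in the text as indicators of data corruption.
CORRUPTION_ATTRIBUTES = (
    "Offline_Uncorrectable",
    "Current_Pending_Sector",
    "Reported_Uncorrect",
)
# Attribute named in the text as a usual indicator of a bad cable.
CABLE_ATTRIBUTE = "UDMA_CRC_Error_Count"


def nonzero_attributes(smartctl_output: str) -> dict:
    """Return {attribute name: raw value} for watched attributes with a non-zero raw value."""
    watched = CORRUPTION_ATTRIBUTES + (CABLE_ATTRIBUTE,)
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Assumed layout: at least ten columns, name in column 2, raw value in column 10.
        if len(fields) < 10 or fields[1] not in watched:
            continue
        match = re.match(r"\d+", fields[9])
        if match and int(match.group()) > 0:
            found[fields[1]] = int(match.group())
    return found


# Made-up sample in the assumed layout, not output from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

print(nonzero_attributes(SAMPLE))
# prints {'Reported_Uncorrect': 3, 'UDMA_CRC_Error_Count': 12}
```

Drives that report attributes in a different layout (for example NVMe devices) would need a different parser.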
I/O errors logged by the system (refer to the system journal or
dmesg) can point to issues as well. I/O errors only affecting the
file system easily go unnoticed, since they are not reported to
applications (e.g. Borg), while these errors can still corrupt data.
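Because such errors are easy to miss, it can help to filter the kernel log for them. The sketch below is one possible approach: it assumes that the relevant messages contain the phrase "I/O error", as Linux block-layer messages commonly do, and that the log text has already been captured (for example from ``dmesg``, which may require elevated privileges). The function name and sample lines are hypothetical.

```python
import re

# Assumption: relevant kernel messages contain the phrase "I/O error".
IO_ERROR = re.compile(r"I/O error", re.IGNORECASE)


def io_error_lines(log_text: str) -> list:
    """Return the log lines that mention an I/O error, in their original order."""
    return [line for line in log_text.splitlines() if IO_ERROR.search(line)]


# Hypothetical log excerpt for illustration only.
SAMPLE_LOG = """\
[  12.345678] EXT4-fs (sda1): mounted filesystem with ordered data mode
[ 901.000001] blk_update_request: I/O error, dev sda, sector 123456
[ 901.000002] Buffer I/O error on dev sda1, logical block 15432
"""

for line in io_error_lines(SAMPLE_LOG):
    print(line)
```

An empty result does not prove the drive is healthy; it only means no matching message was found in the text that was searched.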
Drives can corrupt some sectors in one event, while remaining
reliable otherwise. Conversely, drives can fail completely with no
advance warning. If in doubt, copy all data from the drive in
question to another drive -- just in case it fails completely.
If any of these indicators look suspicious, a long self-test is recommended::

    # smartctl -t long /dev/sdSomething

Running ``fsck``, if not done already, might yield further insights.
Checking memory
Intermittent issues, such as ``borg check`` finding errors
inconsistently between runs, are frequently caused by bad memory.
Run memtest86+ (or an equivalent memory tester) to verify that
the memory subsystem is operating correctly.
Checking processors
Processors rarely cause errors. If they do, they are usually overclocked
or otherwise operated outside their specifications. We do not recommend
operating hardware outside its specifications for production use.
Tools to verify correct processor operation include Prime95 (mprime), linpack,
and the `Intel Processor Diagnostic Tool
<https://downloadcenter.intel.com/download/19792/Intel-Processor-Diagnostic-Tool>`_
(applies only to Intel processors).
.. rubric:: Repairing a damaged repository
With any defective hardware found and replaced, the damage done to the repository
needs to be ascertained and fixed.
:ref:`borg_check` provides diagnostics and ``--repair`` options for repositories with
issues. We recommend running it without ``--repair`` first to assess the situation.
If the issues found and the proposed repairs seem right, re-run ``borg check`` with ``--repair`` enabled.
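The two-step procedure could be scripted roughly as follows. This is a sketch only: the helper names and the repository path are placeholders, and the repair step is deliberately left as a separate manual decision, not run automatically.

```python
import subprocess


def check_command(repository: str, repair: bool = False) -> list:
    """Build the argument list for ``borg check``, optionally with ``--repair``."""
    command = ["borg", "check"]
    if repair:
        command.append("--repair")
    command.append(repository)
    return command


def diagnose(repository: str) -> int:
    """Run a check without ``--repair`` and return borg's exit code (0 means success)."""
    return subprocess.run(check_command(repository)).returncode


# Placeholder path; replace with the actual repository location.
print(" ".join(check_command("/path/to/repo")))
print(" ".join(check_command("/path/to/repo", repair=True)))
```

Calling ``diagnose("/path/to/repo")`` would perform the read-only pass. Only after reviewing its output would the second command, with ``--repair``, be run by hand.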
Security
########