linux-ext4 - [Bug 200753] write I/O error for inode structure leads to operation failure without any warning or error

Message-ID: <bug-200753-13602-LJwn6RdXVd@https.bugzilla.kernel.org/>
Date:   Tue, 07 Aug 2018 03:33:48 +0000
From:   bugzilla-daemon@...zilla.kernel.org
To:     linux-ext4@...nel.org
Subject: [Bug 200753] write I/O error for inode structure leads to operation
 failure without any warning or error

https://bugzilla.kernel.org/show_bug.cgi?id=200753

--- Comment #9 from Theodore Tso (tytso@....edu) ---
For writes, userspace must call fsync(2) and check the error returns from
fsync(2), write(2), and close(2), if it wants to be sure of catching the error
report.   (For some remote file systems, such as AFS, the error reporting a
quota overflow happens on close, since it's only on the close that the file is
sent to the server.)
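
As a rough illustration of the checking described above, here is a minimal
sketch in C. The helper name write_file_durably() is hypothetical and the
error handling is simplified; the point is only that each of write(2),
fsync(2), and close(2) has its return value checked.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from ext4 or e2fsprogs): write len bytes to
 * path and return 0 only if write(2), fsync(2), and close(2) all
 * succeeded. On failure, return the negated errno of the failing call.
 */
int write_file_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the write */
            int err = errno;
            close(fd);
            return -err;
        }
        p += n;                     /* short write: advance and loop */
        left -= (size_t)n;
    }

    /* Ask the kernel to push the data and inode changes to the device. */
    if (fsync(fd) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    /* Some file systems only report certain errors at close time. */
    if (close(fd) < 0)
        return -errno;

    return 0;
}
```

A caller would treat any nonzero return as "the data may not be on disk".
For a newly created file, the fsync(2) man page notes that the directory
entry may also need an fsync on the parent directory; that step is left out
of this sketch.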

If you don't call fsync(2), the kernel may not even attempt to write the data
blocks (or changes to the inode) to disk before the userspace program exits, so
there is no guarantee that there would be any opportunity for the system to
even *notice* that there is a problem.

Also, as far as metadata blocks (such as inode table blocks), what's generally
important is whether they are successfully written to the journal.   That's
because in real life there are two cases where we have errors the *vast*
majority of the time.  (a) The device has disappeared on us, because it's been
unplugged from the computer or the last fibre channel connection between the
computer and the disk has been lost, etc.   (b)  There is a media error.

For (a) so long as the writes have made it to the journal, that's what is
important.  If the disk has disappeared, then when it comes back, we will
replay the journal, and the inode table updates will be written.

For (b), in general with modern storage devices, there is a bad block
replacement pool, and writes will use a newly allocated block from the bad
block sparing pool if there is a problem with the recording media, and this is
transparent to the host software.

How you are modelling errors by using a device-mapper target to force-fail
certain blocks permanently might reflect how disks behaved in the early 1980's
on PC's (e.g., pre-IDE and pre-SATA), but doesn't reflect how storage devices
behave today.

One could argue that ext4 should do the right thing even when using hardware
which is 35+ years old.  The problem is, for example, if we forced the disk to
actually try to persist writes after each inode update in fsck, we would
destroy performance.   You can try to simulate this by hacking e2fsck to force the
use of O_DIRECT reads and writes (which eliminate buffering, so each read and
write call results in a synchronous I/O request to the device).   You will find
that the results are not pretty.  Hence, trading off a performance disaster to
make some academic who is writing a paper about whether or not file systems
handle artificial I/O error injections that do not comport with reality is
really not something I'm particularly interested in.....
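
To get a feel for the cost of persisting after every small update without
modifying e2fsck, here is a rough sketch in C. It is not the O_DIRECT
experiment described above: it substitutes an fsync(2) after each record for
unbuffered I/O, and the function name write_records() and the 256-byte
record size are assumptions made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for the size of one small metadata update. */
#define RECORD_SIZE 256

/*
 * Write nrecords fixed-size records to path. If sync_each is nonzero,
 * call fsync(2) after every record; otherwise rely on the single
 * fsync(2) at the end. Returns 0 on success, or the negated errno of
 * the first failing call.
 */
int write_records(const char *path, int nrecords, int sync_each)
{
    char rec[RECORD_SIZE];
    memset(rec, 'x', sizeof(rec));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    for (int i = 0; i < nrecords; i++) {
        size_t done = 0;
        while (done < sizeof(rec)) {
            ssize_t n = write(fd, rec + done, sizeof(rec) - done);
            if (n < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted, retry the write */
                int err = errno;
                close(fd);
                return -err;
            }
            done += (size_t)n;
        }
        /* Force this one record to the device before continuing. */
        if (sync_each && fsync(fd) < 0) {
            int err = errno;
            close(fd);
            return -err;
        }
    }

    /* Final flush, so both modes end with the data persisted. */
    if (fsync(fd) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }
    if (close(fd) < 0)
        return -errno;
    return 0;
}
```

Timing a run with sync_each set to 1 against a run with it set to 0, on a
real disk and with a few thousand records, should give an idea of the
tradeoff; the actual numbers depend heavily on the device and file system.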