linux-kernel - Re: ext4 damage suspected in between 5.15.167

Message-ID: <20241216193104.GB78919@mit.edu>
Date: Mon, 16 Dec 2024 14:31:04 -0500
From: "Theodore Ts'o" <tytso@....edu>
To: David Laight <David.Laight@...lab.com>
Cc: "'Nikolai Zhubr'" <zhubr.2@...il.com>,
        "linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
        "stable@...r.kernel.org" <stable@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "jack@...e.cz" <jack@...e.cz>
Subject: Re: ext4 damage suspected in between 5.15.167 - 5.15.170

On Mon, Dec 16, 2024 at 03:16:00PM +0000, David Laight wrote:
> ....
> > > The location of block allocation bitmaps never gets changed, so this
> > > sort of thing only happens due to hardware-induced corruption.
> > 
> > Well, unless e.g. some modified sectors start being flushed to random
> > wrong offsets, like in [1] above, or something similar.

Well, in the bug that you referenced in [1], what was happening was
that data could get written to the wrong offset in the file under
certain race conditions.  That is not a case of a data block getting
written over some metadata block like the block group descriptors.

Sectors getting written to the wrong LBAs do happen; there's a reason
why enterprise databases include a checksum in every 4k database
block.  But the root cause of that generally tends to be a bit getting
flipped in the LBA number when it is being sent from the CPU to the
controller to the storage device.  It's rare, but when it does happen,
it is more often than not hardware-induced --- and again, one of those
things where RAID won't necessarily save you.
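
To illustrate the point about per-block checksums, here is a minimal
sketch of how a self-describing 4k block lets a reader notice a
misdirected write.  It is not any particular database's on-disk
format: the header layout, the names seal_block and check_block, and
the dict standing in for a disk are all assumptions made up for this
example.

```python
import struct
import zlib

BLOCK_SIZE = 4096
# Hypothetical header: 64-bit block number followed by a 32-bit CRC.
HEADER = struct.Struct("<QI")
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size


def seal_block(block_no: int, payload: bytes) -> bytes:
    """Build a 4k block whose header records where it belongs."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload too large for one block")
    payload = payload.ljust(PAYLOAD_SIZE, b"\0")
    # The CRC covers the block number as well as the payload.
    crc = zlib.crc32(struct.pack("<Q", block_no) + payload)
    return HEADER.pack(block_no, crc) + payload


def check_block(expected_block_no: int, raw: bytes) -> str:
    """Classify a block read back from location expected_block_no."""
    if len(raw) != BLOCK_SIZE:
        return "bad length"
    stored_no, stored_crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if zlib.crc32(struct.pack("<Q", stored_no) + payload) != stored_crc:
        return "corrupt"
    if stored_no != expected_block_no:
        return "misdirected"
    return "ok"


# A dict stands in for the storage device (assumption for the demo).
disk = {
    1000: seal_block(1000, b"old contents of block 1000"),
    1016: seal_block(1016, b"contents of block 1016"),
}

# An update to block 1000 has one bit flipped in its LBA in transit.
intended_lba = 1000
actual_lba = intended_lba ^ (1 << 4)  # 1016
disk[actual_lba] = seal_block(intended_lba, b"new contents of block 1000")

print(check_block(1016, disk[1016]))  # misdirected
print(check_block(1000, disk[1000]))  # ok
```

Note that the block at 1000 still verifies as "ok" even though it is
now stale; a checksum stored inside the block can flag the clobbered
location but cannot by itself reveal the write that never arrived.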

> Or cutting the power in the middle of SSD 'wear levelling'.
> 
> I've seen a completely trashed disk (sectors in completely the
> wrong places) after an unexpected power cut.

Sure, but that falls in the category of hardware-induced corruption.
There have been SSDs without power-fail certification whose flash
translation metadata got so badly corrupted that you lose everything
(there's a reason why professional photographers use dual SD card
slots, and some may use duct tape to make sure the battery access door
won't fly open if their camera gets dropped).

					- Ted