linux-ext4 - Re: ext4 damage suspected in between 5.15.167


Message-ID: <ce9055d7-7301-0abe-3609-3a4e2e7b1e5e@gmail.com>
Date: Sat, 14 Dec 2024 22:58:24 +0300
From: Nikolai Zhubr <zhubr.2@...il.com>
To: Theodore Ts'o <tytso@....edu>
Cc: linux-ext4@...r.kernel.org, stable@...r.kernel.org,
 linux-kernel@...r.kernel.org, jack@...e.cz
Subject: Re: ext4 damage suspected in between 5.15.167 - 5.15.170

Hi Ted,

On 12/13/24 19:12, Theodore Ts'o wrote:
> stable@...nel.org" to the commit description.  However, they are not
> obligated to do that, so there is an auxiliary system which uses AI to
> intuit which patches might be a bug fix.  There are also automated
> systems that try to automatically figure out which patches might be

Oh, so meanwhile it got even worse than I used to imagine :-) Thanks for
pointing that out.

> Note that some hardware errors can be caused by one-off errors, such
> as cosmic rays causing a bit-flip in memory DIMM.  If that happens,
> RAID won't save you, since the error was introduced before an updated

Certainly cosmic rays are a possibility, but based on previous episodes
I'd still rather bet on a more usual "subtle interaction" problem,
either exactly the same as or similar to [1].
I even tried to run an existing test for this particular case, as
described in [2], but it is not very user-friendly and somehow exits
abnormally without doing any interesting work. I'll get back to
it later when I have some time.

[1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
[2] https://lwn.net/Articles/954364/

> The location of block allocation bitmaps never gets changed, so this
> sort of thing only happens due to hardware-induced corruption.

Well, unless e.g. some modified sectors start being flushed to random 
wrong offsets, like in [1] above, or something similar.

> Looking at the dumpe2fs output, it looks like it was created
> relatively recently (July 2024) but it doesn't have the metadata
> checksum feature enabled, which has been enabled for quite a long

Yes. That was intentional, for better compatibility with even more
ancient stuff. Maybe the time has come to reconsider that approach, though.

> You got lucky because the block allocation bitmap location was
> corrupted to an obviously invalid value.  But if it had been a

Absolutely. I was really amazed when I realized that :-)
It saved me days or even weeks of unnecessary verification work.

> Otherwise, I strongly encourage you to learn, and to take
> responsibility for the health of your own system.  And ideally, you
> can also use that knowledge to help other users out, which is the only
> way the free-as-in-beer ecosystem can flourish; by having everybody

True. Generally I try to follow that, as far as possible.
It is sad that direct end-user-to-developer communication for solving
issues is becoming increasingly problematic here.
Anyway, thank you for the friendly words, useful hints and good references!

Regards,

Nick

> helping each other.  Who knows, maybe you could even get a job doing
> it for a living.  :-) :-) :-)
> 
> Cheers,
>