Message-Id: <0C392D79-DAB1-4730-B2AB-B2B8CF100F11@flyingcircus.io>
Date: Wed, 18 Sep 2024 10:31:09 +0200
From: Christian Theune <ct@...ingcircus.io>
To: Dave Chinner <david@...morbit.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
 Jens Axboe <axboe@...nel.dk>,
 Matthew Wilcox <willy@...radead.org>,
 linux-mm@...ck.org,
 "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
 linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org,
 Daniel Dao <dqminh@...udflare.com>,
 clm@...a.com,
 regressions@...ts.linux.dev,
 regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)



> On 16. Sep 2024, at 09:14, Christian Theune <ct@...ingcircus.io> wrote:
> 
>> 
>> On 16. Sep 2024, at 02:00, Dave Chinner <david@...morbit.com> wrote:
>> 
>> I don't think this is a data corruption/loss problem - it certainly
>> hasn't ever appeared that way to me.  The "data loss" appeared to be
>> in incomplete postgres dump files after the system was rebooted and
>> this is exactly what would happen when you randomly crash the
>> system. i.e. dirty data in memory is lost, and application data
>> being written at the time is in an inconsistent state after the
>> system recovers. IOWs, there was no clear evidence of actual data
>> corruption occurring, and data loss is definitely expected when the
>> page cache iteration hangs and the system is forcibly rebooted
>> without being able to sync or unmount the filesystems…
>> 
>> All the hangs seem to be caused by folio lookup getting stuck
>> on a rogue xarray entry in truncate or readahead. If we find an
>> invalid entry or a folio from a different mapping or with an
>> unexpected index, we skip it and try again.  Hence this does not
>> appear to be a data corruption vector, either - it results in a
>> livelock from endless retry because of the bad entry in the xarray.
>> This endless retry livelock appears to be what is being reported.
>> 
>> IOWs, there is no evidence of real runtime data corruption or loss
>> from this pagecache livelock bug.  We also haven't heard of any
>> random file data corruption events since we've enabled large folios
>> on XFS. Hence there really is no evidence to indicate that there is
>> a large folio xarray lookup bug that results in data corruption in
>> the existing code, and therefore there is no obvious reason for
>> turning off the functionality we are already building significant
>> new functionality on top of.
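
To make the failure mode concrete, here is a rough userspace model of the skip-and-retry pattern described in the quoted paragraph above. It is only a sketch with made-up types and names, not the actual mm truncate/readahead code: the point is that if the lookup keeps handing back the same bad entry, "skip it and try again" never makes progress.

/*
 * Hypothetical userspace model of the skip-and-retry pattern; not the
 * real mm/truncate.c or readahead code.  If the lookup keeps returning
 * the same bad entry, the retry loop never advances.
 */
#include <stdio.h>
#include <stdbool.h>

struct folio {
	void *mapping;
	unsigned long index;
};

/* Simulated xarray lookup: always returns the same stale entry. */
static struct folio *lookup(void *mapping, unsigned long index)
{
	static struct folio stale = { .mapping = NULL, .index = 42 };

	(void)mapping;
	(void)index;
	return &stale;
}

static bool entry_ok(struct folio *f, void *mapping, unsigned long index)
{
	return f->mapping == mapping && f->index == index;
}

int main(void)
{
	void *mapping = (void *)0x1;	/* the mapping being truncated */
	unsigned long index = 7;
	unsigned long retries = 0;

	for (;;) {
		struct folio *f = lookup(mapping, index);

		if (entry_ok(f, mapping, index))
			break;		/* normal case: process the folio */

		/*
		 * Bad entry: skip it and try again.  If the entry never
		 * changes, this spins forever -- the reported livelock.
		 * Bounded here only so the demo terminates.
		 */
		if (++retries > 5) {
			printf("still retrying after %lu attempts: livelock\n",
			       retries);
			break;
		}
	}
	return 0;
}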

I’ve been chewing on this some more and have reviewed the tickets I have. We did see a PostgreSQL database end up reporting "ERROR: invalid page in block 30896 of relation base/16389/103292".

My understanding of the argument that this bug does not corrupt data is that the failure would only lead to a crash-consistent state. So applications that can properly recover from a crash-consistent state would only experience data loss up to the point of the crash (which is fine and expected), but should not end up in a further corrupted state.

PostgreSQL reporting this error indicates - to my knowledge - that it did not see a crash-consistent state of the file system.
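
For context, an update that relies only on crash-consistent semantics looks roughly like the sketch below (hypothetical file names, and not PostgreSQL's actual WAL machinery): the data is fsync'ed before the rename that publishes it, so after a crash either the old contents or the complete new contents are visible, never a torn page.

/*
 * Sketch (hypothetical file names) of an update that relies only on
 * crash-consistent filesystem semantics: write, fsync, rename, fsync
 * the directory.  After a crash, either the old file or the complete
 * new contents should be visible, never a torn page.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *tmp = "data.tmp", *final = "data";
	const char buf[] = "new contents\n";

	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) { perror("open"); return 1; }

	if (write(fd, buf, sizeof buf - 1) != (ssize_t)(sizeof buf - 1)) {
		perror("write"); return 1;
	}
	if (fsync(fd) < 0) { perror("fsync"); return 1; }	/* data is durable */
	close(fd);

	if (rename(tmp, final) < 0) { perror("rename"); return 1; }

	int dirfd = open(".", O_RDONLY);			/* parent directory */
	if (dirfd < 0) { perror("open ."); return 1; }
	if (fsync(dirfd) < 0) { perror("fsync ."); return 1; }	/* rename is durable */
	close(dirfd);

	return 0;
}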

Christian

-- 
Christian Theune · ct@...ingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

