linux-kernel - Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

Message-Id: <969BEE75-323B-4331-8E09-60AA3E662EC6@flyingcircus.io>
Date: Fri, 13 Sep 2024 00:11:14 +0200
From: Christian Theune <ct@...ingcircus.io>
To: Matthew Wilcox <willy@...radead.org>
Cc: linux-mm@...ck.org,
 "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
 linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org,
 torvalds@...ux-foundation.org,
 axboe@...nel.dk,
 Daniel Dao <dqminh@...udflare.com>,
 Dave Chinner <david@...morbit.com>,
 clm@...a.com,
 regressions@...ts.linux.dev,
 regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)

Hi Matthew,

> On 12. Sep 2024, at 23:55, Matthew Wilcox <willy@...radead.org> wrote:
> 
> On Thu, Sep 12, 2024 at 11:18:34PM +0200, Christian Theune wrote:
>> This bug is very hard to reproduce but has been known to exist as a
>> “fluke” for a while already. I have invested a number of days trying
>> to come up with workloads to trigger it quicker than that stochastic
>> “once every few weeks in a fleet of 1.5k machines”, but it eludes
>> me so far. I know that this also affects Facebook/Meta as well as
>> Cloudflare who are both running newer kernels (at least 6.1, 6.6,
>> and 6.9) with the above mentioned patch reverted. I’m from a much
>> smaller company and seeing that those guys are running with this patch
>> reverted (that now makes their kernel basically an untested/unsupported
>> deviation from the mainline) smells like desperation. I’m with a
>> much smaller team and company and I’m wondering why this isn’t
>> tackled more urgently from more hands to make it shallow (hopefully).
> 
> This passive-aggressive nonsense is deeply aggravating.  I've known
> about this bug for much longer, but like you I am utterly unable to
> reproduce it.  I've spent months looking for the bug, and I cannot.

I’m sorry. I honestly tried my best to keep this message from being personally hurtful to anybody involved, while still communicating the seriousness of the issue we’re stuck with. Apparently I failed.

As I’m not a kernel developer, I tried to stick to describing the issue, and I’m not sure what strategies are typically applied when individual efforts fail.

I’m not sure why it’s nonsense, though.

Kind regards,
Christian Theune

-- 
Christian Theune · ct@...ingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick