linux-kernel - Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <15f15df9-ec90-486a-a784-effb8b2cb292@meta.com>
Date: Tue, 24 Sep 2024 21:17:04 +0200
From: Chris Mason <clm@...a.com>
To: Matthew Wilcox <willy@...radead.org>, Jens Axboe <axboe@...nel.dk>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
        Dave Chinner <david@...morbit.com>,
        Christian Theune <ct@...ingcircus.io>, linux-mm@...ck.org,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Daniel Dao <dqminh@...udflare.com>, regressions@...ts.linux.dev,
        regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)

On 9/20/24 3:54 PM, Chris Mason wrote:

[ ... ]

> xas_split_alloc() does the allocation and also shoves an entry into some of
> the slots.  When the tree changes, the entry we've stored is wildly 
> wrong, but xas_reset() doesn't undo any of that.  So when we actually
> use the xas->xa_alloc nodes we've setup, they are pointing to the
> wrong things.
> 
> Which is probably why the commits in 6.10 added this:
> 
> /* entry may have changed before we re-acquire the lock */
> if (alloced_order && (old != alloced_shadow || order != alloced_order)) {
> 	xas_destroy(&xas);
>         alloced_order = 0;
> }
> 
> The only way to undo the work done by xas_split_alloc() is to call
> xas_destroy().
> 
> To prove this theory, I tried making a minimal version that also
> called destroy, but it all ended up less minimal than the code
> that's actually in 6.10.  I've got a long test going now with
> an extra cond_resched() to make the race bigger, and a printk of victory.
> 
> It hasn't fired yet, and I need to hop on an airplane, so I'll just leave
> it running for now.  But long story short, I think we should probably
> just tag all of these for stable:
> 
> https://lore.kernel.org/all/20240415171857.19244-2-ryncsn@gmail.com/T/#mdb85922624c39ea7efb775a044af4731890ff776
> 
> Also, Willy's proposed changes to xas_split_alloc() seem like a good
> idea.

A few days of load later and some extra printks, it turns out that
taking the writer lock in __filemap_add_folio() makes us dramatically
more likely to just return EEXIST than go into the xas_split_alloc() dance.

With the changes in 6.10, we only get into that xas_destroy() case above
when the conflicting entry is a shadow entry, so I changed my repro to
use memory pressure instead of fadvise.

I also added a schedule_timeout(1) after the split alloc, and with all
of that I'm able to consistently make the xas_destroy() case trigger
without causing any system instability.  Kairui Song's patches do seem
to have fixed things nicely.

-chris