Message-ID: <ZuSBPrN2CbWMlr3f@casper.infradead.org>
Date: Fri, 13 Sep 2024 19:15:26 +0100
From: Matthew Wilcox <willy@...radead.org>
To: Chris Mason <clm@...a.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Jens Axboe <axboe@...nel.dk>, Christian Theune <ct@...ingcircus.io>,
linux-mm@...ck.org,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Daniel Dao <dqminh@...udflare.com>,
Dave Chinner <david@...morbit.com>, regressions@...ts.linux.dev,
regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
folios since Dec 2021 (any kernel from 6.1 upwards)

On Fri, Sep 13, 2024 at 12:33:49PM -0400, Chris Mason wrote:
> > If you could get the precise index numbers, that would be an important
> > clue. It would be interesting to know the index number in the xarray
> > where the folio was found rather than folio->index (as I suspect that
> > folio->index is completely bogus because folio->mapping is wrong).
> > But gathering that info is going to be hard.
>
> This particular debug session was late at night while we were urgently
> trying to roll out some NFS features. I didn't really save many of the
> details because my plan was to reproduce it and make a full bug report.
>
> Also, I was explaining the details to people in workplace chat, which is
> wildly bad at rendering long lines of structured text, especially when
> half the people in the chat are on a mobile device.
>
> You're probably wondering why all of that is important...what I'm really
> trying to say is that I've attached a screenshot of the debugging output.
>
> It came from an older drgn script, where I'm still clinging to "radix",
> and you probably can't trust the string representation of the page flags
> because I wasn't yet using Omar's helpers and may have hard coded them
> from an older kernel.

That's all _fine_. This is enormously helpful.

First, we see the same folio appear three times. I think that's
particularly significant. Modulo 64 (the number of entries per node),
the indices the bad folio is found at are 16, 32 and 48. So I think
the _current_ order of the folio is 4, but at the time the folio was
put in the xarray, it was order 6. Except ... at order-6 we elide a
level of the xarray. So we shouldn't be able to see this. Hm.
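
To make the arithmetic concrete, here's a quick userspace sketch (not
kernel code; the offsets are just the ones read off your screenshot):

#include <stdio.h>

int main(void)
{
	/* Offsets (mod 64, the xarray node size) at which the bad
	 * folio was seen. */
	unsigned long offsets[] = { 16, 32, 48 };

	/* An order-n folio occupies 2^n consecutive indices, so its
	 * entries within one node are spaced 2^n slots apart. */
	unsigned long spacing = offsets[1] - offsets[0];
	int order = __builtin_ctzl(spacing);

	printf("spacing %lu => order-%d folio now\n", spacing, order);

	/* An order-6 folio covers all 64 slots of a node, so the
	 * xarray stores it one level up and never allocates that
	 * node at all -- which is why seeing this is surprising. */
	return 0;
}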

Oh! I think split is the key. Let's say we have an order-6 (or
larger) folio, and we call split_huge_page() (whatever it's called
in your kernel version). That calls xas_split_alloc() followed
by xas_split(). xas_split_alloc() puts the entry in node->slots[0] and
initialises node->slots[1..XA_CHUNK_SIZE] to sibling entries.
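
For anyone following along, the sequence looks roughly like this (a
simplified sketch of the page cache split path; the exact call sites,
error handling and locking vary by kernel version):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/xarray.h>

/* Sketch only: split one large folio's multi-index entry down to
 * order-0 entries in the page cache. */
static void split_folio_entry(struct address_space *mapping,
			      struct folio *folio)
{
	XA_STATE(xas, &mapping->i_pages, folio->index);
	unsigned int order = folio_order(folio);

	/* May sleep: preallocates any child nodes the split needs. */
	xas_split_alloc(&xas, folio, order, GFP_KERNEL);
	if (xas_error(&xas))
		return;

	xas_lock_irq(&xas);
	/* Consumes the preallocated nodes, replacing the single
	 * multi-index entry with one entry per sub-page. */
	xas_split(&xas, folio, order);
	xas_unlock_irq(&xas);
}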

Now, if we do allocate those nodes in xas_split_alloc() but never
consume them, we're supposed to free them with
radix_tree_node_rcu_free(), which zeroes all the slots. But what if,
somehow, we don't? (This is my best current theory.) Then the node
gets allocated to a different tree, and any time we look something up,
unless it's the index for which the node was allocated, we find a
leftover sibling entry pointing at a stale folio.
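
To spell that out, here's a toy userspace model of the failure mode
(the sibling encoding below is invented for illustration; the real one
is xa_mk_sibling(), and the real zeroing lives in
radix_tree_node_rcu_free()):

#include <stdio.h>
#include <stdint.h>

#define CHUNK 64

/* Toy encoding: low bit set means "sibling of slot n". */
#define SIBLING(n)	((void *)(((uintptr_t)(n) << 1) | 1))
#define IS_SIBLING(p)	(((uintptr_t)(p)) & 1)
#define SIBLING_OF(p)	((unsigned int)((uintptr_t)(p) >> 1))

static void *lookup(void **slots, unsigned int idx)
{
	void *entry = slots[idx];

	if (IS_SIBLING(entry))	/* chase the sibling, as a lookup would */
		entry = slots[SIBLING_OF(entry)];
	return entry;
}

int main(void)
{
	void *slots[CHUNK];
	int tree_a_folio, tree_b_entry;	/* stand-ins; even addresses */
	unsigned int i;

	/* 1. xas_split_alloc() for tree A: real entry in slot 0,
	 *    sibling entries everywhere else. */
	slots[0] = &tree_a_folio;
	for (i = 1; i < CHUNK; i++)
		slots[i] = SIBLING(0);

	/* 2. Node freed WITHOUT the zeroing radix_tree_node_rcu_free()
	 *    would do, then handed out again by the slab. */

	/* 3. Tree B initialises only the slot it allocated the node
	 *    for, leaving the rest untouched. */
	slots[5] = &tree_b_entry;

	/* 4. Any other index chases a stale sibling to tree A's folio. */
	printf("index  5 -> %s\n",
	       lookup(slots, 5) == &tree_b_entry ? "tree B's entry" : "?");
	printf("index 16 -> %s\n",
	       lookup(slots, 16) == &tree_a_folio ? "tree A's stale folio!" : "?");
	return 0;
}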

I'm going to think on this a bit more, but so far this is all good
evidence for my leading theory.