linux-kernel - Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZusKGCXhponXOh_l@casper.infradead.org>
Date: Wed, 18 Sep 2024 18:12:56 +0100
From: Matthew Wilcox <willy@...radead.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Chris Mason <clm@...a.com>, Jens Axboe <axboe@...nel.dk>,
	Dave Chinner <david@...morbit.com>,
	Christian Theune <ct@...ingcircus.io>, linux-mm@...ck.org,
	"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	Daniel Dao <dqminh@...udflare.com>, regressions@...ts.linux.dev,
	regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)

On Wed, Sep 18, 2024 at 04:39:56PM +0200, Linus Torvalds wrote:
> The fact that this bug was fixed basically entirely by mistake does
> say "this is much too subtle".

Yup.

> Of course, the fact that an xas_reset() not only resets the walk, but
> also clears any pending errors (because it's all the same "xa_node"
> thing), doesn't make things more obvious. Because right now you
> *could* treat errors as "cumulative", but if a xas_split_alloc() does
> an xas_reset() on success, that means that it's actually a big
> conceptual change and you can't do the "cumulative" thing any more.

So ... the way xas was intended to work is that the first thing we did
that set an error meant that everything after it was a no-op.  You
can see that in functions like xas_start() which do:

        if (xas_error(xas))
                return NULL;

obviously something like xas_unlock() isn't a noop because you still
want to unlock even if you had an error.

The xas_split_alloc() was done in too much of a hurry.  I had thought
that I wouldn't need it, and then found out that it was a prerequisite
for something I needed to do, and so I wasn't in the right frame of mind
when I wrote it.

It's actually a giant pain and I wanted to redo it even before this, as
well as clear up some pieces from xas_nomem() / __xas_nomem().  The
restriction on "we can only split to one additional level" is awful,
and has caused some contortions elsewhere.

> End result: it would probably make sense to change "xas_split_alloc()"
> to explicitly *not* have that "check xas_error() afterwards as if it
> could be cumulative", and instead make it very clearly have no history
> and change the semantics to

What it really should do is just return if it's already in an error state.
That makes it consistent with the rest of the API, and we don't have to
worry about it losing an already-found error.

But also all the other infelicities with it need to be fixed.