linux-kernel - Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wjty_0NfiZn2HVzT0Ye-RR09+Rqbd1azwJLOTJrX+V5MQ@mail.gmail.com>
Date: Sat, 12 Oct 2024 10:01:08 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Chris Mason <clm@...a.com>
Cc: Christian Theune <ct@...ingcircus.io>, Dave Chinner <david@...morbit.com>, 
	Matthew Wilcox <willy@...radead.org>, Jens Axboe <axboe@...nel.dk>, linux-mm@...ck.org, 
	"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Daniel Dao <dqminh@...udflare.com>, 
	regressions@...ts.linux.dev, regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)

On Fri, 11 Oct 2024 at 06:06, Chris Mason <clm@...a.com> wrote:
>
> - Linus's starvation observation.  It doesn't feel like there's enough
> load to cause this, especially given us sitting in truncate, where it
> should be pretty unlikely to have multiple procs banging on the page in
> question.

Yeah, I think the starvation can only possibly happen in
fdatasync-like paths where it's waiting for existing writeback without
holding the page lock. And while Christian has had those backtraces
too, the truncate path is not one of them.

That said, just because I wanted to see how nasty it is, I looked into
changing the rules for folio_wake_bit().

Christian, just to clarify, this is not for  you to test - this is
very experimental - but maybe Willy has comments on it.

Because it *might* be possible to do something like the attached,
where we do the page flags changes atomically but without any locks if
there are no waiters, but if there is a waiter on the page, we always
clear the page flag bit atomically under the waitqueue lock as we wake
up the waiter.

I changed the name (and the return value) of the
folio_xor_flags_has_waiters() function to just not have any
possibility of semantic mixup, but basically instead of doing the xor
atomically and unconditionally (and returning whether we had waiters),
it now does it conditionally only if we do *not* have waiters, and
returns true if successful.

And if there were waiters, it moves the flag clearing into the wakeup function.

That in turn means that the "while whiteback" loop can go back to be
just a non-looping "if writeback", and folio_wait_writeback() can't
get into any starvation with new writebacks always showing up.

The reason I say it *might* be possible to do something like this is
that it changes __folio_end_writeback() to no longer necessarily clear
the writeback bit under the XA lock. If there are waiters, we'll clear
it later (after releasing the lock) in the caller.

Willy? What do you think? Clearly this now makes PG_writeback not
synchronized with the PAGECACHE_TAG_WRITEBACK tag, but the reason I
think it might be ok is that the code that *sets* the PG_writeback bit
in __folio_start_writeback() only ever starts with a page that isn't
under writeback, and has a

        VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);

at the top of the function even outside the XA lock. So I don't think
these *need* to be synchronized under the XA lock, and I think the
folio flag wakeup atomicity might be more important than the XA
writeback tag vs folio writeback bit.

But I'm not going to really argue for this patch at all - I wanted to
look at how bad it was, I wrote it, I'm actually running it on my
machine now and it didn't *immediately* blow up in my face, so it
*may* work just fine.

The patch is fairly simple, and apart from the XA tagging issue is
seems very straightforward. I'm just not sure it's worth synchronizing
one part just to at the same time de-synchronize another..

                   Linus

View attachment "0001-Test-atomic-folio-bit-waiting.patch" of type "text/x-patch" (5519 bytes)