linux-ext4 - Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <dkjq2c57du34wq7ocvtk37a5gkcondxfedgnbdxse55nhlfioy@v6tx45lkopfm>
Date: Tue, 15 Apr 2025 18:23:54 +0200
From: Jan Kara <jack@...e.cz>
To: Luis Chamberlain <mcgrof@...nel.org>
Cc: Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, 
	tytso@....edu, adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org, 
	riel@...riel.com, dave@...olabs.net, willy@...radead.org, hannes@...xchg.org, 
	oliver.sang@...el.com, david@...hat.com, axboe@...nel.dk, hare@...e.de, 
	david@...morbit.com, djwong@...nel.org, ritesh.list@...il.com, 
	linux-fsdevel@...r.kernel.org, linux-block@...r.kernel.org, linux-mm@...ck.org, 
	gost.dev@...sung.com, p.raghav@...sung.com, da.gomez@...sung.com, 
	syzbot+f3c6fda1297c748a7076@...kaller.appspotmail.com
Subject: Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on
 migration

On Tue 15-04-25 08:47:51, Luis Chamberlain wrote:
> On Tue, Apr 15, 2025 at 11:05:38AM +0200, Christian Brauner wrote:
> > On Mon, Apr 14, 2025 at 03:19:33PM -0700, Luis Chamberlain wrote:
> > > On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > > > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > > > >  			}
> > > > > >  			bh = bh->b_this_page;
> > > > > >  		} while (bh != head);
> > > > > > +		spin_unlock(&mapping->i_private_lock);
> > > > > 
> > > > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > > > You can reduce the spinlock critical section only after providing
> > > > > alternative way to protect them from migration. So this should probably
> > > > > happen at the end of the series.
> > > > 
> > > > So you're OK with this spin lock move with the other series in place?
> > > > 
> > > > And so we punt the hard-to-reproduce corruption issue as future work
> > > > to do? Becuase the other alternative for now is to just disable
> > > > migration for jbd2:
> > > > 
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index 1dc09ed5d403..ef1c3ef68877 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> > > >  	.bmap			= ext4_bmap,
> > > >  	.invalidate_folio	= ext4_journalled_invalidate_folio,
> > > >  	.release_folio		= ext4_release_folio,
> > > > -	.migrate_folio		= buffer_migrate_folio_norefs,
> > > >  	.is_partially_uptodate  = block_is_partially_uptodate,
> > > >  	.error_remove_folio	= generic_error_remove_folio,
> > > >  	.swap_activate		= ext4_iomap_swap_activate,
> > > 
> > > BTW I ask because.. are your expectations that the next v3 series also
> > > be a target for Linus tree as part of a fix for this spinlock
> > > replacement?
> > 
> > Since this is fixing potential filesystem corruption I will upstream
> > whatever we need to do to fix this. Ideally we have a minimal fix to
> > upstream now and a comprehensive fix and cleanup for v6.16.
> 
> Despite our efforts we don't yet have an agreement on how to fix the
> ext4 corruption, becuase Jan noted the buffer_meta() check in this patch
> is too broad and would affect other filesystems (I have yet to
> understand how, but will review).
> 
> And so while we have agreement we can remove the spin lock to fix the
> sleeping while atomic incurred by large folios for buffer heads by this
> patch series, the removal of the spin lock would happen at the end of
> this series.
> 
> And so the ext4 corruption is an existing issue as-is today, its
> separate from the spin lock removal goal to fix the sleeping while
> atomic..

I agree. Ext4 corruption problems are separate from sleeping in atomic
issues.

> However this series might be quite big for an rc2 or rc3 fix for that spin
> lock removal issue. It should bring in substantial performance benefits
> though, so it might be worthy to consider. We can re-run tests with the
> adjustment to remove the spin lock until the last patch in this series.
> 
> The alternative is to revert the spin lock addition commit for Linus'
> tree, ie commit ebdf4de5642fb6 ("mm: migrate: fix reference check race
> between __find_get_block() and migration") and note that it in fact does
> not fix the ext4 corruption as we've noted, and in fact causes an issue
> with sleeping while atomic with support for large folios for buffer
> heads. If we do that then we  punt this series for the next development
> window, and it would just not have the spin lock removal on the last
> patch.

Well, the commit ebdf4de5642fb6 is 6 years old. At that time there were no
large folios (in fact there were no folios at all ;)) in the page cache and
it does work quite well (I didn't see a corruption report from real users
since then). So I don't like removing that commit because it makes a
"reproducible with a heavy stress test" problem become a "reproduced by
real world workloads" problem.

If you look for a fast way to fixup sleep in atomic issues, then I'd
suggest just disabling large pages for block device page cache. That is the
new functionality that actually triggered all these investigations and
sleep-in-atomic reports. And once this patch set gets merged, we can
reenable large folios in the block device page cache again.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR