linux-ext4 - Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration

Message-ID: <dpn6pb7hwpmajoh5k5zla6x7fsmh4rlttstj3hkuvunp6tok3j@ikz2fxpikfv4>
Date: Thu, 10 Apr 2025 14:05:38 +0200
From: Jan Kara <jack@...e.cz>
To: Luis Chamberlain <mcgrof@...nel.org>
Cc: brauner@...nel.org, jack@...e.cz, tytso@....edu, 
	adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org, riel@...riel.com, dave@...olabs.net, 
	willy@...radead.org, hannes@...xchg.org, oliver.sang@...el.com, david@...hat.com, 
	axboe@...nel.dk, hare@...e.de, david@...morbit.com, djwong@...nel.org, 
	ritesh.list@...il.com, linux-fsdevel@...r.kernel.org, linux-block@...r.kernel.org, 
	linux-mm@...ck.org, gost.dev@...sung.com, p.raghav@...sung.com, da.gomez@...sung.com, 
	syzbot+f3c6fda1297c748a7076@...kaller.appspotmail.com
Subject: Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on
 migration

On Wed 09-04-25 18:49:38, Luis Chamberlain wrote:
> Filesystems which use buffer-heads and cannot guarantee that there
> are no other references to the folio, for example with a folio
> lock, must use buffer_migrate_folio_norefs() for the address space
> mapping migrate_folio() callback. There are only 3 filesystems which use
> this callback:
> 
>   1) the block device cache

Well, but through this also all simple filesystems that use buffer_heads
for metadata handling...

>   2) ext4 for its ext4_journalled_aops, ie, jbd2
>   3) nilfs2
> 
> jbd2's use of this callback, however, is very race prone; consider
> folio migration while reviewing jbd2_journal_write_metadata_buffer()
> and the fact that jbd2:
> 
>   - does not hold the folio lock
>   - does not have the page writeback bit set
>   - does not lock the buffer
>
> And so, it can race with folio_set_bh() on folio migration. The commit
> ebdf4de5642fb6 ("mm: migrate: fix reference check race between
> __find_get_block() and migration") added a spin lock to prevent races
> with page migration which ext4 users were reporting through the SUSE
> bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the
> original filesystem corruption we can reproduce fs corruption on
> ext4 by just removing the spinlock and stress testing the filesystem
> with generic/750; the corruption eventually shows up after about 3 hours
> of testing with kdevops using libvirt on the ext4 profiles ext4-4k and
> ext4-2k.

Correct, jbd2 holds bh reference (its private jh structure attached to
bh->b_private holds it) and that is expected to protect jbd2 from anybody
else mucking with the bh.
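
To make the expectation concrete, here is a rough userspace model of the
"norefs" busy check. It is only a sketch: struct toy_bh and both helper
names are made up for illustration, and it ignores the locking
(i_private_lock) that the real check runs under.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct buffer_head: only the fields this sketch needs. */
struct toy_bh {
	int b_count;                /* references held, e.g. via jbd2's jh */
	struct toy_bh *b_this_page; /* circular list of buffers in one folio */
};

/* True if any buffer on the folio's circular list is still referenced. */
static bool toy_folio_buffers_busy(const struct toy_bh *head)
{
	const struct toy_bh *bh = head;

	assert(head != NULL);
	do {
		if (bh->b_count > 0)
			return true;
		bh = bh->b_this_page;
	} while (bh != head);
	return false;
}

/* Model of the decision: back off with -EAGAIN while a reference exists. */
static int toy_migrate_norefs(const struct toy_bh *head)
{
	return toy_folio_buffers_busy(head) ? -EAGAIN : 0;
}
```

In this model, a single held reference on any buffer of the folio is
enough to make migration back off.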

> It turns out that the spin lock doesn't in the end protect against
> corruption, it *helps* reduce the possibility, but ext4 filesystem
> corruption can still happen even with the spin lock held. A test was
> done using vanilla Linux and adding a udelay(2000) right before we
> spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
> we can reproduce the same exact filesystem corruption issues as observed
> without the spinlock with generic/750 [1].

This is unexpected.

> ** Reproduced on vanilla Linux with udelay(2000) **
> 
> Call trace (ENOSPC journal failure):
>   do_writepages()
>     → ext4_do_writepages()
>       → ext4_map_blocks()
>         → ext4_ext_map_blocks()
>           → ext4_ext_insert_extent()
>             → __ext4_handle_dirty_metadata()
>               → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)

Curious. Did you try running e2fsck after the filesystem complained like
this? This complains about journal handle not having enough credits for
needed metadata update. Either we've lost some update to the journal_head
structure (b_modified got accidentally cleared) or some update to extent
tree.
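
A simplified userspace model of the credit accounting I have in mind is
below. It is a sketch, not the jbd2 code: the toy_* names are invented
for illustration, and it only models one handle and one buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_jh { int b_modified; };  /* models the journal_head flag */
struct toy_handle { int credits; }; /* models remaining handle credits */

/*
 * Dirtying a buffer costs one credit the first time it is modified in a
 * transaction. If b_modified is wrongly cleared, the same buffer is
 * charged again and the handle can run out of credits.
 */
static int toy_dirty_metadata(struct toy_handle *h, struct toy_jh *jh)
{
	assert(h != NULL && jh != NULL);
	if (jh->b_modified)
		return 0;       /* already accounted for */
	if (h->credits <= 0)
		return -ENOSPC; /* no credits left for this update */
	jh->b_modified = 1;
	h->credits--;
	return 0;
}
```

With one credit, dirtying the same buffer twice succeeds, but if
b_modified is cleared in between, the second call fails with -ENOSPC.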

> And so jbd2 still needs more work to avoid races with folio migration.
> So replace the current spin lock solution by just skipping jbd buffers
> on folio migration. We identify jbd buffers as it's the only user of
> set_buffer_meta() on __ext4_handle_dirty_metadata(). By checking for
> buffer_meta() and bailing on migration we fix the existing racy ext4
> corruption while also addressing the "spin lock held while sleeping"
> complaints originally reported by 0-day [5], and pave the way for
> buffer-heads for more users of large folios other than the block
> device cache.

I think we need to understand why private_lock protection does not protect
bh users holding reference like jbd2 from folio migration before papering
over this problem with the hack. Because there are chances other simple
filesystems suffer from the same problem...

> diff --git a/mm/migrate.c b/mm/migrate.c
> index f3ee6d8d5e2e..32fa72ba10b4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>  	if (folio_ref_count(src) != expected_count)
>  		return -EAGAIN;
>  
> +	if (buffer_meta(head))
> +		return -EAGAIN;
> +
>  	if (!buffer_migrate_lock_buffers(head, mode))
>  		return -EAGAIN;
>  
> @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>  			}
>  			bh = bh->b_this_page;
>  		} while (bh != head);
> +		spin_unlock(&mapping->i_private_lock);

No, you've just broken all simple filesystems (like ext2) with this patch.
You can reduce the spinlock critical section only after providing an
alternative way to protect them from migration. So this should probably
happen at the end of the series.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR