Message-ID: <20250410014945.2140781-2-mcgrof@kernel.org>
Date: Wed, 9 Apr 2025 18:49:38 -0700
From: Luis Chamberlain <mcgrof@...nel.org>
To: brauner@...nel.org,
jack@...e.cz,
tytso@....edu,
adilger.kernel@...ger.ca,
linux-ext4@...r.kernel.org,
riel@...riel.com
Cc: dave@...olabs.net,
willy@...radead.org,
hannes@...xchg.org,
oliver.sang@...el.com,
david@...hat.com,
axboe@...nel.dk,
hare@...e.de,
david@...morbit.com,
djwong@...nel.org,
ritesh.list@...il.com,
linux-fsdevel@...r.kernel.org,
linux-block@...r.kernel.org,
linux-mm@...ck.org,
gost.dev@...sung.com,
p.raghav@...sung.com,
da.gomez@...sung.com,
mcgrof@...nel.org,
syzbot+f3c6fda1297c748a7076@...kaller.appspotmail.com
Subject: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration

Filesystems which use buffer-heads, and which cannot guarantee (for
example via the folio lock) that there are no other references to the
folio, must use buffer_migrate_folio_norefs() as their address space
mapping migrate_folio() callback. There are only 3 filesystems which
use this callback (a minimal wiring sketch follows the list):
1) the block device cache
2) ext4 for its ext4_journalled_aops, ie, jbd2
3) nilfs2
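
For reference, wiring this up looks roughly as follows. This is a
minimal sketch, not the actual in-tree tables of the three users above;
only the relevant member is shown:

    static const struct address_space_operations example_journalled_aops = {
            /*
             * Buffers can be looked up and referenced behind our back,
             * so the _norefs variant must be used for migration.
             */
            .migrate_folio  = buffer_migrate_folio_norefs,
    };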

jbd2's use of this callback however is very race prone. Consider folio
migration while reviewing jbd2_journal_write_metadata_buffer() and the
fact that jbd2:

  - does not hold the folio lock
  - does not have the page writeback bit set
  - does not lock the buffer

And so, it can race with folio_set_bh() on folio migration. Commit
ebdf4de5642fb6 ("mm: migrate: fix reference check race between
__find_get_block() and migration") added a spin lock to prevent races
with page migration which ext4 users were reporting through the SUSE
bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the
original filesystem corruption, we can reproduce fs corruption on ext4
by just removing the spinlock and stress testing the filesystem with
generic/750; corruption eventually shows up after about 3 hours of
testing with kdevops using libvirt on the ext4 profiles ext4-4k and
ext4-2k. It turns out that the spin lock doesn't in the end protect
against corruption, it *helps* reduce the possibility, but ext4
filesystem corruption can still happen even with the spin lock held. A
test was done on vanilla Linux by adding a udelay(2000) right before
the spin_lock(&bd_mapping->i_private_lock) in __find_get_block_slow(),
and with that we can reproduce the exact same filesystem corruption
issues with generic/750 as observed without the spinlock [1].
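
The delay injection was roughly the following (a sketch of what the
branch in [1] does, not the verbatim hunk; udelay() comes from
<linux/delay.h>):

    /* in __find_get_block_slow(), right before taking the private lock */
    udelay(2000);   /* widen the race window with folio migration */
    spin_lock(&bd_mapping->i_private_lock);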

We now have a slew of traces collected for the ext4 corruptions
possible without the current spin lock held [2] [3] [4], but the
general pattern is as follows, as summarized by ChatGPT from all the
traces:

  do_writepages()                     # write back -->
    ext4_map_blocks()                 # performs logical to physical block mapping -->
      ext4_ext_insert_extent()        # updates extent tree -->
        jbd2_journal_dirty_metadata() # marks metadata as dirty for
                                      # journaling. This can lead
                                      # to any of the following hints
                                      # as to what happened from
                                      # ext4 / jbd2:
  - Directory and extent metadata corruption splats, or
  - Failure to handle out-of-space conditions gracefully, with
    cascading metadata errors and eventual filesystem shutdown
    to prevent further damage.
  - Failure to journal new extent metadata during extent tree
    growth, triggered under memory pressure or heavy writeback.
    Commonly results in ENOSPC, journal abort, and read-only
    fallback. **
  - Journal metadata failure during extent tree growth causes
    read-only fallback. Seen repeatedly on small-block (2k)
    filesystems under stress (e.g. fsstress). Triggers errors in
    bitmap and inode updates, and persists in journal replay logs.
    "Error count since last fsck" shows long-term corruption
    footprint.

** Reproduced on vanilla Linux with udelay(2000) **
Call trace (ENOSPC journal failure):
do_writepages()
→ ext4_do_writepages()
→ ext4_map_blocks()
→ ext4_ext_map_blocks()
→ ext4_ext_insert_extent()
→ __ext4_handle_dirty_metadata()
→ jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)
And so jbd2 still needs more work to avoid races with folio migration.
So replace the current spin lock solution by just skipping jbd2 buffers
on folio migration. We can identify jbd2 buffers because
__ext4_handle_dirty_metadata() is the only user of set_buffer_meta().
By checking for buffer_meta() and bailing out of migration we fix the
existing racy ext4 corruption, remove the complaints about holding the
spin lock while sleeping originally reported by 0-day [5], and pave the
way for more users of large folios with buffer-heads beyond the block
device cache.
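
For context, the meta bit we key off of comes from ext4's journalling
helper; a trimmed excerpt (see fs/ext4/ext4_jbd2.c for the full
function, only the relevant lines are kept):

    int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
                                     handle_t *handle, struct inode *inode,
                                     struct buffer_head *bh)
    {
            ...
            set_buffer_meta(bh);    /* only caller of set_buffer_meta() */
            set_buffer_prio(bh);
            ...
            err = jbd2_journal_dirty_metadata(handle, bh);
            ...
    }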
Link: https://bugzilla.suse.com/show_bug.cgi?id=1137609 # [0]
Link: https://web.git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20250408-ext4-jbd-force-corruption # [1]
Link: https://lkml.kernel.org/r/Z-ZwToVfJbdTVRtG@bombadil.infradead.org # [2]
Link: https://lore.kernel.org/all/Z-rzyrS0Jr7t984Y@bombadil.infradead.org/ # [3]
Link: https://lore.kernel.org/all/Z_AA9SHZdRGq6tb4@bombadil.infradead.org/ # [4]
Reported-by: kernel test robot <oliver.sang@...el.com>
Reported-by: syzbot+f3c6fda1297c748a7076@...kaller.appspotmail.com
Closes: https://lore.kernel.org/oe-lkp/202503101536.27099c77-lkp@intel.com # [5]
Fixes: ebdf4de5642fb6 ("mm: migrate: fix reference check race between __find_get_block() and migration")
Signed-off-by: Luis Chamberlain <mcgrof@...nel.org>
---
mm/migrate.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index f3ee6d8d5e2e..32fa72ba10b4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
if (folio_ref_count(src) != expected_count)
return -EAGAIN;
+ if (buffer_meta(head))
+ return -EAGAIN;
+
if (!buffer_migrate_lock_buffers(head, mode))
return -EAGAIN;
@@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
}
bh = bh->b_this_page;
} while (bh != head);
+ spin_unlock(&mapping->i_private_lock);
if (busy) {
if (invalidated) {
rc = -EAGAIN;
goto unlock_buffers;
}
- spin_unlock(&mapping->i_private_lock);
invalidate_bh_lrus();
invalidated = true;
goto recheck_buffers;
@@ -880,10 +883,7 @@ static int __buffer_migrate_folio(struct address_space *mapping,
folio_set_bh(bh, dst, bh_offset(bh));
bh = bh->b_this_page;
} while (bh != head);
-
unlock_buffers:
- if (check_refs)
- spin_unlock(&mapping->i_private_lock);
bh = head;
do {
unlock_buffer(bh);
--
2.47.2