Message-ID: <20250401214951.kikcrmu5k3q6qmcr@offworld>
Date: Tue, 1 Apr 2025 14:49:51 -0700
From: Davidlohr Bueso <dave@...olabs.net>
To: Jan Kara <jack@...e.cz>
Cc: Luis Chamberlain <mcgrof@...nel.org>, brauner@...nel.org, tytso@....edu,
adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org,
riel@...riel.com, willy@...radead.org, hannes@...xchg.org,
oliver.sang@...el.com, david@...hat.com, axboe@...nel.dk,
hare@...e.de, david@...morbit.com, djwong@...nel.org,
ritesh.list@...il.com, linux-fsdevel@...r.kernel.org,
linux-block@...r.kernel.org, linux-mm@...ck.org,
gost.dev@...sung.com, p.raghav@...sung.com, da.gomez@...sung.com
Subject: Re: [PATCH 2/3] fs/buffer: avoid races with folio migrations on
__find_get_block_slow()
On Tue, 01 Apr 2025, Jan Kara wrote:
>I find this problematic. It fixes the race with migration, alright
>(although IMO we should have a comment very well explaining the interplay
>of folio lock and mapping->private_lock to make this work - probably in
>buffer_migrate_folio_norefs() - and reference it from here), but there are
>places which expect that if __find_get_block() doesn't return anything,
>this block is not cached in the buffer cache. And your change breaks this
>assumption. Look for example at write_boundary_block(), that will fail to
>write the block it should write if it races with someone locking the folio
>after your changes. Similarly the code tracking state of deleted metadata
>blocks in fs/jbd2/revoke.c will fail to properly update buffer's state if
>__find_get_block() suddenly starts returning NULL although the buffer is
>present in cache.
Yeah - one thing I was thinking about, _iff_ failing lookups (__find_get_block()
returning nil) during migration are in fact permitted, was adding a BH_migrate
flag and serializing against __buffer_migrate_folio() entirely. Semantically there
are no users of the buffers, and none can be added during this window, but as a
consequence I suppose one thread could see the block not cached, act upon that,
then see it cached once the migration is done and get confused(?). So I don't see
a problem here for write_boundary_block() specifically, but I'm probably
overlooking others.
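(To illustrate what I mean wrt write_boundary_block(): the pattern there is
roughly the below - a simplified sketch, not the verbatim fs/buffer.c code - where
a NULL lookup is silently treated as "nothing cached, nothing to write", so at
worst the boundary write is skipped for the duration of the migration.)

void write_boundary_block(struct block_device *bdev,
			  sector_t bblock, unsigned blocksize)
{
	/* look the block up in the buffer cache; NULL means "not cached" */
	struct buffer_head *bh = __find_get_block(bdev, bblock + 1, blocksize);

	if (bh) {
		if (buffer_dirty(bh))
			write_dirty_buffer(bh, 0);
		put_bh(bh);
	}
	/* if the lookup returned NULL, the block is simply not written */
}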
Now, if bailing on the lookup is not an option, meaning the lookup must wait for
the migration to complete, I'm not sure large folios will ever be compatible with
the "Various filesystems appear to want __find_get_block to be non-blocking."
comment.
So the below could be tucked in for norefs only (because this is about the address
space's i_private_lock), but it also shortens the i_private_lock hold time; if that
matters at all, of course, vs changing the migration semantics.
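i.e. something like the below (untested, just to illustrate the gating on
check_refs, which IIRC is what buffer_migrate_folio_norefs() passes as true):

	/* sketch: only flag the buffers for the norefs (check_refs) case */
	if (check_refs) {
		bh = head;
		do {
			set_bit(BH_migrate, &bh->b_state);
			bh = bh->b_this_page;
		} while (bh != head);
	}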
Thanks,
Davidlohr
---8<----------------------------
diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..f585339ae2e4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -208,6 +208,14 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
 	head = folio_buffers(folio);
 	if (!head)
 		goto out_unlock;
+
+	bh = head;
+	do {
+		if (test_bit(BH_migrate, &bh->b_state))
+			goto out_unlock;
+		bh = bh->b_this_page;
+	} while (bh != head);
+
 	bh = head;
 	do {
 		if (!buffer_mapped(bh))
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 932139c5d46f..e956a1509a05 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -34,6 +34,7 @@ enum bh_state_bits {
 	BH_Meta,	/* Buffer contains metadata */
 	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
 	BH_Defer_Completion, /* Defer AIO completion to workqueue */
+	BH_migrate,	/* Buffer is being migrated */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
diff --git a/mm/migrate.c b/mm/migrate.c
index a073eb6c5009..0ffa8b478fd3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -846,6 +846,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
 	if (!buffer_migrate_lock_buffers(head, mode))
 		return -EAGAIN;
 
+	bh = head;
+	do {
+		set_bit(BH_migrate, &bh->b_state);
+		bh = bh->b_this_page;
+	} while (bh != head);
+
 	if (check_refs) {
 		bool busy;
 		bool invalidated = false;
@@ -861,12 +867,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
 			}
 			bh = bh->b_this_page;
 		} while (bh != head);
+		spin_unlock(&mapping->i_private_lock);
 		if (busy) {
 			if (invalidated) {
 				rc = -EAGAIN;
 				goto unlock_buffers;
 			}
-			spin_unlock(&mapping->i_private_lock);
 			invalidate_bh_lrus();
 			invalidated = true;
 			goto recheck_buffers;
@@ -884,10 +890,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
 	} while (bh != head);
 
 unlock_buffers:
-	if (check_refs)
-		spin_unlock(&mapping->i_private_lock);
 	bh = head;
 	do {
+		clear_bit(BH_migrate, &bh->b_state);
 		unlock_buffer(bh);
 		bh = bh->b_this_page;
 	} while (bh != head);