Message-ID: <CAL3q7H7eSw0gg=JQwiEe9_pSEqcxugpgOWJDdv45UHrZbsFhzw@mail.gmail.com>
Date: Tue, 18 Feb 2025 16:28:35 +0000
From: Filipe Manana <fdmanana@...nel.org>
To: Daniel Vacek <neelx@...e.com>
Cc: Hao-ran Zheng <zhenghaoran154@...il.com>, clm@...com, josef@...icpanda.com,
dsterba@...e.com, linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org,
baijiaju1990@...il.com, 21371365@...a.edu.cn
Subject: Re: [PATCH] btrfs: fix data race when accessing the block_group's
used field
On Tue, Feb 18, 2025 at 4:14 PM Daniel Vacek <neelx@...e.com> wrote:
>
> On Tue, 18 Feb 2025 at 11:19, Filipe Manana <fdmanana@...nel.org> wrote:
> >
> > On Tue, Feb 18, 2025 at 8:08 AM Daniel Vacek <neelx@...e.com> wrote:
> > >
> > > On Mon, 10 Feb 2025 at 12:11, Filipe Manana <fdmanana@...nel.org> wrote:
> > > >
> > > > On Sat, Feb 8, 2025 at 7:38 AM Hao-ran Zheng <zhenghaoran154@...il.com> wrote:
> > > > >
> > > > > A data race may occur when `btrfs_discard_queue_work()` and
> > > > > `btrfs_update_block_group()` are executed concurrently. Specifically,
> > > > > while `btrfs_update_block_group()` executes the store
> > > > > `cache->used = old_val;`, `btrfs_discard_queue_work()` may concurrently
> > > > > read the same field at `if (block_group->used == 0)`, resulting in a
> > > > > data race that can cause the block group to be placed unexpectedly in
> > > > > discard_list or discard_unused_list. The call stacks are as follows:
> > > > >
> > > > > ============DATA_RACE============
> > > > > btrfs_discard_queue_work+0x245/0x500 [btrfs]
> > > > > __btrfs_add_free_space+0x3066/0x32f0 [btrfs]
> > > > > btrfs_add_free_space+0x19a/0x200 [btrfs]
> > > > > unpin_extent_range+0x847/0x2120 [btrfs]
> > > > > btrfs_finish_extent_commit+0x9a3/0x1840 [btrfs]
> > > > > btrfs_commit_transaction+0x5f65/0xc0f0 [btrfs]
> > > > > transaction_kthread+0x764/0xc20 [btrfs]
> > > > > kthread+0x292/0x330
> > > > > ret_from_fork+0x4d/0x80
> > > > > ret_from_fork_asm+0x1a/0x30
> > > > > ============OTHER_INFO============
> > > > > btrfs_update_block_group+0xa9d/0x2430 [btrfs]
> > > > > __btrfs_free_extent+0x4f69/0x9920 [btrfs]
> > > > > __btrfs_run_delayed_refs+0x290e/0xd7d0 [btrfs]
> > > > > btrfs_run_delayed_refs+0x317/0x770 [btrfs]
> > > > > flush_space+0x388/0x1440 [btrfs]
> > > > > btrfs_preempt_reclaim_metadata_space+0xd65/0x14c0 [btrfs]
> > > > > process_scheduled_works+0x716/0xf10
> > > > > worker_thread+0xb6a/0x1190
> > > > > kthread+0x292/0x330
> > > > > ret_from_fork+0x4d/0x80
> > > > > ret_from_fork_asm+0x1a/0x30
> > > > > =================================
> > > > >
> > > > > Although `block_group->used` is checked again in
> > > > > `peek_discard_list()`, `block_group->used` is a 64-bit variable, so
> > > > > we still consider the data race here unexpected behavior: without
> > > > > annotations the load or store could tear. It is recommended to use
> > > > > `READ_ONCE` and `WRITE_ONCE` for the read and the write.
> > > > >
> > > > > Signed-off-by: Hao-ran Zheng <zhenghaoran154@...il.com>
> > > > > ---
> > > > >  fs/btrfs/block-group.c | 4 ++--
> > > > >  fs/btrfs/discard.c     | 2 +-
> > > > >  2 files changed, 3 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > > > > index c0a8f7d92acc..c681b97f6835 100644
> > > > > --- a/fs/btrfs/block-group.c
> > > > > +++ b/fs/btrfs/block-group.c
> > > > > @@ -3678,7 +3678,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> > > > >  	old_val = cache->used;
> > > > >  	if (alloc) {
> > > > >  		old_val += num_bytes;
> > > > > -		cache->used = old_val;
> > > > > +		WRITE_ONCE(cache->used, old_val);
> > > > >  		cache->reserved -= num_bytes;
> > > > >  		cache->reclaim_mark = 0;
> > > > >  		space_info->bytes_reserved -= num_bytes;
> > > > > @@ -3690,7 +3690,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> > > > >  		spin_unlock(&space_info->lock);
> > > > >  	} else {
> > > > >  		old_val -= num_bytes;
> > > > > -		cache->used = old_val;
> > > > > +		WRITE_ONCE(cache->used, old_val);
> > > > >  		cache->pinned += num_bytes;
> > > > >  		btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
> > > > >  		space_info->bytes_used -= num_bytes;
> > > > > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> > > > > index e815d165cccc..71c57b571d50 100644
> > > > > --- a/fs/btrfs/discard.c
> > > > > +++ b/fs/btrfs/discard.c
> > > > > @@ -363,7 +363,7 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
> > > > >  	if (!block_group || !btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
> > > > >  		return;
> > > > >
> > > > > -	if (block_group->used == 0)
> > > > > +	if (READ_ONCE(block_group->used) == 0)
> > > >
> > > > There are at least 3 more places in discard.c where we access ->used
> > > > without being under the protection of the block group's spinlock.
> > > > So let's fix this for all places and not just a single one...
> > > >
> > > > Also, it's quite ugly to spread READ_ONCE/WRITE_ONCE all over the
> > > > place. What we typically do in btrfs is to add helpers that hide
> > > > them; see block-rsv.h for an example.
> > > >
> > > > Also, I don't think we need READ_ONCE/WRITE_ONCE.
> > > > We could use data_race(), though I think that could be subject to
> > > > load/store tearing, or just take the lock.
> > > > So adding a helper like this to block-group.h:
> > > >
> > > > static inline u64 btrfs_block_group_used(struct btrfs_block_group *bg)
> > > > {
> > > >         u64 ret;
> > > >
> > > >         spin_lock(&bg->lock);
> > > >         ret = bg->used;
> > > >         spin_unlock(&bg->lock);
> > > >
> > > >         return ret;
> > > > }
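
For reference, with such a helper the btrfs_discard_queue_work() hunk
from the patch would read something like this (just a sketch of the
intended usage, not a tested change):

        /* Reads ->used under bg->lock via the helper, no READ_ONCE() needed. */
        if (btrfs_block_group_used(block_group) == 0)
                add_to_discard_unused_list(discard_ctl, block_group);
        else
                add_to_discard_list(discard_ctl, block_group);

The READ_ONCE() then disappears from the call sites and the access
policy lives in a single place.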
> > >
> > > Would memory barriers be sufficient here? Taking a lock just for
> > > reading one member seems excessive...
> >
> > Do you think there's heavy contention on this lock?
>
> This is not about contention. A spin lock should just never be used
> this way. Or any lock, actually. The critical section contains only a
> single fetch operation, which does not justify using a lock.
> Hence the only guarantee such lock usage provides is the implicit
> memory barriers (of which maybe only one is really needed, depending
> on the context where this helper is going to be used).
>
> Simply put, the lock is degraded into a memory barrier this way. So
> why not just use the barriers in the first place, if only ordering
> guarantees are required? It only makes sense.
As said earlier, it's a lot easier to reason about lock() and unlock()
calls than about memory barriers spread across the write and read
sides.
Historically we've had several mistakes with barriers; they're simply
not as straightforward to reason about as locks.
Plus, spreading the barriers across the read and write sides makes the
code harder to read, not to mention that any new site updating the
member must remember to add the matching barrier.
It's just easier to reason about and maintain.
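
For comparison, a barrier-based version of the accessors would look
roughly like this (a sketch only; the names are hypothetical, it
assumes acquire/release ordering is what's actually needed, and
smp_load_acquire()/smp_store_release() on a u64 only build on
architectures where a 64-bit access is a native, single-copy atomic
word):

static inline u64 btrfs_block_group_used(struct btrfs_block_group *bg)
{
        /* Pairs with the release store in btrfs_block_group_set_used(). */
        return smp_load_acquire(&bg->used);
}

static inline void btrfs_block_group_set_used(struct btrfs_block_group *bg,
                                              u64 used)
{
        /* Pairs with the acquire load above. */
        smp_store_release(&bg->used, used);
}

Note that every writer of ->used would have to go through the release
helper for the pairing to hold, which is exactly the kind of thing
that's easy to forget at a new call site.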
>
> > Usually I prefer to go the simplest way, and using locks is a lot more
> > straightforward and easier to understand than memory barriers.
>
> How so? Locks provide the same memory barriers implicitly.
I'm not saying they don't.
I'm talking about easier code to read and maintain.
>
> > Unless it's clear there's an observable performance penalty, keeping
> > it simple is better.
>
> Exactly. Here is where our opinions differ. For me 'simple' means
> without useless locking.
> Locks especially should only be used when absolutely necessary, in
> the sense of 'use the right tool for the job'. More lightweight tools
> are available in this context. Locks provide additional guarantees
> which are not needed here.
>
> On the other hand, I understand that using a lock is a 'better safe
> than sorry' approach which should also work. Just keep in mind that a
> spinlock may sleep on RT.
It's not about being better safe than sorry, but about being easier to
read, reason about and maintain.
>
> > data_race() may be ok here; at least for one of the unprotected
> > accesses it's fine, but we'd have to analyse the other 3 cases.
>
> Those are exactly my thoughts. Maybe not even the barriers are needed
> after all. This needs to be checked on a per-case basis.
>
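
To make the data_race() option concrete, the call site would become
something like this (a sketch; data_race() only tells KCSAN the race
is intentional, it does not prevent load tearing, so it relies on the
value being re-checked later, as peek_discard_list() does):

        /* Lockless, possibly torn read; re-validated before discarding. */
        if (data_race(block_group->used) == 0)
                add_to_discard_unused_list(discard_ctl, block_group);
        else
                add_to_discard_list(discard_ctl, block_group);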
> > >
> > > > And then use btrfs_block_group_used() everywhere in discard.c
> > > > where we aren't holding a block group's lock.
> > > >
> > > > Thanks.
> > > >
> > > >
> > > > >  		add_to_discard_unused_list(discard_ctl, block_group);
> > > > >  	else
> > > > >  		add_to_discard_list(discard_ctl, block_group);
> > > > > --
> > > > > 2.34.1
> > > > >
> > > > >
> > > >