Message-ID: <e2dgjtqvqjapir5xizb5ixkilhzr7fm7m7ymxzk6ixzdbwxjjs@24n4nzolye77>
Date: Mon, 30 Jun 2025 18:32:54 +0200
From: Jan Kara <jack@...e.cz>
To: Baokun Li <libaokun1@...wei.com>
Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org, tytso@....edu,
adilger.kernel@...ger.ca, ojaswin@...ux.ibm.com, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update
s_mb_last_group
On Mon 30-06-25 17:21:48, Baokun Li wrote:
> On 2025/6/30 15:47, Jan Kara wrote:
> > On Mon 30-06-25 11:48:20, Baokun Li wrote:
> > > On 2025/6/28 2:19, Jan Kara wrote:
> > > > On Mon 23-06-25 15:32:51, Baokun Li wrote:
> > > > > After we optimized the block group lock, we found another lock
> > > > > contention issue when running will-it-scale/fallocate2 with multiple
> > > > > processes. The fallocate's block allocation and the truncate's block
> > > > > release were fighting over the s_md_lock. The problem is, this lock
> > > > > protects totally different things in those two processes: the list of
> > > > > freed data blocks (s_freed_data_list) when releasing, and where to start
> > > > > looking for new blocks (mb_last_group) when allocating.
> > > > >
> > > > > Now we only need to track s_mb_last_group and no longer need to track
> > > > > s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> > > > > two are consistent, and we can ensure that the s_mb_last_group read is up
> > > > > to date by using smp_store_release/smp_load_acquire.
> > > > >
> > > > > Besides, the s_mb_last_group data type only requires ext4_group_t
> > > > > (i.e., unsigned int), rendering unsigned long superfluous.
> > > > >
> > > > > Performance test data follows:
> > > > >
> > > > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > > > Observation: Average fallocate operations per container per second.
> > > > >
> > > > >                     | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96 |
> > > > > Disk: 960GB SSD     |-------------------------|-------------------------|
> > > > >                     | base  | patched         | base  | patched         |
> > > > > --------------------|-------|-----------------|-------|-----------------|
> > > > > mb_optimize_scan=0  | 4821  | 7612 (+57.8%)   | 15371 | 21647 (+40.8%)  |
> > > > > mb_optimize_scan=1  | 4784  | 7568 (+58.1%)   | 6101  | 9117 (+49.4%)   |
> > > > >
> > > > > Signed-off-by: Baokun Li <libaokun1@...wei.com>
> > > > ...
> > > >
> > > > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > > > index 5cdae3bda072..3f103919868b 100644
> > > > > --- a/fs/ext4/mballoc.c
> > > > > +++ b/fs/ext4/mballoc.c
> > > > > @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> > > > > ac->ac_buddy_folio = e4b->bd_buddy_folio;
> > > > > folio_get(ac->ac_buddy_folio);
> > > > > /* store last allocated for subsequent stream allocation */
> > > > > - if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> > > > > - spin_lock(&sbi->s_md_lock);
> > > > > - sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> > > > > - spin_unlock(&sbi->s_md_lock);
> > > > > - }
> > > > > + if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> > > > > + /* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> > > > > + smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
> > > > Do you really need any kind of barrier (implied by smp_store_release())
> > > > here? I mean the store to s_mb_last_group is perfectly fine to be reordered
> > > > with other accesses from the thread, isn't it? As such it should be enough
> > > > to have WRITE_ONCE() here...
> > > WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
> > > that variable reads/writes access values directly from L1/L2 cache rather
> > > than registers.
> > I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
> > particular they force the compiler to read / write the memory location
> > exactly once instead of reading it potentially multiple times in different
> > parts of expression and getting inconsistent values, or possibly writing
> > the value say byte by byte (yes, that would be insane but not contrary to
> > the C standard).
> READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
> two main purposes:
>
> 1. It tells the compiler that the variable's value can change unexpectedly,
> preventing the compiler from making incorrect optimizations based on
> assumptions about its stability.
>
> 2. It ensures the CPU directly reads from or writes to the variable's
> memory address. This means the value will be fetched from cache (L1/L2)
> if available, or from main memory otherwise, rather than using a stale
> value from a CPU register.
Yes, we agree on this.
> > > They do not guarantee that other CPUs see the latest values. Reading stale
> > > values could lead to more useless traversals, which might incur higher
> > > overhead than memory barriers. This is why we use memory barriers to ensure
> > > the latest values are read.
> > But smp_load_acquire() / smp_store_release() have no guarantee about CPU
> > seeing latest values either. They are just speculation barriers meaning
> > they prevent the CPU from reordering accesses in the code after
> > smp_load_acquire() to be performed before the smp_load_acquire() is
> > executed and similarly with smp_store_release(). So I dare to say that
> > these barriers have no (positive) impact on the allocation performance and
> > just complicate the code - but if you have some data that show otherwise,
> > I'd be happy to be proven wrong.
> smp_load_acquire() / smp_store_release() guarantee that CPUs read the
> latest data.
>
> For example, imagine a variable a = 0, with both CPU0 and CPU1 having
> a=0 in their caches.
>
> Without a memory barrier:
> When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
> an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
> a=1 is written to main memory and becomes visible to other CPUs.
> Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
> its invalidation queue. However, it might not process it immediately;
> instead, it could perform the read first, potentially still reading a=0
> from its cache.
>
> With a memory barrier:
> When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
> the store buffer, but data in the store buffer is also written to main
> memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
>
> When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
> to its invalidation queue. Here, the invalidation queue is flushed, which
> invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
> performs the read, its cache is invalid, so it reads the latest a=1 from
> main memory.
Well, here I think you assume way more about the CPU architecture than is
generally true (and I didn't find what you write above guaranteed by either
the x86 or the arm64 CPU documentation). Generally I follow the guarantees
defined in Documentation/memory-barriers.txt, and there you can only argue
about the order of effects as observed by different CPUs, not really about
when content is fetched to / from CPU caches.
BTW on x86 in particular smp_load_acquire() and smp_store_release() aren't
very different from pure READ_ONCE() / WRITE_ONCE():
arch/x86/include/asm/barrier.h:
#define __smp_store_release(p, v)			\
do {							\
	compiletime_assert_atomic_type(*p);		\
	barrier();					\
	WRITE_ONCE(*p, v);				\
} while (0)

#define __smp_load_acquire(p)				\
({							\
	typeof(*p) ___p1 = READ_ONCE(*p);		\
	compiletime_assert_atomic_type(*p);		\
	barrier();					\
	___p1;						\
})
where barrier() is just a compiler barrier - i.e., it prevents the compiler
from reordering accesses around this point. This is because x86 is strongly
ordered and the CPU can only reorder loads before earlier stores.

TL;DR: on x86 there's no practical difference between using READ_ONCE() /
WRITE_ONCE() and smp_load_acquire() / smp_store_release() in your code. So I
still think using plain READ_ONCE() / WRITE_ONCE() will be clearer, and I'd
be curious whether you can see any performance impact from using READ_ONCE()
/ WRITE_ONCE() instead of smp_load_acquire() / smp_store_release().
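Just to make the suggestion concrete, here is an untested sketch of the hunk
above using plain WRITE_ONCE() (the reader side is my guess at how the
pairing in ext4_mb_regular_allocator() would look, not code from your patch):

	/* store last allocated for subsequent stream allocation */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		/* pairs with READ_ONCE() in ext4_mb_regular_allocator() */
		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

and on the reader side:

	/* hypothetical reader side picking the goal group for stream allocation */
	ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);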
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR