Message-ID: <CAPjX3Ffy2=CQP2mx9Wa3BBR54fEAcuo8ADqeTVdcAmCO7g+gmg@mail.gmail.com>
Date: Fri, 2 May 2025 14:03:55 +0200
From: Daniel Vacek <neelx@...e.com>
To: dsterba@...e.cz
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>,
linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] btrfs: remove extent buffer's redundant `len` member field
On Fri, 2 May 2025 at 12:56, David Sterba <dsterba@...e.cz> wrote:
>
> On Wed, Apr 30, 2025 at 04:13:20PM +0200, Daniel Vacek wrote:
> > On Wed, 30 Apr 2025 at 15:30, David Sterba <dsterba@...e.cz> wrote:
> > >
> > > On Wed, Apr 30, 2025 at 10:21:18AM +0200, Daniel Vacek wrote:
> > > > > The benefit of duplicating the length in each eb is that it's in the
> > > > > same cacheline as the other members that are used for offset
> > > > > calculations or bit manipulations.
> > > > >
> > > > > Going to the fs_info->nodesize may or may not hit a cache, also because
> > > > > it needs to do 2 pointer dereferences, so from that perspective I think
> > > > > it's making it worse.
> > > >
> > > > I was considering that. Since fs_info is shared by all ebs and
> > > > other stuff like transactions, etc., I think the cacheline is hot
> > > > most of the time and hardly any performance difference will be
> > > > observable. Though without benchmarks this is just speculation (on
> > > > both sides).
> > >
> > > The comparison is between "always access 1 cacheline" and "hope that the
> > > other cacheline is hot", yeah we don't have benchmarks for that but the
> > > first access pattern is not conditional.
> >
> > That's quite right. Though in many places we already have fs_info
> > anyway, so it's rather accessing a cacheline in eb vs. accessing a
> > cacheline in fs_info. In the former case the memory is likely hot due
> > to accessing surrounding members anyway, while in the latter case it's
> > hopefully hot as fs_info is a heavily shared resource accessed when
> > processing other ebs or transactions.
> > But yeah, in some places we don't have the fs_info pointer yet and two
> > accesses are still needed.
>
> The fs_info got added to eb because it used to be passed as a parameter
> to many functions.
Makes sense.
> > In theory fs_info could be shuffled to move nodesize to the same
> > cacheline with buffer_tree. Would that feel better to you?
>
> We'd get conflicting requirements for ordering in fs_info. Right now
> the nodesize/sectorsize/... are in one cacheline in fs_info and they're
> often used together in many functions. Reordering it to fit eb usage
> pattern may work but I'm not convinced we need it.
I agree.
> > > > > I don't think we need to do the optimization right now, but maybe in the
> > > > > future if there's a need to add something to eb. Still we can use the
> > > > > remaining 16 bytes up to 256 without making things worse.
> > > >
> > > > This really depends on configuration. On my laptop (Debian -rt kernel)
> > > > the eb struct is actually 272 bytes as the rt_mutex is significantly
> > > > heavier than a raw spinlock. And -rt is a first-class citizen nowadays,
> > > > often used in Kubernetes deployments like 5G RAN telco, dpdk and such.
> > > > I think it would be nice to slim the struct below 256 bytes even there
> > > > if that's your aim.
> > >
> > > I configured and built an RT kernel to see if it's possible to get
> > > to 256 bytes on RT, and it seems yes, at the big sacrifice of
> > > removing several struct members that cache values like folio_size
> > > or folio_shift, generating worse code.
> > >
> > > As 272 is a multiple of 16 it's a reasonable size and we don't need to
> > > optimize further. The number of ebs in one slab is 30, with the non-rt
> > > build it's 34, which sounds OK.
> >
> > That sounds fair. Well the 256 bytes were your argument in the first place.
>
> Yeah, 256 is a nice number because it aligns with cachelines on multiple
> architectures, this is useful for splitting the structure to the "data
> accessed together" and locking/refcounting. It's a tentative goal, we
> used to have a larger eb size due to our own locking implementation but
> with rwsems it got close to/under 256.
>
> The current size 240 is shifted by 1/4 of a cacheline so it's not all
> clean, but we have some wiggle room for adding new members or cached values,
> like folio_size/folio_shift/addr.
Sounds like we could force-align the struct to a cacheline or explicitly
pad it to 256B? The latter could be a bit tricky though.
> >
> > Still, with this:
> >
> > --- a/fs/btrfs/extent_io.h
> > +++ b/fs/btrfs/extent_io.h
> > @@ -82,7 +82,10 @@ void __cold extent_buffer_free_cachep(void);
> > struct extent_buffer {
> > u64 start;
> > u32 folio_size;
> > - unsigned long bflags;
> > + u8 folio_shift;
> > + /* >= 0 if eb belongs to a log tree, -1 otherwise */
> > + s8 log_index;
> > + unsigned short bflags;
>
> This does not compile because of set_bit/clear_bit/wait_on_bit API
> requirements.
Yeah, I realized that when I tried to implement it. It was just a rough
idea when I sent the email.
> > struct btrfs_fs_info *fs_info;
> >
> > /*
> > @@ -94,9 +97,6 @@ struct extent_buffer {
> > spinlock_t refs_lock;
> > atomic_t refs;
> > int read_mirror;
> > - /* >= 0 if eb belongs to a log tree, -1 otherwise */
> > - s8 log_index;
> > - u8 folio_shift;
> > struct rcu_head rcu_head;
> >
> > struct rw_semaphore lock;
> >
> > you're down to 256 even on -rt. And the great part is I don't see any
> > sacrifices (other than accessing a cacheline in fs_info). We're only
> > using 8 flags now, so there is still some room left for another 8 if
> > needed in the future.
>
> Which means that the size on non-rt would be something like 228, roughly
> calculating the savings and the increase due to spinlock_t going from
> 4 -> 32 bytes. Also I'd like to see the generated assembly after the
> suggested reordering.
If I see it correctly, the non-rt layout will not change when I keep
bflags as unsigned long. The -rt build goes down to 264 bytes. That's a
bit better for free but still not ideal from the alignment POV.
> The eb may not be perfect, I think there could be false sharing of
> refs_lock and refs but this is a wild guess and based only on code
refs_lock and refs look like they share the same cacheline in every
case. At least on x86.
But still, the slab object is not cacheline aligned in the first place.
Luckily the two fields travel together.
Out of curiosity, is there any past experience where this kind of
optimization makes a difference within filesystem code?
I can imagine that perhaps for fast devices like NVDIMM or DAX the CPU
may become the bottleneck? Or are NVMe devices nowadays already fast
enough to saturate the CPU?
I faintly recall one issue where I debugged a CPU which could not keep
up with handling the completion interrupts for NVMe IO submitted by
other CPUs. Though that was on xfs (or maybe ext4), not on btrfs. But
that was a case of one CPU against the rest, as the interrupts were not
balanced or spread across more CPUs.
> observation. You may have more luck with other data structures with
> unnecessary holes but please optimize for non-RT first.
Clearly non-rt is the default and most important. No questions here.