Message-ID: <CAPjX3FdNLjLGQ+0oyZgYOHAy0QxkEcycr86WQt9UyiLjXw5uJA@mail.gmail.com>
Date: Tue, 13 May 2025 12:43:03 +0200
From: Daniel Vacek <neelx@...e.com>
To: dsterba@...e.cz
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>,
Boris Burkov <boris@....io>, linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/2] btrfs: remove extent buffer's redundant `len`
member field
On Tue, 13 May 2025 at 02:32, David Sterba <dsterba@...e.cz> wrote:
>
> On Mon, May 05, 2025 at 07:53:16PM +0200, Daniel Vacek wrote:
> > On Mon, 5 May 2025 at 17:18, David Sterba <dsterba@...e.cz> wrote:
> > >
> > > On Mon, May 05, 2025 at 01:50:54PM +0200, Daniel Vacek wrote:
> > > > Even the super block nowadays uses the nodesize for eb->len. This is since commits
> > > >
> > > > 551561c34663 ("btrfs: don't pass nodesize to __alloc_extent_buffer()")
> > > > da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
> > > > ce3e69847e3e ("btrfs: sink parameter len to alloc_extent_buffer")
> > > > a83fffb75d09 ("btrfs: sink blocksize parameter to btrfs_find_create_tree_block")
> > > >
> > > > With these the eb->len is not really useful anymore. Let's use the nodesize
> > > > directly where applicable.
> > >
> > > You haven't updated the changelog even though we had a discussion about the
> > > potential drawbacks, so this should be mentioned here. But I'm not convinced this
> >
> > Right. That's because I was not sure we had come to any conclusion yet. I
> > thought this discussion was still ongoing. I understand that so far
> > this is still all just a theory and any premature conclusions may be
> > misleading.
> >
> > But yeah, I may write some kind of a warning or a disclaimer
> > mentioning the suspicion. Just so that it gets documented and it is
> > clear it was considered, even though maybe without a clear conclusion.
> >
> > > is a good change. The eb size does not change so no better packing in
> > > the slab and the caching of length in the same cacheline is lost.
> >
> > So to be perfectly clear, what sharing do you mean? Is it the
> > eb->start and eb->len you are talking about? In other words, when getting
> > `start` you also get `len` for free?
>
> Yes, basically.
>
> > Since the structure is only 8-byte aligned, they could in principle end
> > up in two cachelines. Luckily the size of the structure is 0 mod 16, so
> > just out of luck this never happens and they are always in the
> > same cache line. But this may break with future changes, so it is not
> > rock solid the way it is now.
>
> Yes, with structures not perfectly aligned to a cacheline (64B) there
> will be times when the access needs fetching two cachelines. With locks
> it can result in various cacheline bouncing patterns when various
> allocated structures are aligned to 8 bytes, giving 8 possible groups of
> "misaligned" structure instances.
>
> There's also a big assumption that the CPU cache prefetcher is clever
> enough to average out many bad patterns. What I try to do on the source
> code level is to be able to reason about the high level access patterns,
> like "used at the same time -> close together in the structure".
That makes sense. The closer they are, the higher the chance they end
up in the same cacheline even when the object itself is not cacheline-aligned.
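Just to spell out the source-level pattern in question (a rough sketch;
the extra hop is what your assembly example below shows):

	/* today: the length is cached right next to eb->start */
	u32 len = eb->len;

	/* with the patch: one more pointer dereference through fs_info */
	u32 len = eb->fs_info->nodesize;
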
> > > In the assembly it's clear where the pointer dereference is added, using
> > > an example from btrfs_get_token_8():
> > >
> > > mov 0x8(%rbp),%r9d
> > >
> > > vs
> > >
> > > mov 0x18(%rbp),%r10
> > > mov 0xd38(%r10),%r9d
> >
> > I understand that. Again, this is what I originally considered. Not
> > all the functions end up like this but there are definitely some. And
> > as a rule of thumb it's roughly half of them, give or take. That
> > sounds like a good reason to be concerned.
> >
> > My reasoning was that fs_info->nodesize is accessed from many places, so
> > chances are it will be hot in cache. But you're right that this may
> > not always be the case. It depends. The question remains whether that
> > makes a difference.
> >
> > Another (IMO valid) point is that I believe the code will dereference
> > many other pointers before getting here, so this may seem like a drop
> > in the ocean. It's not like this was a tight loop scattering over
> > hundreds of random memory addresses.
> > For example when this code is reached from a syscall, the syscall
> > itself will have significantly higher overhead than one additional
> > dereference. And I think the same applies when reached from an
> > interrupt.
> > Otherwise this would be visible in a perf profile (which none of us
> > has done yet).
>
> Yeah and I don't intend to do so other than to verify your measurements
> and claims that this patch at least does not make things worse. The
> burden is on the patch submitter.
Totally. I'll run some profiles to see if anything significant is
visible. I'm not sure which workload to use, though. I guess fio would be
a good starting point?
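Maybe something along these lines, just as a first stab (the mount point
and sizes are made up for illustration, and something more metadata-heavy
like fs_mark may be more relevant for the eb accessors):

	fio --name=randwrite --directory=/mnt/btrfs-test --rw=randwrite \
	    --bs=4k --size=4G --numjobs=8 --ioengine=libaio --iodepth=32 \
	    --runtime=300 --time_based --group_reporting
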
Though we agreed that this also depends on the actual HW used.
Whatever numbers I present, if my CPU is fast enough and my storage is
slow enough, I'll hardly ever find any issue. I think this needs broader
testing than just my own claim that it is OK on my machine (provided my
results suggest so).
So how about pushing this to misc-next to get some broader testing and
seeing if any issues show up at a bigger scale?
> > Still, I'd say reading the assembly is a good indication of possible
> > issues to be concerned with. But it's not always the best one. An
> > observed performance regression would be.
> > I'll be happy to conduct any suggested benchmarks. Though as you
> > mentioned, this may also be sensitive to the HW used.
>
> I'd say any contemporary hardware is good, I don't think the internal
> CPU algorithms for prefetching or branch prediction change that often. A
> difference that I would consider significant is 5% in either direction.
I meant the storage part rather than the CPU. Though the CPUs also
differ in cache sizes between consumer and server parts.
> > So even though we
> > don't see any regressions with our testing, one day someone may find
> > some if we merge this change. In which case we can always revert.
> >
> > I have to say I'm pretty happy with the positive feedback from the
> > other reviewers. So far you're the only one raising this concern.
>
> I don't dispute the feedback from others, the patch is not wrong in
> itself, it removes a redundant member. However I don't see it as a
> simple change, because I have also spent some time looking into that
> particular structure (shrinking its size, member ordering and access
> patterns), which is why I am questioning the runtime performance
> implications.
With that experience, would you suggest any other tests than fio where
a performance change could be visible? Something I can focus on?
> My local eb->len removal patch is from 2020 and I decided not to send it
> because I was not able to prove to myself it'd be for the better. This
> is what I base my feedback on.
I understand. Let's see what we can do about it.
> > So how do we conclude this?
>
> I don't agree to add this patch due to the following main points:
>
> - no struct size change, ie. same number of objects in the slab
Right. That makes sense, as the original idea was about code cleanup
and not about reducing the struct size in terms of bytes.
I did not think a cleanup needed to be justified by reducing the struct
size. Even without that I believe it's still a good cleanup.
The struct is still reduced by one member (leaving a new hole).
Moreover, with the follow-up patch the size is actually reduced, at least
for -rt configs, by filling the new hole so that the later hole is not
created at all. That saves 8 bytes on -rt. Not much and maybe not
worth it. Or is it?
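(FWIW, the holes and sizes are easy to double-check with pahole on a
debug-info build, e.g. something like:

	pahole -C extent_buffer fs/btrfs/extent_io.o

The object file is just an example; any object carrying the type in its
debug info will do.)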
That said, this all assumes the removed member was not meant to cache
the fs-wide value for performance reasons. I did not see any
documentation or code comments suggesting that's the case (even you are
not able to tell for sure), so I thought it should be fine to remove the
eb->len field.
Which brings us to the point below.
> - disputed cacheline behaviour, numbers showing an improvement or at least
>   not making it worse are needed
I'll look into this. As I said, any hints are welcome; otherwise I'll
start with fio and compare the perf profiles of the functions
which access the data. Please let me know if there is anything better I
could do than that.
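Concretely I had something like this in mind (just a sketch, the exact
events and symbols to look at would need refining):

	perf record -a -g -- <the fio run from above>
	perf report --no-children
	perf annotate btrfs_get_token_8   # or whichever accessor shows up hot
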
> > If you don't want this cleanup I'd opt at least for renaming eb->len
> > ==>> eb->nodesize.
>
> This sounds like unnecessary churn. 'length' related to a structure
> is natural, nodesize is a property of the whole filesystem, we're used
> to eb->len, I don't see any benefit in increasing the cognitive load.
Now, our points of view differ here. Someone new to the codebase has to
carry around the additional knowledge that eb->len actually means the
nodesize. In other words, without digging that detail up the code just
hides the fact.
Since it's always set to the nodesize, it does not really mean 'length'
as in an arbitrary length, like 12345 bytes.
And that's what I was trying to clean up. Does that make sense?
Also that's why as an alternative I suggested renaming the member if
removing it is not really favorable.
And I thought it could also have implications in some places, once you
realize that what's being used in an expression as eb->len really
means the fs-wide nodesize.
After all, with this insight I realized that the buffer_tree is
inefficient, as we're using it really sparsely and wasting slots.
Since there are no shorter ebs, those empty slots can never be filled.
So using the right name actually makes sense, IMO, for code clarity.
But I can also understand your point and I can get used to it. It just
does not feel right to me, but it's not a big deal. I just mentioned it
to learn others' POV and to explain the intention one way or the other.