linux-kernel - Re: [PATCH v3 1/2] btrfs: remove extent buffer's redundant `len` member field

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250513003200.GZ9140@twin.jikos.cz>
Date: Tue, 13 May 2025 02:32:00 +0200
From: David Sterba <dsterba@...e.cz>
To: Daniel Vacek <neelx@...e.com>
Cc: dsterba@...e.cz, Chris Mason <clm@...com>,
	Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>,
	Boris Burkov <boris@....io>, linux-btrfs@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/2] btrfs: remove extent buffer's redundant `len`
 member field

On Mon, May 05, 2025 at 07:53:16PM +0200, Daniel Vacek wrote:
> On Mon, 5 May 2025 at 17:18, David Sterba <dsterba@...e.cz> wrote:
> >
> > On Mon, May 05, 2025 at 01:50:54PM +0200, Daniel Vacek wrote:
> > > Even super block nowadays uses nodesize for eb->len. This is since commits
> > >
> > > 551561c34663 ("btrfs: don't pass nodesize to __alloc_extent_buffer()")
> > > da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
> > > ce3e69847e3e ("btrfs: sink parameter len to alloc_extent_buffer")
> > > a83fffb75d09 ("btrfs: sink blocksize parameter to btrfs_find_create_tree_block")
> > >
> > > With these the eb->len is not really useful anymore. Let's use the nodesize
> > > directly where applicable.
> >
> > You haven't updated the changelog despite we had a discussion about the
> > potential drawbacks, so this should be here. But I'm not convinced this
> 
> Right. That's because I was not sure we came to any conclusion yet. I
> thought this discussion was still ongoing. I understand that so far
> this is still all just a theory and any premature conclusions may be
> misleading.
> 
> But yeah, I may write some kind of a warning or a disclaimer
> mentioning the suspicion. Just so that it gets documented and it is
> clear it was considered, even though maybe without a clear conclusion.
> 
> > is a good change. The eb size does not change so no better packing in
> > the slab and the caching of length in the same cacheline is lost.
> 
> So to be perfectly clear, what sharing do you mean? Is it the
> eb->start and eb->len you talking about? In other words, when getting
> `start` you also get `len` for free?

Yes, basically.

> Since the structure is 8 bytes aligned, they may eventually still end
> up in two cachelines. Luckily the size of the structure is 0 mod 16 so
> just out of the luck this never happens and they are always in the
> same cache line. But this may break with future changes, so it is not
> rock solid the way it is now.

Yes, with structures not perfectly aligned to a cacheline (64B) there
will be times when the access needs fetching two cachelines. With locks
it can result in various cacheline bouncing patterns when various
allocated structures are aligned to 8 bytes, giving 8 possible groups of
"misaligned" structure instances.

There's also a big assumption that the CPU cache prefetcher is clever
enough to average out many bad patterns. What I try to do on the source
code level is to be able to reason about the high level access patterns,
like "used at the same time -> close together in the structure".

> > In the assebly it's clear where the pointer dereference is added, using
> > an example from btrfs_get_token_8():
> >
> >   mov    0x8(%rbp),%r9d
> >
> > vs
> >
> >   mov    0x18(%rbp),%r10
> >   mov    0xd38(%r10),%r9d
> 
> I understand that. Again, this is what I originally considered. Not
> all the functions end up like this but there are definitely some. And
> by a rule of a thumb it's roughly half of them, give or take. That
> sounds like a good reason to be concerned.
> 
> My reasoning was that the fs_info->nodesize is accessed by many so the
> chances are it will be hot in cache. But you're right that this may
> not always be the case. It depends. The question remains if that makes
> a difference?
> 
> Another (IMO valid) point is that I believe the code will dereference
> many other pointers before getting here so this may seem like a drop
> in the sea. It's not like this was a tight loop scattering over
> hundreds random memory addresses.
> For example when this code is reached from a syscall, the syscall
> itself will have significantly higher overhead then one additional
> dereference. And I think the same applies when reached from an
> interrupt.
> Otherwise this would be visible on perf profile (which none of us
> performed yet).

Yeah and I don't intend to do so other than to verify your measurements
and calims that this patch does not make things worse at least. The
burden is on the patch submitter.

> Still, I'd say reading the assembly is a good indication of possible
> issues to be concerned with. But it's not always the best one. An
> observed performance regression would be.
> I'll be happy to conduct any suggested benchmarks. Though as you
> mentioned this may be also picky on the used HW.

I'd say any contemporary hardware is good, I don't think the internal
CPU algorithms of prefetching or branch predictions change that often. A
difference that I would consider significant is 5% in either direction.

> So even though we
> don't see any regressions with our testing, one day someone may find
> some if we merged this change. In which case we can always revert.
> 
> I have to say I'm pretty happy with the positive feedback from the
> other reviewers. So far you're the only one raising this concern.

I don't dispute the feedback from others, the patch is not wrong on
itself, it removes a redundant member. However I don't see it as a
simple change because I also spent some time looking into that
particular structure, shrinking size, member ordering and access
patterns that I am questioning the runtime performance implications.

My local eb->len removal patch is from 2020 and I decided not to send it
because I was not able to prove to myself it'd be for the better. This
is what I base my feedback on.

> So how do we conclude this?

I don't agree to add this patch due to the following main points:

- no struct size change, ie. same number of objects in the slab
- disputed cacheline behaviour, numbers showing improvement or not
  making it worse needed

> If you don't want this cleanup I'd opt at least for rename eb->len
> ==>> eb->nodesize.

This sounds like an unnecessary churn. 'length' related to a structure
is natural, nodesize is a property of the whole filesystem, we're used
to eb->len, I don't any benefit to increase the cognitive load.