Message-ID: <20250502105630.GO9140@suse.cz>
Date: Fri, 2 May 2025 12:56:30 +0200
From: David Sterba <dsterba@...e.cz>
To: Daniel Vacek <neelx@...e.com>
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
	David Sterba <dsterba@...e.com>, linux-btrfs@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] btrfs: remove extent buffer's redundant `len` member field

On Wed, Apr 30, 2025 at 04:13:20PM +0200, Daniel Vacek wrote:
> On Wed, 30 Apr 2025 at 15:30, David Sterba <dsterba@...e.cz> wrote:
> >
> > On Wed, Apr 30, 2025 at 10:21:18AM +0200, Daniel Vacek wrote:
> > > > The benefit of duplicating the length in each eb is that it's in the
> > > > same cacheline as the other members that are used for offset
> > > > calculations or bit manipulations.
> > > >
> > > > Going to the fs_info->nodesize may or may not hit a cache, also because
> > > > it needs to do 2 pointer dereferences, so from that perspective I think
> > > > it's making it worse.
> > >
> > > I was considering that. Since fs_info is shared for all ebs and other
> > > stuff like transactions, etc. I think the cache is hot most of the
> > > time and there will be hardly any performance difference observable.
> > > Though without benchmarks this is just a speculation (on both sides).
> >
> > The comparison is between "always access 1 cacheline" and "hope that the
> > other cacheline is hot", yeah we don't have benchmarks for that but the
> > first access pattern is not conditional.
> 
> That's quite right. Though in many places we already have fs_info
> anyway, so it's rather accessing a cacheline in eb vs. accessing a
> cacheline in fs_info. In the former case the memory is likely hot due
> to accessing surrounding members anyway, while in the latter case it's
> hopefully hot as fs_info is a heavily shared resource accessed when
> processing other ebs or transactions.
> But yeah, in some places we don't have the fs_info pointer yet and two
> accesses are still needed.

The fs_info pointer got added to eb because it used to be passed as a
parameter to many functions.

> In theory fs_info could be shuffled to move nodesize to the same
> cacheline with buffer_tree. Would that feel better to you?

We'd get conflicting requirements for ordering in fs_info. Right now
nodesize/sectorsize/... are in one cacheline in fs_info and they're
often used together in many functions. Reordering it to fit the eb usage
pattern may work but I'm not convinced we need it.

> > > > I don't think we need to do the optimization right now, but maybe in the
> > > > future if there's a need to add something to eb. Still we can use the
> > > > remaining 16 bytes up to 256 without making things worse.
> > >
> > > This really depends on configuration. On my laptop (Debian -rt kernel)
> > > the eb struct is actually 272 bytes as the rt_mutex is significantly
> > > heavier than raw spin lock. And -rt is a first class citizen nowadays,
> > > often used in Kubernetes deployments like 5G RAN telco, dpdk and such.
> > > I think it would be nice to slim the struct below 256 bytes even there
> > > if that's your aim.
> >
> > I configured and built RT kernel to see if it's possible to go to 256
> > bytes on RT and it seems yes with a big sacrifice of removing several
> > struct members that cache values like folio_size or folio_shift and
> > generating worse code.
> >
> > As 272 is a multiple of 16 it's a reasonable size and we don't need to
> > optimize further. The number of ebs in one slab is 30, with the non-rt
> > build it's 34, which sounds OK.
> 
> That sounds fair. Well the 256 bytes were your argument in the first place.

Yeah, 256 is a nice number because it aligns with cachelines on multiple
architectures, which is useful for splitting the structure into "data
accessed together" and locking/refcounting parts. It's a tentative goal;
we used to have a larger eb size due to our own locking implementation,
but with rwsems it got close to (or under) 256.

The current size of 240 is shifted by a quarter of a cacheline, so it's
not all clean, but we have some wiggle room for adding new members or
cached values like folio_size/folio_shift/addr.


> 
> Still, with this:
> 
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -82,7 +82,10 @@ void __cold extent_buffer_free_cachep(void);
>  struct extent_buffer {
>         u64 start;
>         u32 folio_size;
> -       unsigned long bflags;
> +       u8 folio_shift;
> +       /* >= 0 if eb belongs to a log tree, -1 otherwise */
> +       s8 log_index;
> +       unsigned short bflags;

This does not compile because of set_bit/clear_bit/wait_on_bit API
requirements.

>         struct btrfs_fs_info *fs_info;
> 
>         /*
> @@ -94,9 +97,6 @@ struct extent_buffer {
>         spinlock_t refs_lock;
>         atomic_t refs;
>         int read_mirror;
> -       /* >= 0 if eb belongs to a log tree, -1 otherwise */
> -       s8 log_index;
> -       u8 folio_shift;
>         struct rcu_head rcu_head;
> 
>         struct rw_semaphore lock;
> 
> you're down to 256 even on -rt. And the great part is I don't see any
> sacrifices (other than accessing a cacheline in fs_info). We're only
> using 8 flags now, so there is still some room left for another 8 if
> needed in the future.

Which means that the size on non-rt would be something like 228, roughly
calculating the savings and the increase due to spinlock_t going from
4 to 32 bytes. Also I'd like to see the generated assembly after the
suggested reordering.

The eb may not be perfect, I think there could be false sharing between
refs_lock and refs, but this is a wild guess based only on reading the
code. You may have more luck with other data structures that have
unnecessary holes, but please optimize for non-RT first.
