Message-ID: <CAPjX3FdexSywSbJQfrj5pazrBRyVns3SdRCsw1VmvhrJv20bvw@mail.gmail.com>
Date: Wed, 30 Apr 2025 16:13:20 +0200
From: Daniel Vacek <neelx@...e.com>
To: dsterba@...e.cz
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>, 
	linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] btrfs: remove extent buffer's redundant `len` member field

On Wed, 30 Apr 2025 at 15:30, David Sterba <dsterba@...e.cz> wrote:
>
> On Wed, Apr 30, 2025 at 10:21:18AM +0200, Daniel Vacek wrote:
> > > The benefit of duplicating the length in each eb is that it's in the
> > > same cacheline as the other members that are used for offset
> > > calculations or bit manipulations.
> > >
> > > Going to fs_info->nodesize may or may not hit a cache, also because
> > > it needs two pointer dereferences, so from that perspective I think
> > > it's making things worse.
> >
> > I was considering that. Since fs_info is shared by all ebs and other
> > stuff like transactions, etc., I think the cache is hot most of the
> > time and there will be hardly any observable performance difference.
> > Though without benchmarks this is just speculation (on both sides).
>
> The comparison is between "always access 1 cacheline" and "hope that the
> other cacheline is hot". Yeah, we don't have benchmarks for that, but the
> first access pattern is not conditional.

That's quite right. Though in many places we already have fs_info
anyway, so it's rather accessing a cacheline in eb vs. accessing a
cacheline in fs_info. In the former case the memory is likely hot due
to accessing the surrounding members anyway, while in the latter case
it's hopefully hot as well, since fs_info is a heavily shared resource
touched when processing other ebs or transactions.
But yeah, in some places we don't have the fs_info pointer yet, and
two accesses are still needed.
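
For reference, the two access patterns being compared boil down to
this (just a sketch; `len` is the member this patch removes):

	/* before the patch: one load from the eb's first cacheline,
	 * shared with the members used for the offset math */
	u32 len = eb->len;

	/* after the patch: two dependent loads (eb->fs_info, then
	 * fs_info->nodesize), possibly touching a second cacheline */
	u32 len = eb->fs_info->nodesize;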

In theory fs_info could be shuffled to move nodesize into the same
cacheline as buffer_tree. Would that feel better to you?
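
Something along these lines could even pin that down at build time
(purely illustrative, not part of the patch):

	/* hypothetical check that nodesize and buffer_tree end up in
	 * the same cacheline after such a shuffle */
	static_assert(offsetof(struct btrfs_fs_info, nodesize) / SMP_CACHE_BYTES ==
		      offsetof(struct btrfs_fs_info, buffer_tree) / SMP_CACHE_BYTES,
		      "nodesize and buffer_tree should share a cacheline");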

> > > I don't think we need to do the optimization right now, but maybe in the
> > > future if there's a need to add something to eb. Still, we can use the
> > > remaining 16 bytes up to 256 without making things worse.
> >
> > This really depends on configuration. On my laptop (Debian -rt kernel)
> > the eb struct is actually 272 bytes, as the rt_mutex is significantly
> > heavier than a raw spinlock. And -rt is a first-class citizen nowadays,
> > often used in Kubernetes deployments like 5G RAN telco, DPDK and such.
> > I think it would be nice to slim the struct below 256 bytes even there,
> > if that's your aim.
>
> I configured and built an RT kernel to see if it's possible to get to
> 256 bytes on RT, and it seems yes, at the big sacrifice of removing
> several struct members that cache values like folio_size or
> folio_shift, which generates worse code.
>
> As 272 is a multiple of 16 it's a reasonable size and we don't need to
> optimize further. The number of ebs in one slab is 30; with the non-rt
> build it's 34, which sounds OK.

That sounds fair. Well, the 256 bytes were your argument in the first place.
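
(For the record, those counts line up with order-1, 8 KiB slabs:
floor(8192 / 272) = 30 ebs per slab on -rt, and floor(8192 / 240) = 34
for a 240-byte non-rt struct; I'm assuming the slab order here, just
to make the arithmetic explicit.)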

Still, with this:

--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -82,7 +82,10 @@ void __cold extent_buffer_free_cachep(void);
 struct extent_buffer {
        u64 start;
        u32 folio_size;
-       unsigned long bflags;
+       u8 folio_shift;
+       /* >= 0 if eb belongs to a log tree, -1 otherwise */
+       s8 log_index;
+       unsigned short bflags;
        struct btrfs_fs_info *fs_info;

        /*
@@ -94,9 +97,6 @@ struct extent_buffer {
        spinlock_t refs_lock;
        atomic_t refs;
        int read_mirror;
-       /* >= 0 if eb belongs to a log tree, -1 otherwise */
-       s8 log_index;
-       u8 folio_shift;
        struct rcu_head rcu_head;

        struct rw_semaphore lock;

you're down to 256 bytes even on -rt. And the great part is I don't see
any sacrifices (other than accessing a cacheline in fs_info). We're only
using 8 flags now, so there is still some room left for another 8 if
needed in the future.
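
And if we want to keep it that way, the result could be pinned at
build time (again just a sketch, not part of this diff):

	/* illustrative only: fail the build if extent_buffer ever
	 * grows past four cachelines (256 bytes) again */
	static_assert(sizeof(struct extent_buffer) <= 256,
		      "extent_buffer should stay within 256 bytes");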
