Message-ID: <CAPjX3FfhfYGdMJLi4KLKnTd9BdhMLzScxtN0KKQxwYQ8JScxjg@mail.gmail.com>
Date: Wed, 14 May 2025 11:55:58 +0200
From: Daniel Vacek <neelx@...e.com>
To: dsterba@...e.cz
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>, 
	linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] btrfs: index buffer_tree using node size

On Tue, 13 May 2025 at 19:10, David Sterba <dsterba@...e.cz> wrote:
>
> On Tue, May 13, 2025 at 09:56:19AM +0200, Daniel Vacek wrote:
> > On Mon, 12 May 2025 at 22:20, David Sterba <dsterba@...e.cz> wrote:
> > >
> > > On Mon, May 12, 2025 at 07:23:20PM +0200, Daniel Vacek wrote:
> > > > So far we are deriving the buffer tree index using the sector size. But each
> > > > extent buffer covers multiple sectors. This makes the buffer tree rather sparse.
> > > >
> > > > For example the typical and quite common configuration uses a sector size of
> > > > 4KiB and a node size of 16KiB. In this case the buffer tree uses at most 25%
> > > > of its slots. In other words, at least 75% of the tree slots are wasted,
> > > > never used.
> > > >
> > > > We can achieve significant memory savings on the required tree nodes by
> > > > indexing the tree using the node size instead. As a result far fewer slots
> > > > are wasted and the tree can then use up to 100% of its slots.
> > >
> > > This looks interesting. Is there a way to get xarray stats? I don't see
> > > anything in the public API, e.g. depth, fanout, slack per level. For
> > > debugging purposes we could put it in sysfs or in a syslog message,
> > > or perhaps as non-debugging output in commit_stats.
> >
> > I'm using a python script in crash (even live on my laptop). I believe
> > you could do the same in drgn. Though those aren't the runtime stats
> > you described. And I don't really think it's worth it.
>
> How come you don't think it's worth it? You claim some numbers and we
> don't have a way to verify them or gather them on various setups or
> workloads. I'd be interested in the numbers also to better understand
> how the xarray performs with the extent buffers, but I don't know how
> to write the analysis scripts in any of the tools, nor do I have time
> for that.

Well, the xarray behaves like a generic radix tree. That's basic theory.
It is just a tool, and a tool you know: you know how it works, what it
is good for and what to expect from it. That's why you decided to use
it, right?

My only claim is that we're currently limited to using 16 slots per
leaf node (mistakenly skipping 3 out of every 4 slots, since we never
use the last 2 bits of the index), and that this can be fixed quite
simply by indexing correctly, so that we use the xarray the way it is
meant to be used. Basically we're using the tool very inefficiently at
the moment. I believe you can see that. So far so good, right?
No stats would explain it better than that. IMO, this is already the
best information you can get to describe and understand the issue.
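
If it helps, here's a trivial userspace sketch of what the indexing fix
boils down to (an illustration only; the *_bits variables are made up to
mirror the kernel names, this is not the actual patch):

#include <stdio.h>

/* With 4KiB sectors and 16KiB nodes, indexing by sector size leaves the
 * low 2 bits of every eb index constant, so at most 16 of the 64 slots
 * in each xarray leaf can ever be used. Indexing by node size drops
 * those dead bits. */
int main(void)
{
	unsigned long long eb_start = 0x10d0727ULL << 14; /* a 16KiB-aligned eb */
	unsigned int sectorsize_bits = 12;	/* 4KiB sectors */
	unsigned int nodesize_bits = 14;	/* 16KiB nodes */

	printf("old index: 0x%llx (low 2 bits always the same)\n",
	       eb_start >> sectorsize_bits);
	printf("new index: 0x%llx\n",
	       eb_start >> nodesize_bits);
	return 0;
}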

In my experience, more stats can sometimes be misleading or
misinterpreted, even by kernel engineers. Sometimes the stats are
confusing or simply not relevant. So it's better to focus on the facts
that matter for the given issue.

And I also believe there is a reason the xarray does not provide or
even keep any stats. What for? Would users need them? Would they
benefit? Do you care about any tree stats on your cell phone? I rather
doubt it. The kernel would be doing additional work just to keep stats
nobody uses. That would be inefficient, useless overhead.
Perhaps it could make sense as an optional feature for debug builds,
useful during development. But even then, you already know you need to
use this tool. And if you don't implement it from scratch and instead
use the one the kernel provides, you also know it already works as well
as it can. In the end it's just a plain radix tree.

Looking at it the other way around: the fact that the xarray does not
provide any stats perhaps means that nobody has really needed them so
far? It wouldn't be that hard to implement if someone did. So why is it
not there already? Maybe because it would just be additional overhead
that so far has not been justified.
And IMO, the very fact that it's still not implemented is a good hint
that checking does not make much sense. Either it's the right tool for
you no matter how the stats would turn out, or you need a different
tool.

Now, if you'd like some numbers to refresh how radix trees behave in
general, I can show the numbers I pulled yesterday for Filipe, who was
also curious. I'll be rather brief below; for further details check my
other email.

So for a radix tree, both the height and the width of the tree are
fully determined by the maximal index. The one implemented in the
kernel grows and shrinks accordingly. All the leaves are on the
same/last level; in a sense the leaves are kinda 'flat'.

Checking my /home fs on the laptop I'm writing this from, the max
buffer_radix index at the moment was 0x4341c9c, which means 5 levels.
The xarray uses radix 64, so every 6 bits of index need a new tree
level. Now, since we fill the last 2 bits of the index with zeroes
(well, depending on the eb->start offset/alignment maybe some other
constant, but still a constant), we're effectively using only 4 bits
at the last (leaf) level. That's why we fill the leaves only up to 25%
at best (provided the ebs densely follow each other). Of course we
don't always allocate all the ebs, so the tree can be sparse by
design. But that's still OK, that's what we want. But...
We also make the tree needlessly wider than optimal. Done correctly
(which is what this patch does), we get rid of the 2 constant/redundant
index bits: 0x4341c9c >> 2 = 0x10d0727. See? Now the width of the tree
would be 0x10d0727 instead of 0x4341c9c. It's still the same 5 levels
of height, but it's now much narrower/denser. Depending on how sparse
the tree is, up to 4 nodes can be merged into one this way, though
again, that depends on how dense or sparse the tree is to begin with.
The real difference is that the tree will not use more nodes than
needed, which unfortunately can happen without the fix, as the tree is
additionally filled with thin air.
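
If you want to check the level arithmetic yourself, here's a
back-of-the-envelope userspace sketch (just the radix math, not an
xarray internals dump):

#include <stdio.h>

/* A 64-way radix tree consumes 6 index bits per level, so the height is
 * the number of significant bits of the max index divided by 6, rounded
 * up. */
static unsigned int levels(unsigned long long max_index)
{
	unsigned int bits = 0;

	while (max_index) {
		bits++;
		max_index >>= 1;
	}
	return (bits + 5) / 6;
}

int main(void)
{
	printf("0x4341c9c -> %u levels\n", levels(0x4341c9cULL)); /* 5 */
	printf("0x10d0727 -> %u levels\n", levels(0x10d0727ULL)); /* 5 */
	return 0;
}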

Out of curiosity I also checked how many ebs were actually present and
how many leaves were used in the tree. For this FS it was 13k ebs
fitted into 4150 leaf nodes, so on average a bit over 3 ebs per leaf.
I'd estimate that with the fix we can get into the ballpark of 7-9 ebs
per leaf, maybe something like 1600 leaves instead of 4150 for the same
13k ebs. IMO, that's a very nice win indeed.
The worst case would be no change, but the tree is really not that sparse.
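
The division behind those averages, in case you want to reproduce it
(the ~8 ebs per leaf with the fix is my estimate, not a measurement):

#include <stdio.h>

int main(void)
{
	printf("now:   %.1f ebs per leaf\n", 13000.0 / 4150);	/* ~3.1 */
	printf("fixed: ~%d leaves\n", 13000 / 8);		/* ~1600 */
	return 0;
}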

Now imagine if you wanted to use an rb tree here: you'd need 13k tree
nodes, though they'd be embedded in the ebs themselves. But the tree
height would be something like 14-28 depending on balancing (log2(13k)
is about 13.7, and a red-black tree is at worst about twice the optimal
height).

In terms of memory used, for this example the rb tree would take about
300KB, while the xarray can be estimated at about 900KB with this patch
(versus about 2.4MB now). That's if I estimate 1.5MB saved by the fix.
That's the difference between using the tree correctly and incorrectly.
And even if the estimate is not exact, we'll always get at least some
savings.
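
Spelling out the arithmetic behind those figures (the struct sizes are
my rough assumptions for x86_64, around sizeof(struct rb_node) and
sizeof(struct xa_node), not measured values):

#include <stdio.h>

int main(void)
{
	unsigned long rb_node_sz = 24;	/* 3 pointers, assumed */
	unsigned long xa_node_sz = 576;	/* 64 slots + header, assumed */

	/* counting leaf nodes only; the upper levels add comparatively
	 * little on top */
	printf("rb tree:      ~%lu KB\n", 13000 * rb_node_sz / 1024); /* ~300KB */
	printf("xarray now:   ~%lu KB\n", 4150 * xa_node_sz / 1024);  /* ~2.3MB */
	printf("xarray fixed: ~%lu KB\n", 1600 * xa_node_sz / 1024);  /* ~900KB */
	return 0;
}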

Hopefully this gives you a better picture to satisfy your curiosity.
