Message-ID: <CAL3q7H5WVUu3sHAJYhP+n2UKKKRNLbYHayBUvc8W9RC2_i5nuQ@mail.gmail.com>
Date: Tue, 13 May 2025 11:12:09 +0100
From: Filipe Manana <fdmanana@...nel.org>
To: Daniel Vacek <neelx@...e.com>
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>, 
	linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] btrfs: index buffer_tree using node size

On Mon, May 12, 2025 at 6:24 PM Daniel Vacek <neelx@...e.com> wrote:
>
> So far we are deriving the buffer tree index using the sector size. But each
> extent buffer covers multiple sectors. This makes the buffer tree rather sparse.
>
> For example the typical and quite common configuration uses a sector size of
> 4KiB and a node size of 16KiB. In this case the buffer tree uses at most 25%
> of its slots. In other words, at least 75% of the tree's slots are wasted,
> as they are never used.
>
> We can achieve significant memory savings on the required tree nodes by
> indexing the tree using the node size instead. As a result far fewer slots
> are wasted and the tree can now use up to 100% of its slots.
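
For concreteness, here is a minimal standalone sketch (mine, not part
of the patch) of the two index computations for the common 4KiB
sector / 16KiB node configuration described above:

#include <stdio.h>

int main(void)
{
        unsigned int sectorsize_bits = 12;      /* ilog2(4096)  */
        unsigned int node_bits = 14;            /* ilog2(16384) */
        unsigned long long start;

        /* Four consecutive 16KiB extent buffers. */
        for (start = 0; start < 4 * 16384ULL; start += 16384)
                printf("eb->start=%llu sector index=%llu node index=%llu\n",
                       start,
                       start >> sectorsize_bits, /* 0, 4, 8, 12 - sparse */
                       start >> node_bits);      /* 0, 1, 2, 3  - dense  */
        return 0;
}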

Did you do any benchmarks? What results did you get?

If in practice we get such large space gains and waste that much less
memory, we should also observe faster operations on the buffer tree
(xarray), since the data structure becomes smaller, with fewer nodes.

However, while this is a logically sound thing to do, in practice we
will always get many unused slots per xarray node, because:

1) We don't keep all extent buffers in memory all the time - some are
less frequently used and get evicted from memory after a while, or are
rarely loaded in the first place, since not all parts of a btree are
equally "hot";

2) It's a COW filesystem and metadata is always COWed, so extent
buffers get allocated all over the place, with big gaps between them,
in different block groups, etc. (see the sketch below).
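
As a rough sketch of that sparsity (assuming the default
XA_CHUNK_SHIFT of 6, i.e. 64 slots per xarray node, and hypothetical
eb->start values), extent buffers scattered across block groups land
in different leaf nodes, each of them mostly empty:

#include <stdio.h>

#define XA_CHUNK_SHIFT  6       /* 64 slots per xarray node */
#define NODE_BITS       14      /* 16KiB nodes */

int main(void)
{
        /* Hypothetical eb->start values in different block groups. */
        unsigned long long starts[] = { 1ULL << 30, 5ULL << 30, 9ULL << 30 };

        for (int i = 0; i < 3; i++) {
                unsigned long long index = starts[i] >> NODE_BITS;

                printf("start=%lluGiB index=%llu leaf node=%llu slot=%llu\n",
                       starts[i] >> 30, index,
                       index >> XA_CHUNK_SHIFT,  /* which leaf node  */
                       index & ((1ULL << XA_CHUNK_SHIFT) - 1)); /* slot */
        }
        return 0;
}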

And what about memory usage - did you see any significant reduction
for some workload? What was the reduction, and what was the workload?

Xarray uses a kmem_cache to allocate nodes, so if we get gains as huge
as the change log claims, we should see a reduction by monitoring the
files inside /sys/kernel/slab/radix_tree_node, like the "objects" and
"total_objects" files, which tell us how many xarray nodes are in use
and how many are allocated in total - this is for all xarrays, so many
will not belong to the buffer tree or even to btrfs, but it should
still be a very noticeable reduction on a workload that is heavy on
metadata, like fs_mark creating empty files to build a large fs tree
(a few gigabytes at least).
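
A quick way to capture those counters before and after a run, as a
sketch (remember these sysfs files cover every xarray user in the
system, not just the buffer tree, so compare on an otherwise idle
machine):

#include <stdio.h>

static unsigned long read_slab_counter(const char *name)
{
        char path[128];
        unsigned long val = 0;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/kernel/slab/radix_tree_node/%s", name);
        f = fopen(path, "r");
        if (f) {
                /* The file starts with the count; ignore any per-node tail. */
                if (fscanf(f, "%lu", &val) != 1)
                        val = 0;
                fclose(f);
        }
        return val;
}

int main(void)
{
        printf("objects=%lu total_objects=%lu\n",
               read_slab_counter("objects"),
               read_slab_counter("total_objects"));
        return 0;
}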

Thanks.


>
> Signed-off-by: Daniel Vacek <neelx@...e.com>
> ---
>  fs/btrfs/disk-io.c   |  1 +
>  fs/btrfs/extent_io.c | 30 +++++++++++++++---------------
>  fs/btrfs/fs.h        |  3 ++-
>  3 files changed, 18 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5bcf11246ba66..dcea5b0a2db50 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3395,6 +3395,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>         fs_info->delalloc_batch = sectorsize * 512 * (1 + ilog2(nr_cpu_ids));
>
>         fs_info->nodesize = nodesize;
> +       fs_info->node_bits = ilog2(nodesize);
>         fs_info->sectorsize = sectorsize;
>         fs_info->sectorsize_bits = ilog2(sectorsize);
>         fs_info->csums_per_leaf = BTRFS_MAX_ITEM_SIZE(fs_info) / fs_info->csum_size;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4d3584790cf7f..80a8563a25add 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1774,7 +1774,7 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
>          */
>         spin_lock(&eb->refs_lock);
>         if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> -               XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->sectorsize_bits);
> +               XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->node_bits);
>                 unsigned long flags;
>
>                 set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
> @@ -1874,7 +1874,7 @@ static void set_btree_ioerr(struct extent_buffer *eb)
>  static void buffer_tree_set_mark(const struct extent_buffer *eb, xa_mark_t mark)
>  {
>         struct btrfs_fs_info *fs_info = eb->fs_info;
> -       XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->sectorsize_bits);
> +       XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->node_bits);
>         unsigned long flags;
>
>         xas_lock_irqsave(&xas, flags);
> @@ -1886,7 +1886,7 @@ static void buffer_tree_set_mark(const struct extent_buffer *eb, xa_mark_t mark)
>  static void buffer_tree_clear_mark(const struct extent_buffer *eb, xa_mark_t mark)
>  {
>         struct btrfs_fs_info *fs_info = eb->fs_info;
> -       XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->sectorsize_bits);
> +       XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->node_bits);
>         unsigned long flags;
>
>         xas_lock_irqsave(&xas, flags);
> @@ -1986,7 +1986,7 @@ static unsigned int buffer_tree_get_ebs_tag(struct btrfs_fs_info *fs_info,
>         rcu_read_lock();
>         while ((eb = find_get_eb(&xas, end, tag)) != NULL) {
>                 if (!eb_batch_add(batch, eb)) {
> -                       *start = (eb->start + eb->len) >> fs_info->sectorsize_bits;
> +                       *start = (eb->start + eb->len) >> fs_info->node_bits;
>                         goto out;
>                 }
>         }
> @@ -2008,7 +2008,7 @@ static struct extent_buffer *find_extent_buffer_nolock(
>                 struct btrfs_fs_info *fs_info, u64 start)
>  {
>         struct extent_buffer *eb;
> -       unsigned long index = start >> fs_info->sectorsize_bits;
> +       unsigned long index = start >> fs_info->node_bits;
>
>         rcu_read_lock();
>         eb = xa_load(&fs_info->buffer_tree, index);
> @@ -2114,8 +2114,8 @@ void btrfs_btree_wait_writeback_range(struct btrfs_fs_info *fs_info, u64 start,
>                                       u64 end)
>  {
>         struct eb_batch batch;
> -       unsigned long start_index = start >> fs_info->sectorsize_bits;
> -       unsigned long end_index = end >> fs_info->sectorsize_bits;
> +       unsigned long start_index = start >> fs_info->node_bits;
> +       unsigned long end_index = end >> fs_info->node_bits;
>
>         eb_batch_init(&batch);
>         while (start_index <= end_index) {
> @@ -2151,7 +2151,7 @@ int btree_write_cache_pages(struct address_space *mapping,
>
>         eb_batch_init(&batch);
>         if (wbc->range_cyclic) {
> -               index = (mapping->writeback_index << PAGE_SHIFT) >> fs_info->sectorsize_bits;
> +               index = (mapping->writeback_index << PAGE_SHIFT) >> fs_info->node_bits;
>                 end = -1;
>
>                 /*
> @@ -2160,8 +2160,8 @@ int btree_write_cache_pages(struct address_space *mapping,
>                  */
>                 scanned = (index == 0);
>         } else {
> -               index = wbc->range_start >> fs_info->sectorsize_bits;
> -               end = wbc->range_end >> fs_info->sectorsize_bits;
> +               index = wbc->range_start >> fs_info->node_bits;
> +               end = wbc->range_end >> fs_info->node_bits;
>
>                 scanned = 1;
>         }
> @@ -3037,7 +3037,7 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
>         eb->fs_info = fs_info;
>  again:
>         xa_lock_irq(&fs_info->buffer_tree);
> -       exists = __xa_cmpxchg(&fs_info->buffer_tree, start >> fs_info->sectorsize_bits,
> +       exists = __xa_cmpxchg(&fs_info->buffer_tree, start >> fs_info->node_bits,
>                               NULL, eb, GFP_NOFS);
>         if (xa_is_err(exists)) {
>                 ret = xa_err(exists);
> @@ -3353,7 +3353,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  again:
>         xa_lock_irq(&fs_info->buffer_tree);
>         existing_eb = __xa_cmpxchg(&fs_info->buffer_tree,
> -                                  start >> fs_info->sectorsize_bits, NULL, eb,
> +                                  start >> fs_info->node_bits, NULL, eb,
>                                    GFP_NOFS);
>         if (xa_is_err(existing_eb)) {
>                 ret = xa_err(existing_eb);
> @@ -3456,7 +3456,7 @@ static int release_extent_buffer(struct extent_buffer *eb)
>                  * in this case.
>                  */
>                 xa_cmpxchg_irq(&fs_info->buffer_tree,
> -                              eb->start >> fs_info->sectorsize_bits, eb, NULL,
> +                              eb->start >> fs_info->node_bits, eb, NULL,
>                                GFP_ATOMIC);
>
>                 btrfs_leak_debug_del_eb(eb);
> @@ -4294,9 +4294,9 @@ static int try_release_subpage_extent_buffer(struct folio *folio)
>  {
>         struct btrfs_fs_info *fs_info = folio_to_fs_info(folio);
>         struct extent_buffer *eb;
> -       unsigned long start = folio_pos(folio) >> fs_info->sectorsize_bits;
> +       unsigned long start = folio_pos(folio) >> fs_info->node_bits;
>         unsigned long index = start;
> -       unsigned long end = index + (PAGE_SIZE >> fs_info->sectorsize_bits) - 1;
> +       unsigned long end = index + (PAGE_SIZE >> fs_info->node_bits) - 1;
>         int ret;
>
>         xa_lock_irq(&fs_info->buffer_tree);
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index cf805b4032af3..8c9113304fabe 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -778,8 +778,9 @@ struct btrfs_fs_info {
>
>         struct btrfs_delayed_root *delayed_root;
>
> -       /* Entries are eb->start / sectorsize */
> +       /* Entries are eb->start >> node_bits */
>         struct xarray buffer_tree;
> +       int node_bits;
>
>         /* Next backup root to be overwritten */
>         int backup_root_index;
> --
> 2.47.2
>
>
