Message-ID: <CAJnrk1ZBDLim8ZK-Fc9gXUVht9rJOdSTKO+fb+kxoGpWuwTu9w@mail.gmail.com>
Date: Mon, 31 Mar 2025 14:48:01 -0700
From: Joanne Koong <joannelkoong@...il.com>
To: Jaco Kroon <jaco@....co.za>
Cc: bernd.schubert@...tmail.fm, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, miklos@...redi.hu, rdunlap@...radead.org,
trapexit@...wn.link
Subject: Re: [PATCH 2/2] fuse: Adjust readdir() buffer to requesting buffer size.
On Mon, Mar 31, 2025 at 1:43 PM Jaco Kroon <jaco@....co.za> wrote:
>
> Hi,
>
> On 2025/03/31 18:41, Joanne Koong wrote:
> > On Fri, Mar 14, 2025 at 3:39 PM Jaco Kroon <jaco@....co.za> wrote:
> >> Clamp to min 1 page (4KB) and max 128 pages (512KB).
> >>
> >> Glusterfs trial using strace ls -l.
> >>
> >> Before:
> >>
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 616
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 624
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 624
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
> >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 608
> >> getdents64(3, 0x7f2d7d7a7040 /* 1 entries */, 131072) = 24
> >> getdents64(3, 0x7f2d7d7a7040 /* 0 entries */, 131072) = 0
> >>
> >> After:
> >>
> >> getdents64(3, 0x7ffae8eed040 /* 276 entries */, 131072) = 6696
> >> getdents64(3, 0x7ffae8eed040 /* 0 entries */, 131072) = 0
> >>
> >> Signed-off-by: Jaco Kroon <jaco@....co.za>
> >> ---
> >> fs/fuse/readdir.c | 22 ++++++++++++++++++----
> >> 1 file changed, 18 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> >> index 17ce9636a2b1..a0ccbc84b000 100644
> >> --- a/fs/fuse/readdir.c
> >> +++ b/fs/fuse/readdir.c
> >> @@ -337,11 +337,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
> >> struct fuse_mount *fm = get_fuse_mount(inode);
> >> struct fuse_io_args ia = {};
> >> struct fuse_args_pages *ap = &ia.ap;
> >> - struct fuse_folio_desc desc = { .length = PAGE_SIZE };
> >> + struct fuse_folio_desc desc = { .length = ctx->count };
> >> u64 attr_version = 0, evict_ctr = 0;
> >> bool locked;
> >> + int order;
> >>
> >> - folio = folio_alloc(GFP_KERNEL, 0);
> >> + if (desc.length < PAGE_SIZE)
> >> + desc.length = PAGE_SIZE;
> >> + else if (desc.length > (PAGE_SIZE << 7)) /* 128 pages, typically 512KB */
> >> + desc.length = PAGE_SIZE << 7;
> >> +
> > Just wondering, how did 128 pages get decided as the upper bound? It
> > seems to me to make more sense if the upper bound is fc->max_pages.
>
> Best answer ... random/guess at something which may be sensible.
>
> > Also btw, I think you can just use the clamp() helper from
> > <linux/minmax.h> to do the clamping
>
> Thanks. Not a regular contributor to the kernel, not often that I've
> got an itch that needs scratching here :).
>
> So something like this then:
>
>
>         desc.length = clamp(desc.length, PAGE_SIZE,
>                             fm->fc->max_pages << CONFIG_PAGE_SHIFT);
>         order = get_count_order(desc.length >> CONFIG_PAGE_SHIFT);
>
You can just use PAGE_SHIFT here instead of CONFIG_PAGE_SHIFT
> Note: Can't use ctx->count directly in clamp() here due to it being
> signed, where desc.length is unsigned.
>
> I'm *assuming* get_count_order will round-up, so if max_pages is 7 (if
> non-power of two is even possible) we will really get 8 pages here?
Yes, if you have a max_pages of 7, this will round up and return an
order of 3.
>
> Compile tested only. Will perform basic run-time test before re-submit.
>
> >> + order = get_count_order(desc.length >> CONFIG_PAGE_SHIFT);
> >> +
> >> + do {
> >> + folio = folio_alloc(GFP_KERNEL, order);
> > Folios can now be larger than one page for readdir requests with
> > your change, but I don't believe the current page copying code in fuse
> > supports this yet. For example, I think the kmapping will be
> > insufficient in fuse_copy_page(), where in the current code we kmap
> > only the first page in the folio. I sent a patch for supporting
> > large-folio page copying [1] and am trying to get it merged, but
> > haven't heard back about that patchset yet. In your local tests that
> > used multiple pages for the readdir request, did you run into any
> > issues, or did it work fine?
>
> My tests boiled down to running strace as per above, and then some basic
> time trials using find /path/to/mount/point with and without the patch
> over a fairly large structure containing about 170m inodes. No problems
> observed. That said ... I've done similar before, and then introduced a
> major memory leak that under load destroyed 100GB of RAM in minutes.
> Thus why I'm looking for a few eyeballs on this before going to
> production (what we have works, it's just on an older kernel).
>
> If further improvements are possible that would be great, but based on
> testing this is already at least a 10x improvement on readdir() performance.
>
I think you need the patch I linked to or this could cause crashes.
The patch adds support for copying folios larger than 1 page size in
fuse. Maybe you're not running into the crash because it's going
through splice which will circumvent copying, but in the non-splice
case, I believe the kmap is insufficient when you go to do the actual
copying. IOW, I think that patch is a dependency for this one.
Thanks,
Joanne
> > [1] https://lore.kernel.org/linux-fsdevel/20250123012448.2479372-2-joannelkoong@gmail.com/
> Took a quick look, wish I could provide you some feedback but that's
> beyond my kernel skill set to just eyeball.
>
> Looks like you're primarily getting rid of the code that references the
> pages inside the folios and just operating on the folios directly? A
> side effect of which (your goal) is to enable larger copies rather than
> small ones?
>
> Thank you,
> Jaco