lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJfpegsrA93ZzWGwgTNdJgPK0UXhiTSK0QV--k=YpaucnrNj5A@mail.gmail.com>
Date:   Thu, 27 Jul 2023 21:17:44 +0200
From:   Miklos Szeredi <miklos@...redi.hu>
To:     Jaco Kroon <jaco@....co.za>
Cc:     linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Randy Dunlap <rdunlap@...radead.org>,
        Bernd Schubert <bernd.schubert@...tmail.fm>,
        Antonio SJ Musumeci <trapexit@...wn.link>
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir [v2].

On Thu, 27 Jul 2023 at 18:58, Jaco Kroon <jaco@....co.za> wrote:
>
> Hi,
>
> On 2023/07/27 17:35, Miklos Szeredi wrote:
> > On Thu, 27 Jul 2023 at 10:13, Jaco Kroon <jaco@....co.za> wrote:
> >> This patch does not mess with the caching infrastructure like the
> >> previous one, which we believe caused excessive CPU and broke directory
> >> listings in some cases.
> >>
> >> This version only affects the uncached read, which then during parse adds an
> >> entry at a time to the cached structures by way of copying, and as such,
> >> we believe this should be sufficient.
> >>
> >> We're still seeing cases where getdents64 takes ~10s (this was the case
> >> in any case without this patch, the difference now that we get ~500
> >> entries for that time rather than the 14-18 previously).  We believe
> >> that that latency is introduced on glusterfs side and is under separate
> >> discussion with the glusterfs developers.
> >>
> >> This is still a compile-time option, but a working one compared to
> >> previous patch.  For now this works, but it's not recommended for merge
> >> (as per email discussion).
> >>
> >> This still uses alloc_pages rather than kvmalloc/kvfree.
> >>
> >> Signed-off-by: Jaco Kroon <jaco@....co.za>
> >> ---
> >>   fs/fuse/Kconfig   | 16 ++++++++++++++++
> >>   fs/fuse/readdir.c | 18 ++++++++++++------
> >>   2 files changed, 28 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> >> index 038ed0b9aaa5..0783f9ee5cd3 100644
> >> --- a/fs/fuse/Kconfig
> >> +++ b/fs/fuse/Kconfig
> >> @@ -18,6 +18,22 @@ config FUSE_FS
> >>            If you want to develop a userspace FS, or if you want to use
> >>            a filesystem based on FUSE, answer Y or M.
> >>
> >> +config FUSE_READDIR_ORDER
> >> +       int
> >> +       range 0 5
> >> +       default 5
> >> +       help
> >> +               readdir performance varies greatly depending on the size of the read.
> >> +               Larger buffers results in larger reads, thus fewer reads and higher
> >> +               performance in return.
> >> +
> >> +               You may want to reduce this value on seriously constrained memory
> >> +               systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> >> +
> >> +               This value reprents the order of the number of pages to allocate (ie,
> >> +               the shift value).  A value of 0 is thus 1 page (4KiB) where 5 is 32
> >> +               pages (128KiB).
> >> +
> >>   config CUSE
> >>          tristate "Character device in Userspace support"
> >>          depends on FUSE_FS
> >> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> >> index dc603479b30e..47cea4d91228 100644
> >> --- a/fs/fuse/readdir.c
> >> +++ b/fs/fuse/readdir.c
> >> @@ -13,6 +13,12 @@
> >>   #include <linux/pagemap.h>
> >>   #include <linux/highmem.h>
> >>
> >> +#define READDIR_PAGES_ORDER            CONFIG_FUSE_READDIR_ORDER
> >> +#define READDIR_PAGES                  (1 << READDIR_PAGES_ORDER)
> >> +#define READDIR_PAGES_SIZE             (PAGE_SIZE << READDIR_PAGES_ORDER)
> >> +#define READDIR_PAGES_MASK             (READDIR_PAGES_SIZE - 1)
> >> +#define READDIR_PAGES_SHIFT            (PAGE_SHIFT + READDIR_PAGES_ORDER)
> >> +
> >>   static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> >>   {
> >>          struct fuse_conn *fc = get_fuse_conn(dir);
> >> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
> >>          struct fuse_mount *fm = get_fuse_mount(inode);
> >>          struct fuse_io_args ia = {};
> >>          struct fuse_args_pages *ap = &ia.ap;
> >> -       struct fuse_page_desc desc = { .length = PAGE_SIZE };
> >> +       struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
> > Does this really work?  I would've thought we are relying on single
> > page lengths somewhere.
> Based on my testing yes.  Getting just under 500 entries per
> getdents64() call from userspace vs 14-18 before I guess the answer is yes.
> >
> >>          u64 attr_version = 0;
> >>          bool locked;
> >>
> >> -       page = alloc_page(GFP_KERNEL);
> >> +       page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
> >>          if (!page)
> >>                  return -ENOMEM;
> >>
> >>          plus = fuse_use_readdirplus(inode, ctx);
> >>          ap->args.out_pages = true;
> >> -       ap->num_pages = 1;
> >> +       ap->num_pages = READDIR_PAGES;
> > No.  This is the array lenght, which is 1.  This is the hack I guess,
> > which makes the above trick work.
>
> Oh?  So the page referenced above isn't an array of pages?  It's
> actually a single piece of contiguous memory?

Yes.

>
> > Better use kvmalloc, which might have a slightly worse performance
> > than a large page, but definitely not worse than the current single
> > page.
>
> Which returns a void*, not struct page* - guess this can be converted
> using __virt_to_page (iirc)?

No, it cannot be converted to a page or a page array, use it as just a
piece of kernel memory.

Which means:

 - don't set ->args.out_pages and ->num_pages
 - instead set  ->args.out_args[0].value to the allocated pointer

and that should be it (fingers crossed).

>
> > If we want to optimize the overhead of kvmalloc (and it's a big if)
> > then the parse_dir*file() functions would need to be converted to
> > using a page array instead of a plain kernel pointer, which would add
> > some complexity for sure.
>
> Sorry, I read the above as "I'm surprised this works at all and you're
> not kernel panicking all over the show", this is probably the most
> ambitious kernel patch I've attempted to date.

Good start, you'll learn.

Thanks,
Miklos

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ