[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ae356339-6fb1-d098-21c9-ca88e7a3bf4f@uls.co.za>
Date: Wed, 26 Jul 2023 13:43:20 +0200
From: Jaco Kroon <jaco@....co.za>
To: Miklos Szeredi <miklos@...redi.hu>, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.
Hi,
Just to give some context, we've got maildir folders running on top of
glusterfs. Without this a "medium" sized folder can take several
minutes to go through readdir (getdents64) and each system call will
return 14-18 entries at a time. These calls (as measures using strace
-T -p) takes anywhere from around 150 micro-seconds to several
milli-seconds. Inter-server round-trip time (as measured with ping) is
generally 100-120 micro-seconds so 150 µs is probably a good lower limit.
I've tested the below patch from a more remote location (7-8ms ping
round-trip) and in spite of this massively increased latency a readdir
iteration over a small folder (a few hundred entries) was still on par
with the local case, usually even marginally better (10%), where on
larger folders the difference became much more pronounced, with
performance for the remote location at times being drastically better
than for the "close" location.
This has a few benefits overall that I see:
1. This enables glusterfs to internally read and process larger
chunks. Whether this is happening I don't know (I have a separate
ongoing discussion with the developers on that side to see what can be
done inside of glusterfs itself to improve the mechanisms on the other
side of fuse).
2. Fewer overall system calls (by an order of up to 32 fewer
getdents64()) calls for userspace to get a full directory listing, in
cases where there's 100k+ files in a folder this makes a HUGE difference
(14-18 entries vs ~500 entries per call, so >5000 calls vs 200).
3. Fewer fuse messages being passed over the fuse link.
One concerns I have is that I don't understand all of the caching and
stuff going on, and I'm typically as per strace seeing less than 64KB of
data being returned to userspace, so I'm not sure there is a direct
correlation between the FUSE readdir size and that from glibc to kernel,
and I'm slightly worried that the caching may now be broken due to
smaller than READDIR_PAGES_SIZE records being cached, with that said,
successive readdir() iterations from userspace provided identical
results so I *think* this should be OK. Someone with a better
understanding should please confirm.
I made the option configurable (but max it out by default) in case
memory constrained systems may need to drop the 128KiB value.
I suspect the discrepancy may also relate to the way in which glusterfs
merges directory listings from the underlying bricks where data is
actually stored.
Without this patch the remote system is orders of magnitude slower that
the close together systems.
Kind regards,
Jaco
On 2023/07/26 12:59, Jaco Kroon wrote:
> Signed-off-by: Jaco Kroon <jaco@....co.za>
> ---
> fs/fuse/Kconfig | 16 ++++++++++++++++
> fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
> 2 files changed, 40 insertions(+), 18 deletions(-)
>
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 038ed0b9aaa5..0783f9ee5cd3 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -18,6 +18,22 @@ config FUSE_FS
> If you want to develop a userspace FS, or if you want to use
> a filesystem based on FUSE, answer Y or M.
>
> +config FUSE_READDIR_ORDER
> + int
> + range 0 5
> + default 5
> + help
> + readdir performance varies greatly depending on the size of the read.
> + Larger buffers results in larger reads, thus fewer reads and higher
> + performance in return.
> +
> + You may want to reduce this value on seriously constrained memory
> + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> +
> + This value reprents the order of the number of pages to allocate (ie,
> + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
> + pages (128KiB).
> +
> config CUSE
> tristate "Character device in Userspace support"
> depends on FUSE_FS
> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> index dc603479b30e..98c62b623240 100644
> --- a/fs/fuse/readdir.c
> +++ b/fs/fuse/readdir.c
> @@ -13,6 +13,12 @@
> #include <linux/pagemap.h>
> #include <linux/highmem.h>
>
> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
> +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
> +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
> +
> static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> {
> struct fuse_conn *fc = get_fuse_conn(dir);
> @@ -52,10 +58,10 @@ static void fuse_add_dirent_to_cache(struct file *file,
> }
> version = fi->rdc.version;
> size = fi->rdc.size;
> - offset = size & ~PAGE_MASK;
> - index = size >> PAGE_SHIFT;
> + offset = size & ~READDIR_PAGES_MASK;
> + index = size >> READDIR_PAGES_SHIFT;
> /* Dirent doesn't fit in current page? Jump to next page. */
> - if (offset + reclen > PAGE_SIZE) {
> + if (offset + reclen > READDIR_PAGES_SIZE) {
> index++;
> offset = 0;
> }
> @@ -83,7 +89,7 @@ static void fuse_add_dirent_to_cache(struct file *file,
> }
> memcpy(addr + offset, dirent, reclen);
> kunmap_local(addr);
> - fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen;
> + fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen;
> fi->rdc.pos = dirent->off;
> unlock:
> spin_unlock(&fi->rdc.lock);
> @@ -104,7 +110,7 @@ static void fuse_readdir_cache_end(struct file *file, loff_t pos)
> }
>
> fi->rdc.cached = true;
> - end = ALIGN(fi->rdc.size, PAGE_SIZE);
> + end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE);
> spin_unlock(&fi->rdc.lock);
>
> /* truncate unused tail of cache */
> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
> struct fuse_mount *fm = get_fuse_mount(inode);
> struct fuse_io_args ia = {};
> struct fuse_args_pages *ap = &ia.ap;
> - struct fuse_page_desc desc = { .length = PAGE_SIZE };
> + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
> u64 attr_version = 0;
> bool locked;
>
> - page = alloc_page(GFP_KERNEL);
> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
> if (!page)
> return -ENOMEM;
>
> plus = fuse_use_readdirplus(inode, ctx);
> ap->args.out_pages = true;
> - ap->num_pages = 1;
> + ap->num_pages = READDIR_PAGES;
> ap->pages = &page;
> ap->descs = &desc;
> if (plus) {
> attr_version = fuse_get_attr_version(fm->fc);
> - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
> FUSE_READDIRPLUS);
> } else {
> - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
> FUSE_READDIR);
> }
> locked = fuse_lock_inode(inode);
> @@ -383,7 +389,7 @@ static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff,
> void *addr, unsigned int size,
> struct dir_context *ctx)
> {
> - unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK;
> + unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK;
> enum fuse_parse_result res = FOUND_NONE;
>
> WARN_ON(offset >= size);
> @@ -504,16 +510,16 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
>
> WARN_ON(fi->rdc.size < ff->readdir.cache_off);
>
> - index = ff->readdir.cache_off >> PAGE_SHIFT;
> + index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT;
>
> - if (index == (fi->rdc.size >> PAGE_SHIFT))
> - size = fi->rdc.size & ~PAGE_MASK;
> + if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT))
> + size = fi->rdc.size & ~READDIR_PAGES_MASK;
> else
> - size = PAGE_SIZE;
> + size = READDIR_PAGES_SIZE;
> spin_unlock(&fi->rdc.lock);
>
> /* EOF? */
> - if ((ff->readdir.cache_off & ~PAGE_MASK) == size)
> + if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size)
> return 0;
>
> page = find_get_page_flags(file->f_mapping, index,
> @@ -559,9 +565,9 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
> if (res == FOUND_ALL)
> return 0;
>
> - if (size == PAGE_SIZE) {
> + if (size == READDIR_PAGES_SIZE) {
> /* We hit end of page: skip to next page. */
> - ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE);
> + ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE);
> goto retry;
> }
>
Powered by blists - more mailing lists