Message-ID: <6074d653-3dd3-45a3-9241-a9e2e12252c6@linux.alibaba.com>
Date: Tue, 3 Sep 2024 16:44:35 +0800
From: Jingbo Xu <jefflexu@...ux.alibaba.com>
To: Hou Tao <houtao@...weicloud.com>, linux-fsdevel@...r.kernel.org
Cc: Miklos Szeredi <miklos@...redi.hu>, Vivek Goyal <vgoyal@...hat.com>,
Stefan Hajnoczi <stefanha@...hat.com>,
Bernd Schubert <bernd.schubert@...tmail.fm>,
"Michael S . Tsirkin" <mst@...hat.com>, Matthew Wilcox
<willy@...radead.org>, Benjamin Coddington <bcodding@...hat.com>,
linux-kernel@...r.kernel.org, virtualization@...ts.linux.dev,
houtao1@...wei.com
Subject: Re: [PATCH v4 1/2] virtiofs: use pages instead of pointer for kernel
direct IO
On 8/31/24 5:37 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@...wei.com>
>
> When trying to insert a 10MB kernel module kept on a virtio-fs mount
> with the cache disabled, the following warning was reported:
>
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 404 at mm/page_alloc.c:4551 ......
> Modules linked in:
> CPU: 1 PID: 404 Comm: insmod Not tainted 6.9.0-rc5+ #123
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
> RIP: 0010:__alloc_pages+0x2bf/0x380
> ......
> Call Trace:
> <TASK>
> ? __warn+0x8e/0x150
> ? __alloc_pages+0x2bf/0x380
> __kmalloc_large_node+0x86/0x160
> __kmalloc+0x33c/0x480
> virtio_fs_enqueue_req+0x240/0x6d0
> virtio_fs_wake_pending_and_unlock+0x7f/0x190
> queue_request_and_unlock+0x55/0x60
> fuse_simple_request+0x152/0x2b0
> fuse_direct_io+0x5d2/0x8c0
> fuse_file_read_iter+0x121/0x160
> __kernel_read+0x151/0x2d0
> kernel_read+0x45/0x50
> kernel_read_file+0x1a9/0x2a0
> init_module_from_file+0x6a/0xe0
> idempotent_init_module+0x175/0x230
> __x64_sys_finit_module+0x5d/0xb0
> x64_sys_call+0x1c3/0x9e0
> do_syscall_64+0x3d/0xc0
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> ......
> </TASK>
> ---[ end trace 0000000000000000 ]---
>
> The warning is triggered as follows:
>
> 1) The finit_module() syscall handles the module insertion and invokes
> kernel_read_file() to read the content of the module first.
>
> 2) kernel_read_file() allocates a 10MB buffer with vmalloc() and passes
> it to kernel_read(). kernel_read() constructs a kvec iter with
> iov_iter_kvec() and passes it to fuse_file_read_iter().
>
> 3) virtio-fs disables the cache, so fuse_file_read_iter() invokes
> fuse_direct_io(). Currently the maximal read size for a kvec iter is
> only limited by fc->max_read. For virtio-fs, max_read is UINT_MAX, so
> fuse_direct_io() doesn't split the 10MB buffer. It saves the address and
> the size of the 10MB buffer in out_args[0] of the fuse request and
> passes the request to virtio_fs_wake_pending_and_unlock().
>
> 4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to
> queue the request. Because virtio-fs needs DMA-able addresses,
> virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce buffer for
> all fuse args, copies these args into the bounce buffer and passes the
> physical address of the bounce buffer to virtiofsd. The total length of
> the fuse args for this request is about 10MB, so copy_args_to_argbuf()
> invokes kmalloc() with a 10MB size parameter and triggers the warning
> in __alloc_pages():
>
>         if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
>                 return NULL;
>
> 5) virtio_fs_enqueue_req() will retry the memory allocation in a
> kworker, but that doesn't help: kmalloc() will always return NULL for
> such an oversized allocation, and finit_module() will hang forever.
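
(For anyone else tracing this: the argbuf sizing in copy_args_to_argbuf()
is roughly the following -- simplified from memory, not a verbatim quote --
which is why a single ~10MB kvec arg turns into one huge kmalloc():

        num_in = args->in_numargs - args->in_pages;
        num_out = args->out_numargs - args->out_pages;
        len = fuse_len_args(num_in, (struct fuse_arg *)args->in_args) +
              fuse_len_args(num_out, args->out_args);
        /* ~10MB here -> order > MAX_PAGE_ORDER warning in __alloc_pages() */
        req->argbuf = kmalloc(len, GFP_ATOMIC);

Note that page-backed args are not copied into the argbuf but go straight
into the scatterlist, which is why passing the kvec data as pages, as this
patch does, avoids the oversized kmalloc() entirely.)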
>
> A feasible solution is to limit the value of max_read for virtio-fs, so
> the length passed to kmalloc() would be bounded. However, that would
> also reduce the maximal read size for normal reads. And for virtio-fs
> writes initiated from the kernel, there is a similar problem, but there
> is currently no way to limit fc->max_write in the kernel.
>
> So instead of limiting both max_read and max_write in the kernel,
> introduce use_pages_for_kvec_io in fuse_conn and set it to true in
> virtio-fs. When use_pages_for_kvec_io is enabled, fuse will use pages
> instead of a pointer to pass the KVEC_IO data.
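
Just to spell out what the kvec->pages switch relies on: if I read
lib/iov_iter.c correctly, iov_iter_extract_pages() on a kvec simply looks
up the backing struct page without pinning anything, roughly:

        /* sketch of how the kvec address is resolved during extraction */
        if (is_vmalloc_or_module_addr(kaddr))
                page = vmalloc_to_page(kaddr);
        else
                page = virt_to_page(kaddr);

So the vmalloc()'ed module buffer from step 2 ends up as a list of its
underlying pages, and is_pinned stays false on the release path.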
>
> After switching to pages for KVEC_IO data, these pages will be used for
> DMA through virtio-fs. If these pages are backed by vmalloc(),
> {flush|invalidate}_kernel_vmap_range() are necessary to flush or
> invalidate the cache before or after the DMA operation. So add two new
> fields to fuse_args to record the base address of the vmalloc area and
> a flag indicating whether invalidation is needed. Perform the flush in
> fuse_get_user_pages() for write operations and the invalidation in
> fuse_release_user_pages() for read operations.
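
To summarize the cache-maintenance rule this relies on (generic pattern,
not the exact patch code):

        if (is_vmalloc_addr(buf))
                flush_kernel_vmap_range(buf, len);       /* before the device reads it */
        /* ... DMA through the underlying pages ... */
        if (is_vmalloc_addr(buf))
                invalidate_kernel_vmap_range(buf, len);  /* after the device wrote to it */

IIUC both helpers are no-ops on architectures without cache aliasing, so
this mainly matters for the aliasing (e.g. VIVT) cases.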
>
> It may seem necessary to introduce another field in fuse_conn to
> indicate that these KVEC_IO pages are used for DMA. However, considering
> that virtio-fs is currently the only user of use_pages_for_kvec_io, just
> reuse use_pages_for_kvec_io to indicate that these pages will be used
> for DMA.
>
> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
> Signed-off-by: Hou Tao <houtao1@...wei.com>
Tested-by: Jingbo Xu <jefflexu@...ux.alibaba.com>
> ---
>  fs/fuse/file.c      | 62 +++++++++++++++++++++++++++++++--------------
>  fs/fuse/fuse_i.h    |  6 +++++
>  fs/fuse/virtio_fs.c |  1 +
>  3 files changed, 50 insertions(+), 19 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index f39456c65ed7..331208d3e4d1 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -645,7 +645,7 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
>          args->out_args[0].size = count;
>  }
>
> -static void fuse_release_user_pages(struct fuse_args_pages *ap,
> +static void fuse_release_user_pages(struct fuse_args_pages *ap, ssize_t nres,
>                                      bool should_dirty)
>  {
>          unsigned int i;
> @@ -656,6 +656,9 @@ static void fuse_release_user_pages(struct fuse_args_pages *ap,
>                  if (ap->args.is_pinned)
>                          unpin_user_page(ap->pages[i]);
>          }
> +
> +        if (nres > 0 && ap->args.invalidate_vmap)
> +                invalidate_kernel_vmap_range(ap->args.vmap_base, nres);
>  }
>
>  static void fuse_io_release(struct kref *kref)
> @@ -754,25 +757,29 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
>          struct fuse_io_args *ia = container_of(args, typeof(*ia), ap.args);
>          struct fuse_io_priv *io = ia->io;
>          ssize_t pos = -1;
> -
> -        fuse_release_user_pages(&ia->ap, io->should_dirty);
> +        size_t nres;
>
>          if (err) {
>                  /* Nothing */
>          } else if (io->write) {
>                  if (ia->write.out.size > ia->write.in.size) {
>                          err = -EIO;
> -                } else if (ia->write.in.size != ia->write.out.size) {
> -                        pos = ia->write.in.offset - io->offset +
> -                                ia->write.out.size;
> +                } else {
> +                        nres = ia->write.out.size;
> +                        if (ia->write.in.size != ia->write.out.size)
> +                                pos = ia->write.in.offset - io->offset +
> +                                        ia->write.out.size;
>                  }
>          } else {
>                  u32 outsize = args->out_args[0].size;
>
> +                nres = outsize;
>                  if (ia->read.in.size != outsize)
>                          pos = ia->read.in.offset - io->offset + outsize;
>          }
>
> +        fuse_release_user_pages(&ia->ap, err ?: nres, io->should_dirty);
> +
>          fuse_aio_complete(io, err, pos);
>          fuse_io_free(ia);
>  }
> @@ -1467,24 +1474,37 @@ static inline size_t fuse_get_frag_size(const struct iov_iter *ii,
>
>  static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>                                 size_t *nbytesp, int write,
> -                               unsigned int max_pages)
> +                               unsigned int max_pages,
> +                               bool use_pages_for_kvec_io)
>  {
> +        bool flush_or_invalidate = false;
>          size_t nbytes = 0; /* # bytes already packed in req */
>          ssize_t ret = 0;
>
> -        /* Special case for kernel I/O: can copy directly into the buffer */
> +        /* Special case for kernel I/O: can copy directly into the buffer.
> +         * However if the implementation of fuse_conn requires pages instead of
> +         * pointer (e.g., virtio-fs), use iov_iter_extract_pages() instead.
> +         */
>          if (iov_iter_is_kvec(ii)) {
> -                unsigned long user_addr = fuse_get_user_addr(ii);
> -                size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
> +                void *user_addr = (void *)fuse_get_user_addr(ii);
>
> -                if (write)
> -                        ap->args.in_args[1].value = (void *) user_addr;
> -                else
> -                        ap->args.out_args[0].value = (void *) user_addr;
> +                if (!use_pages_for_kvec_io) {
> +                        size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
>
> -                iov_iter_advance(ii, frag_size);
> -                *nbytesp = frag_size;
> -                return 0;
> +                        if (write)
> +                                ap->args.in_args[1].value = user_addr;
> +                        else
> +                                ap->args.out_args[0].value = user_addr;
> +
> +                        iov_iter_advance(ii, frag_size);
> +                        *nbytesp = frag_size;
> +                        return 0;
> +                }
> +
> +                if (is_vmalloc_addr(user_addr)) {
> +                        ap->args.vmap_base = user_addr;
> +                        flush_or_invalidate = true;
Could we move flush_kernel_vmap_range() up here, so that
flush_or_invalidate is no longer needed and the code looks cleaner?
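
Something like this is what I had in mind (untested; flushing *nbytesp may
cover slightly more than what finally gets packed, which should be harmless
for a flush):

                if (is_vmalloc_addr(user_addr)) {
                        ap->args.vmap_base = user_addr;
                        if (write)
                                flush_kernel_vmap_range(user_addr, *nbytesp);
                        else
                                ap->args.invalidate_vmap = true;
                }

Then the deferred flush after the extraction loop and the
flush_or_invalidate local can both go away.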
> +                }
>          }
>
>          while (nbytes < *nbytesp && ap->num_pages < max_pages) {
> @@ -1513,6 +1533,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>                          (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
>          }
>
> +        if (write && flush_or_invalidate)
> +                flush_kernel_vmap_range(ap->args.vmap_base, nbytes);
> +
> +        ap->args.invalidate_vmap = !write && flush_or_invalidate;
How about initializing vmap_base only when the data buffer is vmalloc'ed
and it's a read request? In that case invalidate_vmap is no longer needed.
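
I.e. something along these lines (untested), with vmap_base doubling as the
"needs invalidation" flag:

        /* in fuse_get_user_pages(): only record vmap_base for vmalloc'ed reads */
        if (!write && is_vmalloc_addr(user_addr))
                ap->args.vmap_base = user_addr;

        /* in fuse_release_user_pages(): */
        if (nres > 0 && ap->args.vmap_base)
                invalidate_kernel_vmap_range(ap->args.vmap_base, nres);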
>          ap->args.is_pinned = iov_iter_extract_will_pin(ii);
>          ap->args.user_pages = true;
>          if (write)
> @@ -1581,7 +1605,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>                  size_t nbytes = min(count, nmax);
>
>                  err = fuse_get_user_pages(&ia->ap, iter, &nbytes, write,
> -                                          max_pages);
> +                                          max_pages, fc->use_pages_for_kvec_io);
>                  if (err && !nbytes)
>                          break;
>
> @@ -1595,7 +1619,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>                  }
>
>                  if (!io->async || nres < 0) {
> -                        fuse_release_user_pages(&ia->ap, io->should_dirty);
> +                        fuse_release_user_pages(&ia->ap, nres, io->should_dirty);
>                          fuse_io_free(ia);
>                  }
>                  ia = NULL;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index f23919610313..79add14c363f 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -309,9 +309,12 @@ struct fuse_args {
>          bool may_block:1;
>          bool is_ext:1;
>          bool is_pinned:1;
> +        bool invalidate_vmap:1;
>          struct fuse_in_arg in_args[3];
>          struct fuse_arg out_args[2];
>          void (*end)(struct fuse_mount *fm, struct fuse_args *args, int error);
> +        /* Used for kvec iter backed by vmalloc address */
> +        void *vmap_base;
>  };
>
>  struct fuse_args_pages {
> @@ -860,6 +863,9 @@ struct fuse_conn {
>          /** Passthrough support for read/write IO */
>          unsigned int passthrough:1;
>
> +        /* Use pages instead of pointer for kernel I/O */
> +        unsigned int use_pages_for_kvec_io:1;
Maybe we need a better (actually shorter) name for this flag. kvec_pages?
> +
>          /** Maximum stack depth for passthrough backing files */
>          int max_stack_depth;
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index dd5260141615..43d66ab5e891 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1568,6 +1568,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>          fc->delete_stale = true;
>          fc->auto_submounts = true;
>          fc->sync_fs = true;
> +        fc->use_pages_for_kvec_io = true;
>
>          /* Tell FUSE to split requests that exceed the virtqueue's size */
>          fc->max_pages_limit = min_t(unsigned int, fc->max_pages_limit,
--
Thanks,
Jingbo