Message-ID: <CADUfDZrY7v93aDtZbD4-qKAaJHcGUbWTQinEVRG4Tiiy6m2BFA@mail.gmail.com>
Date: Wed, 5 Nov 2025 07:37:40 -0800
From: Caleb Sander Mateos <csander@...estorage.com>
To: Jens Axboe <axboe@...nel.dk>
Cc: Ming Lei <ming.lei@...hat.com>, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
On Wed, Nov 5, 2025 at 7:26 AM Jens Axboe <axboe@...nel.dk> wrote:
>
> On 11/4/25 6:48 PM, Ming Lei wrote:
> > On Mon, Nov 03, 2025 at 08:40:30AM -0800, Caleb Sander Mateos wrote:
> >>> On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@...hat.com> wrote:
> >>>
> >>> On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
> >>>> On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@...hat.com> wrote:
> >>>>>
> >>>>> On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> >>>>>> ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> >>>>>> iov_iter_get_pages2() to extract the pages from the iov_iter and
> >>>>>> memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> >>>>>> Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> >>>>>> user page reference count increments and decrements and needing to split
> >>>>>> the memcpy() at user page boundaries. It also simplifies the code
> >>>>>> considerably.
> >>>>>>
> >>>>>> Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> >>>>>> ---
> >>>>>> drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> >>>>>> 1 file changed, 14 insertions(+), 48 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> >>>>>> index 0c74a41a6753..852350e639d6 100644
> >>>>>> --- a/drivers/block/ublk_drv.c
> >>>>>> +++ b/drivers/block/ublk_drv.c
> >>>>>> @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> >>>>>>          .open = ublk_open,
> >>>>>>          .free_disk = ublk_free_disk,
> >>>>>>          .report_zones = ublk_report_zones,
> >>>>>>  };
> >>>>>>
> >>>>>> -#define UBLK_MAX_PIN_PAGES 32
> >>>>>> -
> >>>>>>  struct ublk_io_iter {
> >>>>>> -        struct page *pages[UBLK_MAX_PIN_PAGES];
> >>>>>>          struct bio *bio;
> >>>>>>          struct bvec_iter iter;
> >>>>>>  };
> >>>>>
> >>>>> ->pages[] is actually for pinning user io pages in batches, so killing it
> >>>>> may cause a perf drop.
> >>>>
> >>>> As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
> >>>> pinning entirely. It calls copy_to_user_iter() for each contiguous
> >>>> user address range:
> >>>>
> >>>> size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> >>>> {
> >>>>         if (WARN_ON_ONCE(i->data_source))
> >>>>                 return 0;
> >>>>         if (user_backed_iter(i))
> >>>>                 might_fault();
> >>>>         return iterate_and_advance(i, bytes, (void *)addr,
> >>>>                                    copy_to_user_iter, memcpy_to_iter);
> >>>> }
> >>>>
> >>>> Which just checks that the address range doesn't include any kernel
> >>>> addresses and then memcpy()s directly via the userspace virtual
> >>>> addresses:
> >>>>
> >>>> static __always_inline
> >>>> size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> >>>>                          size_t len, void *from, void *priv2)
> >>>> {
> >>>>         if (should_fail_usercopy())
> >>>>                 return len;
> >>>>         if (access_ok(iter_to, len)) {
> >>>>                 from += progress;
> >>>>                 instrument_copy_to_user(iter_to, from, len);
> >>>>                 len = raw_copy_to_user(iter_to, from, len);
> >>>>         }
> >>>>         return len;
> >>>> }
> >>>>
> >>>> static __always_inline __must_check unsigned long
> >>>> raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
> >>>> {
> >>>>         return copy_user_generic((__force void *)dst, src, size);
> >>>> }
> >>>>
> >>>> static __always_inline __must_check unsigned long
> >>>> copy_user_generic(void *to, const void *from, unsigned long len)
> >>>> {
> >>>>         stac();
> >>>>         /*
> >>>>          * If CPU has FSRM feature, use 'rep movs'.
> >>>>          * Otherwise, use rep_movs_alternative.
> >>>>          */
> >>>>         asm volatile(
> >>>>                 "1:\n\t"
> >>>>                 ALTERNATIVE("rep movsb",
> >>>>                             "call rep_movs_alternative",
> >>>>                             ALT_NOT(X86_FEATURE_FSRM))
> >>>>                 "2:\n"
> >>>>                 _ASM_EXTABLE_UA(1b, 2b)
> >>>>                 :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> >>>>                 : : "memory", "rax");
> >>>>         clac();
> >>>>         return len;
> >>>> }
> >>>>
> >>>> Am I missing something?
> >>>
> >>> The page is allocated & mapped in the page fault handler.
> >>
> >> Right, physical pages certainly need to be allocated for the virtual
> >> address range being copied to/from. But that would have happened
> >> previously in iov_iter_get_pages2(), so this isn't a new cost. And as
> >> you point out, in the common case that the virtual pages are already
> >> mapped to physical pages, the copy won't cause any page faults.
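> >>
> >> And if a page can't be faulted in, copy_to_iter() just returns fewer
> >> bytes than requested, so a caller can treat a short copy as an error.
> >> Illustrative only (kaddr/len/uiter are placeholder names here):
> >>
> >>         if (copy_to_iter(kaddr, len, uiter) != len)
> >>                 return -EFAULT; /* user page couldn't be faulted in */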
> >>
> >>>
> >>> However, in typical cases, pages in the io buffer shouldn't be swapped out
> >>> frequently, so this cleanup may be good. I will run some perf tests.
> >>
> >> Thanks for testing.
> >
> > `fio/t/io_uring` shows a 40% improvement on `./kublk -t null -q 2` with this
> > patch in my test VM, so this looks like a very nice improvement.
> >
> > It also works well when forcing IOSQE_ASYNC to be passed on the ublk
> > uring_cmd, and this change is correct because the copy is guaranteed to be
> > done in the ublk daemon context.
>
> We good to queue this up then?
Let me write a v2 implementing Ming's suggestions to use
copy_page_{to,from}_iter() and get rid of the open-coded bvec
iteration.
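
Roughly something like the sketch below (untested; the dir convention
follows the existing ublk_copy_user_pages(), and it ignores the
partial-request offset handling the real function needs):

static size_t ublk_copy_io_pages(struct request *req,
                                 struct iov_iter *uiter, int dir)
{
        struct req_iterator iter;
        struct bio_vec bv;
        size_t done = 0;

        /*
         * rq_for_each_segment() replaces the open-coded bio/bvec
         * advance; copy_page_{to,from}_iter() maps each page
         * internally, so no pinning and no manual kmap.
         */
        rq_for_each_segment(bv, req, iter) {
                size_t copied;

                if (dir == ITER_DEST)
                        copied = copy_page_to_iter(bv.bv_page, bv.bv_offset,
                                                   bv.bv_len, uiter);
                else
                        copied = copy_page_from_iter(bv.bv_page, bv.bv_offset,
                                                     bv.bv_len, uiter);
                done += copied;
                if (copied < bv.bv_len)
                        break;  /* short copy: fault or iter exhausted */
        }
        return done;
}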
Thanks,
Caleb