lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 12 Jun 2020 14:02:13 +0200
From:   Arnd Bergmann <arnd@...db.de>
To:     afzal mohammed <afzal.mohd.ma@...il.com>
Cc:     Russell King - ARM Linux admin <linux@...linux.org.uk>,
        Linus Walleij <linus.walleij@...aro.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>,
        Nicolas Pitre <nico@...xnic.net>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>
Subject: Re: [RFC 1/3] lib: copy_{from,to}_user using gup & kmap_atomic()

On Fri, Jun 12, 2020 at 12:18 PM afzal mohammed <afzal.mohd.ma@...il.com> wrote:
>
> copy_{from,to}_user() uaccess helpers are implemented by user page
> pinning, followed by temporary kernel mapping & then memcpy(). This
> helps to achieve user page copy when current virtual address mapping
> of the CPU excludes user pages.
>
> Performance wise, results are not encouraging, 'dd' on tmpfs results,
>
> ARM Cortex-A8, BeagleBone White (256MiB RAM):
> w/o series - ~29.5 MB/s
> w/ series - ~20.5 MB/s
> w/ series & highmem disabled - ~21.2 MB/s
>
> On Cortex-A15(2GiB RAM) in QEMU:
> w/o series - ~4 MB/s
> w/ series - ~2.6 MB/s
>
> Roughly a one-third drop in performance. Disabling highmem improves
> performance only slightly.
>
> 'hackbench' also showed a similar pattern.
>
> uaccess routines using page pinning & temporary kernel mapping is not
> something new, it has been done long long ago by Ingo [1] as part of
> 4G/4G user/kernel mapping implementation on x86, though not merged in
> mainline.
>
> [1] https://lore.kernel.org/lkml/Pine.LNX.4.44.0307082332450.17252-100000@localhost.localdomain/

Nice changelog text! I agree the performance drop is not great.
There are probably some things that can be done to optimize it,
but I guess most of the overhead is from the page table operations
and cannot be avoided.

What was the exact 'dd' command you used, in particular the block size?
Note that by default, 'dd' will request 512 bytes at a time, so you usually
only access a single page. It would be interesting to see the overhead with
other typical or extreme block sizes, e.g. '1', '64', '4K', '64K' or '1M'.

If you want to drill down into where exactly the overhead is (i.e.
get_user_pages or kmap_atomic, or something different), using
'perf record dd ..', and 'perf report' may be helpful.

> +static int copy_chunk_from_user(unsigned long from, int len, void *arg)
> +{
> +       unsigned long *to_ptr = arg, to = *to_ptr;
> +
> +       memcpy((void *) to, (void *) from, len);
> +       *to_ptr += len;
> +       return 0;
> +}
> +
> +static int copy_chunk_to_user(unsigned long to, int len, void *arg)
> +{
> +       unsigned long *from_ptr = arg, from = *from_ptr;
> +
> +       memcpy((void *) to, (void *) from, len);
> +       *from_ptr += len;
> +       return 0;
> +}

Will gcc optimize away the indirect function call and inline everything?
If not, that would be a small part of the overhead.

> +unsigned long gup_kmap_copy_from_user(void *to, const void __user *from, unsigned long n)
> +{
> +       struct page **pages;
> +       int num_pages, ret, i;
> +
> +       if (uaccess_kernel()) {
> +               memcpy(to, (__force void *)from, n);
> +               return 0;
> +       }
> +
> +       num_pages = DIV_ROUND_UP((unsigned long)from + n, PAGE_SIZE) -
> +                                (unsigned long)from / PAGE_SIZE;

Make sure this doesn't turn into actual division operations but uses shifts.
It might even be clearer here to open-code the shift operation so readers
can see what this is meant to compile into.

> +       pages = kmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL | __GFP_ZERO);
> +       if (!pages)
> +               goto end;

Another micro-optimization may be to avoid the kmalloc for the common case,
e.g. anything with "num_pages <= 64", using an array on the stack.

> +       ret = get_user_pages_fast((unsigned long)from, num_pages, 0, pages);
> +       if (ret < 0)
> +               goto free_pages;
> +
> +       if (ret != num_pages) {
> +               num_pages = ret;
> +               goto put_pages;
> +       }

I think this is technically incorrect: if get_user_pages_fast() only
gets some of the
pages, you should continue with the short buffer and return the number
of remaining
bytes rather than not copying anything. I think you did that correctly
for a failed
kmap_atomic(), but this has to use the same logic.

     Arnd

Powered by blists - more mailing lists