Message-ID: <20181122103231.GA102790@gmail.com>
Date: Thu, 22 Nov 2018 11:32:31 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: pabeni@...hat.com, Jens Axboe <axboe@...nel.dk>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, bp@...en8.de,
Peter Anvin <hpa@...or.com>,
the arch/x86 maintainers <x86@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrew Lutomirski <luto@...nel.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>, dvlasenk@...hat.com,
brgerst@...il.com,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> On Wed, Nov 21, 2018 at 10:16 AM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > It might be interesting to just change raw_copy_to/from_user() to
> > handle a lot more cases (in particular, handle cases where 'size' is
> > 8-byte aligned). The special cases we *do* have may not be the right
> > ones (the 10-byte case in particular looks odd).
> >
> > For example, instead of having a "if constant size is 8 bytes, do one
> > get/put_user()" case, we might have a "if constant size is < 64 just
> > unroll it into get/put_user()" calls.
>
> Actually, x86 doesn't even set INLINE_COPY_TO_USER, so I don't think
> the constant size cases ever trigger at all the way they are set up
> now.
Side note, there's one artifact the patch exposes: some of the
__builtin_constant_p() checks are imprecise and don't trigger at the
early stage where GCC checks them, but the length is actually known to
the compiler at later optimization stages.
This means that with Jens' patch some of the length checks go away. I
checked x86-64 defconfig and a distro config, and the numbers were ~7%
and ~10% respectively, so it's not a big effect.
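To illustrate the pattern (this is only a sketch of the usual
__builtin_constant_p() idiom, not the actual x86 uaccess code): the
constant-size check is evaluated at the point where the inline wrapper
is expanded, so if the length only becomes a known constant during
later inlining or constant propagation, the special case has already
been discarded:

#include <linux/types.h>
#include <linux/string.h>
#include <linux/uaccess.h>

/*
 * Sketch only: a copy_to_user()-style wrapper that special-cases a
 * small compile-time-constant size.  __builtin_constant_p(n) is
 * evaluated when this inline is expanded - if 'n' only turns into a
 * constant at a later optimization stage, we fall through to the
 * generic path even though the compiler does know the length.
 */
static __always_inline unsigned long
sketch_copy_to_user(void __user *to, const void *from, unsigned long n)
{
	if (__builtin_constant_p(n) && n == sizeof(u64)) {
		u64 val;

		memcpy(&val, from, sizeof(val));
		if (put_user(val, (u64 __user *)to))
			return n;	/* report all bytes as uncopied */
		return 0;
	}
	return copy_to_user(to, from, n);	/* generic, out-of-line path */
}
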
The kernel text size reduction with Jens' patch is small but real:

     text     data      bss      dec      hex  filename
 19572694 11516934 19873888 50963516  309a43c  vmlinux.before
 19572468 11516934 19873888 50963290  309a35a  vmlinux.after
But I checked the disassembly, and it's not a real win: the new code is
actually more complex than the old one, as expected, and GCC (7.3.0) does
some particularly stupid things which bloat the generated code.
> I do have a random patch that makes "unsafe_put_user()" actually use
> "asm goto" for the error case, and that, together with the attached
> patch seems to generate fairly nice code, but even then it would
> depend on gcc actually unrolling things (which we do *not* want in
> general).
>
> But for a 32-byte user copy (cp_old_stat), and that
> INLINE_COPY_TO_USER, it generates this:
>
>         stac
>         movl    $32, %edx       #, size
>         movq    %rsp, %rax      #, src
> .L201:
>         movq    (%rax), %rcx    # MEM[base: src_155, offset: 0B], MEM[base: src_155, offset: 0B]
> 1:      movq    %rcx,0(%rbp)    # MEM[base: src_155, offset: 0B], MEM[(struct __large_struct *)dst_156]
>         ASM_EXTABLE_HANDLE from=1b to=.L200 handler="ex_handler_uaccess"       #
>
>         addq    $8, %rax        #, src
>         addq    $8, %rbp        #, statbuf
>         subq    $8, %rdx        #, size
>         jne     .L201           #,
>         clac
>
> which is actually fairly close to "optimal".
Yeah, that looks pretty sweet!
> Random patch (with my "asm goto" hack included) attached, in case
> people want to play with it.
Doesn't even look all that hacky to me. Any hack in it that I didn't
notice? :-)
The only question is the inlining overhead - will try to measure that.
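For reference, here is a rough 64-bit-only reconstruction of what such
an "asm goto" based unsafe_put_user() can look like - this is only my
sketch of the idea, not the actual patch:

#include <linux/types.h>
#include <asm/asm.h>		/* _ASM_EXTABLE() */

/*
 * Sketch only: store a 64-bit value to user space, jumping to the
 * caller-supplied 'err_label' on a fault.  The exception table entry
 * points straight at the label, so the success path contains no
 * error-code test at all.  "asm goto" cannot have output operands
 * (with the GCC versions of this era), so the destination is passed
 * as an "m" input even though it is written - the usual kernel idiom.
 */
#define unsafe_put_user_u64(x, ptr, err_label)				\
	asm volatile goto("1:	movq %1, %0\n"				\
			  _ASM_EXTABLE(1b, %l2)				\
			  : /* no outputs */				\
			  : "m" (*(u64 __user *)(ptr)),			\
			    "er" ((u64)(x))				\
			  : /* no clobbers */				\
			  : err_label)
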
> Impressively, it actually removes more lines of code than it adds. But
> I didn't actually check whether the end result *works*, so hey..
Most of the line count reduction appears to come from simplifying the
unrolled copy loop and moving it into C, going from a 6-way hard-coded
copy routine:
> -	switch (size) {
> -	case 1:
> -	case 2:
> -	case 4:
> -	case 8:
> -	case 10:
> -	case 16:
to a more flexible 4-way loop unrolling:
> +	while (size >= sizeof(unsigned long)) {
> +	while (size >= sizeof(unsigned int)) {
> +	while (size >= sizeof(unsigned short)) {
> +	while (size >= sizeof(unsigned char)) {
Which is a nice improvement in itself.
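Spelled out in full (my reconstruction for illustration, not the patch
itself - it assumes an unsafe_put_user() that takes an error label, and
that the caller brackets the whole thing with user_access_begin() /
user_access_end()), the copy-to-user direction of such an unrolled loop
looks something like:

#include <linux/types.h>
#include <linux/uaccess.h>

/*
 * Sketch only: copy 'size' bytes to user space in successively smaller
 * chunks, branching to 'label' on a fault.  Must be called between
 * user_access_begin() and user_access_end().
 */
#define unsafe_copy_to_user_sketch(dst, src, size, label)		\
do {									\
	void __user *__d = (dst);					\
	const void *__s = (src);					\
	size_t __n = (size);						\
									\
	while (__n >= sizeof(unsigned long)) {				\
		unsafe_put_user(*(const unsigned long *)__s,		\
				(unsigned long __user *)__d, label);	\
		__d += sizeof(unsigned long);				\
		__s += sizeof(unsigned long);				\
		__n -= sizeof(unsigned long);				\
	}								\
	while (__n >= sizeof(unsigned int)) {				\
		unsafe_put_user(*(const unsigned int *)__s,		\
				(unsigned int __user *)__d, label);	\
		__d += sizeof(unsigned int);				\
		__s += sizeof(unsigned int);				\
		__n -= sizeof(unsigned int);				\
	}								\
	while (__n >= sizeof(unsigned short)) {				\
		unsafe_put_user(*(const unsigned short *)__s,		\
				(unsigned short __user *)__d, label);	\
		__d += sizeof(unsigned short);				\
		__s += sizeof(unsigned short);				\
		__n -= sizeof(unsigned short);				\
	}								\
	while (__n >= sizeof(unsigned char)) {				\
		unsafe_put_user(*(const unsigned char *)__s,		\
				(unsigned char __user *)__d, label);	\
		__d += sizeof(unsigned char);				\
		__s += sizeof(unsigned char);				\
		__n -= sizeof(unsigned char);				\
	}								\
} while (0)
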
> +	user_access_begin();
> +	if (dirent)
> +		unsafe_put_user(offset, &dirent->d_off, efault_end);
>  	dirent = buf->current_dir;
> +	unsafe_put_user(d_ino, &dirent->d_ino, efault_end);
> +	unsafe_put_user(reclen, &dirent->d_reclen, efault_end);
> +	unsafe_put_user(0, dirent->d_name + namlen, efault_end);
> +	unsafe_put_user(d_type, (char __user *) dirent + reclen - 1, efault_end);
> +	user_access_end();
> +
>  	if (copy_to_user(dirent->d_name, name, namlen))
>  		goto efault;
>  	buf->previous = dirent;
>  	dirent = (void __user *)dirent + reclen;
>  	buf->current_dir = dirent;
>  	buf->count -= reclen;
>  	return 0;
> +efault_end:
> +	user_access_end();
>  efault:
>  	buf->error = -EFAULT;
>  	return -EFAULT;
In terms of high-level APIs, could we perhaps use the opportunity to
introduce unsafe_write_user() instead, which would allow us to write it
as:

	unsafe_write_user(&dirent->d_ino, d_ino, efault_end);
	unsafe_write_user(&dirent->d_reclen, reclen, efault_end);
	unsafe_write_user(dirent->d_name + namlen, 0, efault_end);
	unsafe_write_user((char __user *)dirent + reclen - 1, d_type, efault_end);

	if (copy_to_user(dirent->d_name, name, namlen))
		goto efault;
This gives it the regular 'VAR = VAL;' notation of C assignments, instead
of the weird historical reverse notation that put_user()/get_user() uses.
Note how this newfangled ordering now matches the natural C-assignment
parameter order of the copy_to_user() call that comes straight afterwards,
and makes it obvious that the dirent->d_name + namlen write was placing
the delimiter at the end.
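If the argument order is the main concern, unsafe_write_user() could
even start out as a one-line wrapper - a sketch, since no such helper
exists today:

/*
 * Sketch only: identical semantics to unsafe_put_user(), just with the
 * destination first and the value second, like a C assignment.
 */
#define unsafe_write_user(ptr, val, err_label)	\
	unsafe_put_user(val, ptr, err_label)
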
I think we even had bugs from put_user() ordering mixups?
Or is it too late to try to fix this particular mistake?
Thanks,
Ingo