Message-ID: <20180316193948.GA12435@light.dominikbrodowski.net>
Date: Fri, 16 Mar 2018 20:39:48 +0100
From: Dominik Brodowski <linux@...inikbrodowski.net>
To: David Miller <davem@...emloft.net>
Cc: linux-kernel@...r.kernel.org, torvalds@...ux-foundation.org,
netdev@...r.kernel.org
Subject: Re: [PATCH -next 00/22] remove in-kernel syscall invocations (part 2
== netdev)
On Fri, Mar 16, 2018 at 02:30:21PM -0400, David Miller wrote:
> From: Dominik Brodowski <linux@...inikbrodowski.net>
> Date: Fri, 16 Mar 2018 18:05:52 +0100
>
> > The rationale of this change is described in patch 1 of part 1[*] as follows:
> >
> > The syscall entry points to the kernel defined by SYSCALL_DEFINEx()
> > and COMPAT_SYSCALL_DEFINEx() should only be called from userspace
> > through kernel entry points, but not from the kernel itself. This
> > will allow cleanups and optimizations to the entry paths *and* to
> > the parts of the kernel code which currently need to pretend to be
> > userspace in order to make use of syscalls.
> >
> > At present, these patches are based on v4.16-rc5; there is one trivial
> > conflict against net-next. Dave, I presume that you prefer to take them
> > through net-next? If you want to, I can re-base them against net-next.
> > If you prefer otherwise, though, I can route them as part of my whole
> > syscall series.
>
> So the transformations themselves are relatively trivial, so on that
> aspect I don't have any problems with these changes.
Thank you for your fast feedback.
> But overall I have to wonder.
>
> I imagine one of the things you'd like to do is declare that syscall
> entries use a different (better) argument passing scheme. For
> example, passing values in registers instead of on the stack.
Well, sort of. Currently, x86-64 decodes all six registers unconditionally:
	regs->ax = sys_call_table[nr](
			regs->di, regs->si, regs->dx,
			regs->r10, regs->r8, regs->r9);
so that in do_syscall_64(), we have to get six parameters from the
stack:
	mov    0x38(%rbx),%rcx
	mov    0x60(%rbx),%rdx
	mov    0x68(%rbx),%rsi
	mov    0x70(%rbx),%rdi
	mov    0x40(%rbx),%r9
	mov    0x48(%rbx),%r8
Instead, the aim is to do
	regs->ax = sys_call_table[nr](regs)
... which results in just a register rename operation:
	mov    %rbp,%rdi
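
To make the difference between the two calling conventions concrete, here is a
minimal user-space sketch in C. It is not kernel code: struct regs_sketch, the
two table types, and the demo_add_*() entries are hypothetical stand-ins for
struct pt_regs, sys_call_table[] and real syscalls, invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pt_regs: argument registers plus ax. */
struct regs_sketch {
	unsigned long di, si, dx, r10, r8, r9, ax;
};

/* Current style: every entry takes six arguments, so all six are loaded. */
typedef long (*sys_fn_args_t)(unsigned long, unsigned long, unsigned long,
			      unsigned long, unsigned long, unsigned long);

/* Proposed style: every entry takes the register block and decodes it. */
typedef long (*sys_fn_regs_t)(const struct regs_sketch *);

/* A made-up "syscall" that only cares about its first two arguments. */
static long demo_add_args(unsigned long a, unsigned long b, unsigned long c,
			  unsigned long d, unsigned long e, unsigned long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return (long)(a + b);
}

static long demo_add_regs(const struct regs_sketch *regs)
{
	/* Only the two fields that are actually needed get read. */
	return (long)(regs->di + regs->si);
}

static const sys_fn_args_t table_args[] = { demo_add_args };
static const sys_fn_regs_t table_regs[] = { demo_add_regs };

static void dispatch_args(struct regs_sketch *regs, size_t nr)
{
	regs->ax = table_args[nr](regs->di, regs->si, regs->dx,
				  regs->r10, regs->r8, regs->r9);
}

static void dispatch_regs(struct regs_sketch *regs, size_t nr)
{
	regs->ax = table_regs[nr](regs);
}

/* Both dispatchers must produce the same result for the same registers. */
void dispatch_selfcheck(void)
{
	struct regs_sketch a = { .di = 2, .si = 40 };
	struct regs_sketch b = a;

	dispatch_args(&a, 0);
	dispatch_regs(&b, 0);
	assert(a.ax == b.ax);
}
```

In the second dispatcher the only argument set up at the call site is the
pointer itself; which fields get read is decided per entry, not per dispatch.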
> But in situations where you split out the system call function
> completely into one of these "helpers", the compiler is going
> to have two choices:
>
> 1) Expand the helper into the syscall function inline, thus we end up
> with two copies of the function.
That's only sensible for very short stubs, which just call another function
(e.g. __compat_sys_sendmsg()).
> 2) Call the helper from the syscall function. Well, then the compiler
> will need to pop the syscall obtained arguments from the registers
> onto the stack.
>
> So this doesn't seem like such a total win to me.
>
> Maybe you can explain things better to ease my concerns.
For example, for sys_recv() and sys_recvfrom(), if all is complete, this
results in:
	sys_x86_64_recv:
		callq  <__fentry__>

		/* decode struct pt_regs for exactly those parameters
		 * we care about
		 */
		mov    0x38(%rdi),%rcx
		xor    %r9d,%r9d
		xor    %r8d,%r8d
		mov    0x60(%rdi),%rdx
		mov    0x68(%rdi),%rsi
		mov    0x70(%rdi),%rdi

		/* call __sys_recvfrom */
		callq  <__sys_recvfrom>

		/* cleanup and return */
		cltq
		retq
That is only four loads from the stack plus two register-clearing
operations; sys_x86_64_recvfrom is similar (six movs from the stack, one
register-rename mov, no xor).
__sys_recvfrom() then does the actual work, starting with pushing some
register context out of the way and moving registers around, more or less
what SyS_recvfrom() does today.
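
Roughly, the C behind such a stub could look like the sketch below. It is a
user-space illustration only: struct regs_sketch, sketch_sys_recv() and
sketch_sys_recvfrom() are hypothetical names standing in for struct pt_regs,
the generated stub and __sys_recvfrom(), and the helper merely records its
arguments instead of touching a socket.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the argument registers in struct pt_regs. */
struct regs_sketch {
	unsigned long di, si, dx, r10, r8, r9;
};

/* Records what the helper was called with, so the decoding can be checked. */
struct recv_call {
	int fd;
	void *buf;
	size_t len;
	unsigned int flags;
	void *addr;
	int *addr_len;
};

static struct recv_call last_call;

/* Stand-in for __sys_recvfrom(): the common helper taking plain arguments. */
static long sketch_sys_recvfrom(int fd, void *buf, size_t len,
				unsigned int flags, void *addr, int *addr_len)
{
	last_call = (struct recv_call){ fd, buf, len, flags, addr, addr_len };
	return (long)len;	/* pretend the whole buffer was filled */
}

/*
 * Stand-in for the recv stub: decode the four registers recv() uses
 * (di, si, dx, r10) and pass NULL for the two recvfrom()-only arguments,
 * which corresponds to the two xor instructions in the listing above.
 */
long sketch_sys_recv(const struct regs_sketch *regs)
{
	return sketch_sys_recvfrom((int)regs->di, (void *)regs->si,
				   (size_t)regs->dx, (unsigned int)regs->r10,
				   NULL, NULL);
}

/* Small self-check: the stub must forward its four arguments unchanged. */
void recv_selfcheck(void)
{
	char buf[8];
	struct regs_sketch regs = {
		.di = 5, .si = (unsigned long)buf, .dx = sizeof(buf), .r10 = 0,
	};

	assert(sketch_sys_recv(&regs) == (long)sizeof(buf));
	assert(last_call.buf == (void *)buf);
}
```

A recvfrom stub written the same way would read all six fields and pass them
through unchanged, which matches the six movs and no xor mentioned above.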
So the result is nothing spectacular or unusual, but pretty equivalent and
possibly even shorter compared to the current codepath.
Thanks,
Dominik