linux-kernel - Re: [PATCH -next 00/22] remove in-kernel syscall invocations (part 2 == netdev)

Message-ID: <20180316193948.GA12435@light.dominikbrodowski.net>
Date:   Fri, 16 Mar 2018 20:39:48 +0100
From:   Dominik Brodowski <linux@...inikbrodowski.net>
To:     David Miller <davem@...emloft.net>
Cc:     linux-kernel@...r.kernel.org, torvalds@...ux-foundation.org,
        netdev@...r.kernel.org
Subject: Re: [PATCH -next 00/22] remove in-kernel syscall invocations (part 2
 == netdev)

On Fri, Mar 16, 2018 at 02:30:21PM -0400, David Miller wrote:
> From: Dominik Brodowski <linux@...inikbrodowski.net>
> Date: Fri, 16 Mar 2018 18:05:52 +0100
> 
> > The rationale of this change is described in patch 1 of part 1[*] as follows:
> > 
> > 	The syscall entry points to the kernel defined by SYSCALL_DEFINEx()
> > 	and COMPAT_SYSCALL_DEFINEx() should only be called from userspace
> > 	through kernel entry points, but not from the kernel itself. This
> > 	will allow cleanups and optimizations to the entry paths *and* to
> > 	the parts of the kernel code which currently need to pretend to be
> > 	userspace in order to make use of syscalls.
> > 
> > At present, these patches are based on v4.16-rc5; there is one trivial
> > conflict against net-next. Dave, I presume that you prefer to take them
> > through net-next? If you want to, I can re-base them against net-next.
> > If you prefer otherwise, though, I can route them as part of my whole
> > syscall series.
> 
> So the transformations themselves are relatively trivial, so on that
> aspect I don't have any problems with these changes.

Thank you for your fast feedback.

> But overall I have to wonder.
> 
> I imagine one of the things you'd like to do is declare that syscall
> entries use a different (better) argument passing scheme.  For
> example, passing values in registers instead of on the stack.

Well, sort of. Currently, x86-64 decodes all six registers unconditionally:

		regs->ax = sys_call_table[nr](
			regs->di, regs->si, regs->dx,
			regs->r10, regs->r8, regs->r9);

so that in do_syscall_64(), we have to get six parameters from the
stack:

	mov    0x38(%rbx),%rcx
	mov    0x60(%rbx),%rdx
	mov    0x68(%rbx),%rsi
	mov    0x70(%rbx),%rdi
	mov    0x40(%rbx),%r9
	mov    0x48(%rbx),%r8

Instead, the aim is to do

	regs->ax = sys_call_table[nr](regs)

... which results in just a register rename operation:

	mov    %rbp,%rdi
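
As a rough user-space illustration of the two dispatch styles contrasted above (not kernel code): the sketch below uses a cut-down stand-in for struct pt_regs and a made-up two-argument "syscall". All demo_* names are hypothetical and exist only for this example.

```c
#include <stddef.h>

/* Simplified stand-in for x86-64 struct pt_regs: only the fields used here. */
struct demo_regs {
	unsigned long di, si, dx, r10, r8, r9, ax;
};

/* Current style: every handler takes six arguments. */
typedef long (*demo_sys6_t)(unsigned long, unsigned long, unsigned long,
			    unsigned long, unsigned long, unsigned long);

/* Proposed style: every handler takes a pointer to the saved registers. */
typedef long (*demo_sysregs_t)(const struct demo_regs *);

/* Hypothetical syscall body that needs only two parameters. */
static long demo_add(unsigned long a, unsigned long b)
{
	return (long)(a + b);
}

/* Six-argument entry point: four of the arguments are loaded for nothing. */
static long demo_add_six(unsigned long a, unsigned long b, unsigned long c,
			 unsigned long d, unsigned long e, unsigned long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return demo_add(a, b);
}

/* Register-block entry point: decodes exactly the two values it cares about. */
static long demo_add_regs(const struct demo_regs *regs)
{
	return demo_add(regs->di, regs->si);
}

static const demo_sys6_t demo_table6[] = { demo_add_six };
static const demo_sysregs_t demo_table_regs[] = { demo_add_regs };

/* Dispatcher that must read all six saved registers before the call. */
static void demo_dispatch_six(struct demo_regs *regs, size_t nr)
{
	regs->ax = (unsigned long)demo_table6[nr](regs->di, regs->si, regs->dx,
						  regs->r10, regs->r8, regs->r9);
}

/* Dispatcher that only forwards the pointer it already holds. */
static void demo_dispatch_regs(struct demo_regs *regs, size_t nr)
{
	regs->ax = (unsigned long)demo_table_regs[nr](regs);
}
```

Both dispatchers leave the same value in regs->ax; the difference is only where the parameter decoding happens.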

> But in situations where you split out the system call function
> completely into one of these "helpers", the compiler is going
> to have two choices:
> 
> 1) Expand the helper into the syscall function inline, thus we end up
>    with two copies of the function.

That's only sensible for very short stubs, which just call another function
(e.g. __compat_sys_sendmsg()).

> 2) Call the helper from the syscall function.  Well, then the compiler
>    will need to pop the syscall-obtained arguments from the registers
>    onto the stack.
> 
> So this doesn't seem like such a total win to me.
> 
> Maybe you can explain things better to ease my concerns.

For example, for sys_recv() and sys_recvfrom(), if all is complete, this
results in:

sys_x86_64_recv:
	callq <__fentry__>
	/* decode struct pt_regs for exactly those parameters
	 * we care about
	 */
	mov    0x38(%rdi),%rcx
	xor    %r9d,%r9d
	xor    %r8d,%r8d
	mov    0x60(%rdi),%rdx
	mov    0x68(%rdi),%rsi
	mov    0x70(%rdi),%rdi

	/* call __sys_recvfrom */
	callq  <__sys_recvfrom>

	/* cleanup and return */
	cltq
	retq

That's only obtaining four entries from the stack, and two register clearing
operations; sys_x86_64_recvfrom is similar (6 movs from stack, one register
rename mov, no xor).
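
In C terms, the two stubs might look roughly like the following user-space sketch. It assumes a simplified register block in place of struct pt_regs and a recording helper standing in for __sys_recvfrom(); every sketch_* name is hypothetical. The two NULL arguments in the recv stub correspond to the two xor instructions in the listing above.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the six argument slots of struct pt_regs. */
struct sketch_regs {
	unsigned long di, si, dx, r10, r8, r9;
};

/* Records the last helper call so the decoding can be inspected. */
struct sketch_call {
	int fd;
	void *buf;
	size_t len;
	unsigned int flags;
	void *addr;
	int *addr_len;
};

static struct sketch_call sketch_last;

/* Stand-in for the shared helper: it only records its arguments. */
static long sketch_recvfrom(int fd, void *buf, size_t len, unsigned int flags,
			    void *addr, int *addr_len)
{
	sketch_last = (struct sketch_call){ fd, buf, len, flags, addr, addr_len };
	return (long)len; /* pretend the whole buffer was filled */
}

/* recv stub: four values decoded from the register block, two constants. */
static long sketch_sys_recv(const struct sketch_regs *regs)
{
	return sketch_recvfrom((int)regs->di, (void *)(uintptr_t)regs->si,
			       (size_t)regs->dx, (unsigned int)regs->r10,
			       NULL, NULL);
}

/* recvfrom stub: all six values decoded from the register block. */
static long sketch_sys_recvfrom(const struct sketch_regs *regs)
{
	return sketch_recvfrom((int)regs->di, (void *)(uintptr_t)regs->si,
			       (size_t)regs->dx, (unsigned int)regs->r10,
			       (void *)(uintptr_t)regs->r8,
			       (int *)(uintptr_t)regs->r9);
}
```

The recv stub never reads the r8 and r9 slots, so whatever userspace left in those registers is ignored.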

__sys_recvfrom() then does the actual work, starting with pushing some
register context out of the way and moving registers around, more or less
what SyS_recvfrom() does today.

So the result is nothing spectacular or unusual, but pretty equivalent and
possibly even shorter compared to the current code path.

Thanks,
	Dominik