Message-ID: <CA+55aFwo7yA1gm8AUYMEQA8ZNY-9GGF8Oup09jJFvEa4J7C+jA@mail.gmail.com>
Date:   Fri, 16 Mar 2018 12:42:16 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     David Miller <davem@...emloft.net>
Cc:     Dominik Brodowski <linux@...inikbrodowski.net>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH -next 00/22] remove in-kernel syscall invocations (part 2
 == netdev)

On Fri, Mar 16, 2018 at 11:30 AM, David Miller <davem@...emloft.net> wrote:
>
> I imagine one of the things you'd like to do is declare that syscall
> entries use a different (better) argument passing scheme.  For
> example, passing values in registers instead of on the stack.

Actually, it's almost exactly the reverse.

On x86-64, we'd like to just pass the 'struct pt_regs *' pointer, and
have the sys_xyz() function itself just pick out the arguments it
needs from there.

That has a few reasons for it:

 - we can clear all registers at system call entry, which helps defeat
some of the "pass seldom used register with user-controlled value that
survives deep into the callchain" things that people used to leak
information

 - we can streamline the low-level system call code, which needs to
pass around 'struct pt_regs *' anyway, and the system call only picks
up the values it actually needs

 - it's really quite easy(*) to just make the SYSCALL_DEFINEx() macros
just do it all with a wrapper inline function

but it fundamentally means that you *cannot* call 'sys_xyz()' from
within the kernel, unless you then do it with something crazy like

    struct pt_regs myregs;
    ... fill in the right registers for this architecture _if_ this
architecture uses ptregs ..
    sys_xyz(&myregs);

which I somehow really doubt you want to do in the networking code.
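
As a rough userspace sketch of the wrapper idea and of the awkward
in-kernel call described above: everything named mock_* or MOCK_* here
is made up for illustration and is not the kernel's actual
SYSCALL_DEFINEx() machinery. It only assumes the x86-64 convention that
the first two system call arguments arrive in rdi and rsi.

```c
/* Hypothetical stand-in for the kernel's struct pt_regs: only the six
 * x86-64 system call argument registers are modelled. */
struct mock_pt_regs {
    unsigned long di, si, dx, r10, r8, r9;
};

/* Hypothetical macro (not the kernel's SYSCALL_DEFINEx()): declares an
 * inline body with typed arguments and emits one entry point that takes
 * only the register block and picks out the arguments it needs. */
#define MOCK_SYSCALL_DEFINE2(name, t1, a1, t2, a2)              \
    static inline long mock_do_##name(t1 a1, t2 a2);            \
    long mock_sys_##name(const struct mock_pt_regs *regs)       \
    {                                                           \
        return mock_do_##name((t1)regs->di, (t2)regs->si);      \
    }                                                           \
    static inline long mock_do_##name(t1 a1, t2 a2)

/* A made-up two-argument "system call" that just adds its arguments. */
MOCK_SYSCALL_DEFINE2(xyz, long, a, long, b)
{
    return a + b;
}

/* The awkward in-kernel call: a register block has to be built by hand
 * before the entry point can be used. */
long mock_in_kernel_caller(void)
{
    struct mock_pt_regs myregs = { .di = 2, .si = 40 };

    return mock_sys_xyz(&myregs);
}
```

The only externally visible function per call is mock_sys_xyz(), so the
compiler is free to load each register value exactly where the inlined
body needs it.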

Now, I did do one version that just created two entrypoints for every
single system call - the "kernel version" and the "real" system call
version. That sucks, because you have two choices:

 - either pointlessly generate extra code for the 200+ system calls
that are *not* used by the kernel

 - or let gcc just merge the two, and make code generation suck where
the real system call just loads the registers and jumps to the common
code.

That second option really does suck, because if you let the compiler
just generate the _single_ system call, it will do the "load actual
value from ptregs" much more nicely, and only when it needs it, and
schedules it all into the system call code.

So just making the rule be: "you mustn't call the SYSCALL_DEFINEx()
functions from anything but the system call code" really makes
everything better.

Then you only need to fix up the *handful* or so of system calls that
actually have in-kernel callers.
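
One possible shape for such a fix-up, sketched in userspace with
made-up names (mock_do_xyz, mock_sys_xyz and mock_kernel_user are
hypothetical and not taken from the patch series): the in-kernel user
calls a typed helper directly, and the system call entry point becomes
a thin shim that only the system call code calls.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct pt_regs. */
struct mock_pt_regs {
    unsigned long di, si, dx, r10, r8, r9;
};

/* Hypothetical helper with ordinary typed arguments. Kernel code that
 * needs this functionality calls the helper, never the entry point. */
long mock_do_xyz(int fd, size_t len)
{
    if (fd < 0)
        return -1;
    return (long)len;
}

/* Hypothetical system call entry point: only the system call dispatch
 * code is allowed to call this. */
long mock_sys_xyz(const struct mock_pt_regs *regs)
{
    return mock_do_xyz((int)regs->di, (size_t)regs->si);
}

/* Hypothetical in-kernel user: no register block is needed. */
long mock_kernel_user(void)
{
    return mock_do_xyz(3, 128);
}
```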

Many of them end up being things that could be improved on further
anyway (ie there's discussion about further cleanup and trying to
avoid using "set_fs()" for arguments etc, because there already exist
helper functions that take the kernel-space versions, and the
sys_xyz() version is actually just going through stupid extra work for
a kernel user).

                    Linus

(*) The "really quite easy" is only true on 64-bit architectures.
32-bit architectures have issues with packing 64-bit values into two
registers, so using macro expansion with just the number of arguments
doesn't work.
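
A small userspace sketch of that 32-bit problem, under stated
assumptions: mock_regs32, mock_arg64 and mock_sys_xyz32 are
hypothetical names, and the "low word first, no pair alignment" layout
is an assumption (real 32-bit ABIs differ in both ordering and
alignment of the register pair). The point is only that a call with two
C arguments can consume three register slots, so the argument count
alone does not say which slots to read.

```c
#include <stdint.h>

/* Hypothetical 32-bit register block: six 32-bit argument slots. */
struct mock_regs32 {
    uint32_t arg[6];
};

/* Combine two adjacent 32-bit slots into one 64-bit value. Low word
 * first is an assumption made for this sketch. */
static inline uint64_t mock_arg64(const struct mock_regs32 *regs, int lo_slot)
{
    return ((uint64_t)regs->arg[lo_slot + 1] << 32) | regs->arg[lo_slot];
}

/* Hypothetical call taking (int fd, uint64_t offset): two C arguments,
 * but slots 0, 1 and 2 are all consumed. */
uint64_t mock_sys_xyz32(const struct mock_regs32 *regs)
{
    int fd = (int)regs->arg[0];
    uint64_t offset = mock_arg64(regs, 1);

    return offset + (uint64_t)fd;
}
```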
