Message-ID: <20190913073530.GA125477@gmail.com>
Date:   Fri, 13 Sep 2019 09:35:30 +0200
From:   Ingo Molnar <mingo@...nel.org>
To:     Borislav Petkov <bp@...en8.de>
Cc:     x86-ml <x86@...nel.org>, Andy Lutomirski <luto@...nel.org>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        lkml <linux-kernel@...r.kernel.org>
Subject: Re: [RFC] Improve memset


* Borislav Petkov <bp@...en8.de> wrote:

> Hi,
> 
> since the merge window is closing in and y'all are at a conference, I
> thought I should take another stab at this. It's something which Ingo,
> Linus and Peter have each suggested at least once in the past.
> 
> Instead of calling memset:
> 
> ffffffff8100cd8d:       e8 0e 15 7a 00          callq  ffffffff817ae2a0 <__memset>
> 
> and having a JMP inside it depending on the feature supported, let's simply
> have the REP; STOSB directly in the code:
> 
> ...
> ffffffff81000442:       4c 89 d7                mov    %r10,%rdi
> ffffffff81000445:       b9 00 10 00 00          mov    $0x1000,%ecx
> 
> <---- new memset
> ffffffff8100044a:       f3 aa                   rep stos %al,%es:(%rdi)
> ffffffff8100044c:       90                      nop
> ffffffff8100044d:       90                      nop
> ffffffff8100044e:       90                      nop
> <----
> 
> ffffffff8100044f:       4c 8d 84 24 98 00 00    lea    0x98(%rsp),%r8
> ffffffff81000456:       00
> ...
> 
> And since the majority of x86 boxes out there are Intel, they have
> X86_FEATURE_ERMS, so they won't even need to alternative-patch those call
> sites when booting. (The three NOPs above pad the two-byte REP; STOSB to
> the five bytes a CALL needs, so there's room to patch.)
> 
> In order to patch on machines which don't set X86_FEATURE_ERMS, I need
> to do a "reversed" patching of sorts, i.e., patch when the x86 feature
> flag is NOT set. See the below changes in alternative.c which basically
> add a flags field to struct alt_instr and thus control the patching
> behavior in apply_alternatives().
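> 
> Roughly like this - a minimal sketch only, not the actual diff; the
> ALT_FLAG_REVERSE name and the exact field layout are illustrative:
> 
> struct alt_instr {
>         s32 instr_offset;       /* original instruction */
>         s32 repl_offset;        /* offset to replacement instruction */
>         u16 cpuid;              /* feature bit for the replacement */
>         u8  instrlen;           /* length of original instruction */
>         u8  replacementlen;     /* length of new instruction */
>         u8  flags;              /* e.g. ALT_FLAG_REVERSE */
> } __packed;
> 
> /* in apply_alternatives(), per entry: */
> if (a->flags & ALT_FLAG_REVERSE) {
>         /* reversed: leave the inline code alone if the feature IS set */
>         if (boot_cpu_has(a->cpuid))
>                 continue;
> } else {
>         if (!boot_cpu_has(a->cpuid))
>                 continue;
> }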
> 
> The result is this:
> 
> static __always_inline void *memset(void *dest, int c, size_t n)
> {
>         void *ret, *dummy;
> 
>         asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
>                                            "call memset_rep",  X86_FEATURE_ERMS,
>                                            "call memset_orig", X86_FEATURE_REP_GOOD)
>                 : "=&D" (ret), "=a" (dummy)
>                 : "0" (dest), "a" (c), "c" (n)
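>                 /* "0"/"a"/"c" place dest/c/n in %rdi/%al/%rcx - the REP; STOSB layout */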
>                 /* clobbers used by memset_orig() and memset_rep() */
>                 : "rsi", "rdx", "r8", "r9", "memory");
> 
>         return dest;
> }
> 
> and so in the !ERMS case, we patch in a call to the memset_rep() version,
> which is the old variant from memset_64.S. There we need to do some
> register shuffling because I need to map the registers from where REP;
> STOSB expects them to where the x86_64 ABI wants them. Not a big deal - a
> push, two moves and a pop at the end.
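> 
> For illustration, the entry glue could look roughly like this - a
> hypothetical sketch, not the actual code; the old memset_64.S body
> takes its arguments in the usual x86_64 ABI registers:
> 
> memset_rep:
>         pushq   %rdi            /* memset() must return dest */
>         movzbl  %al, %esi       /* c: %al (REP; STOSB layout) -> %esi (ABI) */
>         movq    %rcx, %rdx      /* n: %rcx (REP; STOSB layout) -> %rdx (ABI) */
>         /* ... old memset_64.S body, consuming %rdi/%esi/%rdx ... */
>         popq    %rax            /* return the original dest */
>         ret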
> 
> If X86_FEATURE_REP_GOOD is not set either, we fall back to another call
> to the original unrolled memset.
> 
> The rest of the diff is me trying to untangle memset()'s definitions
> from the early code too, because we include kernel proper headers there
> and all kinds of crazy include hell ensues - but more on that later.
> 
> Anyway, this is just a pre-alpha version to get people's thoughts and
> to see whether I'm headed in the right direction or whether you guys
> have better ideas.

That looks exciting - I'm wondering what effect this has on code
footprint - for example on defconfig vmlinux code size - and what the
average per-call-site footprint impact is?
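
(Judging by the disassembly quoted above, the call site itself looks
size-neutral: the CALLQ is E8 + rel32, i.e. 5 bytes, while REP; STOSB
plus the three NOPs is 2 + 3 = 5 bytes as well - so I'd guess any
footprint delta comes mostly from the extra clobbers forcing
spills/reloads around each call site.)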

If the footprint effect is acceptable, then I'd expect this to improve 
performance, especially in hot loops.

Thanks,

	Ingo
