Date:	Sat, 10 Jan 2015 13:27:20 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Denys Vlasenko <vda.linux@...glemail.com>
Cc:	Andy Lutomirski <luto@...capital.net>,
	Borislav Petkov <bp@...en8.de>,
	Denys Vlasenko <dvlasenk@...hat.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Oleg Nesterov <oleg@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	X86 ML <x86@...nel.org>, Alexei Starovoitov <ast@...mgrid.com>,
	Will Drewry <wad@...omium.org>,
	Kees Cook <keescook@...omium.org>
Subject: Re: [PATCH 3/4] x86: open-code register save/restore in
 trace_hardirqs thunks

On Sat, Jan 10, 2015 at 1:09 PM, Denys Vlasenko
<vda.linux@...glemail.com> wrote:
>
> I think using push/pop is okay. In the very hottest code paths
> you may want to prefer mov's.

For kernel entrypoints in particular, the code sequence is quite
possibly constrained by the decoder and instruction fetch rather than
the execution engine. Even if the entrypoint were to be in the L1 I$
(which is not generally the case except in microbenchmarks), I am
pretty sure that even Intel doesn't actually speculatively decode
across system call boundaries, so unlike normal nice code, you don't
have the front end running ahead of the execution engine.

Looking at the system call hotpath, for example, it looks like we
save/restore 8 registers. So 16 instructions or about 80 bytes of
code. I could easily imagine us avoiding one cacheline access by using
shorter 1- and 2-byte push/pop instructions (depending a bit on how
the cacheline alignment works out, of course).
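
A rough sketch of the arithmetic above, under stated assumptions: a
5-byte encoding for each %rsp-relative mov (REX.W + opcode + ModRM + SIB
+ disp8), 1-byte push/pop for rax..rdi and 2-byte push/pop for r8..r15,
a hypothetical mix of four legacy and four extended registers, a 64-byte
cacheline, and the save and restore sequences treated as one contiguous
run of bytes (in real code they are separated by the call in between).

```python
CACHELINE = 64     # bytes; assumed instruction cache line size

MOV_BYTES = 5      # assumed: mov %reg, disp8(%rsp) with REX.W, ModRM, SIB, disp8
PUSH_LEGACY = 1    # push/pop of rax..rdi: single opcode byte
PUSH_EXT = 2       # push/pop of r8..r15: REX prefix + opcode byte

N_REGS = 8         # registers saved and restored, as in the message
N_LEGACY = 4       # hypothetical split: 4 legacy + 4 extended registers
N_EXT = N_REGS - N_LEGACY

# Save + restore, so every register costs two instructions.
mov_total = 2 * N_REGS * MOV_BYTES
push_total = 2 * (N_LEGACY * PUSH_LEGACY + N_EXT * PUSH_EXT)


def lines_touched(start_offset, length, line=CACHELINE):
    """Count cache lines covered by `length` bytes starting at `start_offset`."""
    if length == 0:
        return 0
    first = start_offset // line
    last = (start_offset + length - 1) // line
    return last - first + 1


def best_and_worst(length, line=CACHELINE):
    """Min and max lines touched over every possible starting alignment."""
    counts = [lines_touched(off, length, line) for off in range(line)]
    return min(counts), max(counts)


if __name__ == "__main__":
    for name, total in (("mov", mov_total), ("push/pop", push_total)):
        best, worst = best_and_worst(total)
        print(f"{name}: {total} bytes, {best} to {worst} cache lines")
```

Under these assumptions the mov sequence always spans at least two
cachelines, while the push/pop sequence can fit in one, which is the
"one cacheline access" difference the paragraph refers to. The exact
outcome depends on where the sequence starts within a line.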

Depending on how well it prefetches from L2 and/or exact decoder
details, that kind of issue *can* overshadow the actual execution
costs. Of course, on microbenchmarks (e.g. some system call benchmark
that does "getppid()" in a loop), even the kernel side stays in the
L1, so those might show possible execution issues more. And
macrobenchmarks probably won't show a cycle or two in the system call
or fault path anyway.
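
For reference, the kind of "getppid() in a loop" microbenchmark
mentioned above might look like the sketch below. The function name and
iteration count are hypothetical, and in Python the interpreter overhead
likely dominates the measured time, so this only shows the shape of such
a loop, not a way to resolve a cycle or two in the entry path.

```python
import os
import time


def getppid_loop_ns(iterations=100_000):
    """Average wall-clock nanoseconds per os.getppid() call in a tight loop."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.getppid()  # cheap call; the same small code path is hit every time
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


if __name__ == "__main__":
    print(f"{getppid_loop_ns():.1f} ns per call (includes interpreter overhead)")
```

Because the loop hits the same small code path every iteration, the
kernel entry code stays cached, which is why such a benchmark hides the
instruction fetch effects described earlier.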

                    Linus