Message-ID: <CA+55aFyvGkuM33C8ki=5-11idWWK4tHKuPaSrSb0FPTaJmC_iQ@mail.gmail.com>
Date: Mon, 5 Feb 2018 13:58:56 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Dan Williams <dan.j.williams@...el.com>
Cc: Ingo Molnar <mingo@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Andi Kleen <ak@...ux.intel.com>, X86 ML <x86@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Andy Lutomirski <luto@...nel.org>,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v2 1/3] x86/entry: Clear extra registers beyond syscall
arguments for 64bit kernels
On Mon, Feb 5, 2018 at 1:33 PM, Dan Williams <dan.j.williams@...el.com> wrote:
>
> On a suggestion from Arjan it also appears worthwhile to interleave
> 'mov' with 'xor'. Perf stat says that this test gets 3.45 instructions
> per cycle:
Ugh.
A "xor %reg/reg" is two bytes (three for the high regs due to REX
prefix). A "mov $0" is 7 bytes because unlike most of the ALU ops,
"mov" doesn't have an 8-bit sign-extending immediate form.
So replacing those xors with movq's will add at least four bytes per
replacement. So you may well end up adding an L1 cache miss.
At which point "3.45 ipc" vs "2.88 ipc" is pretty much a non-issue.
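To make the byte counts above concrete, here is a minimal sketch that tabulates hand-assembled x86-64 encodings and the size difference per replacement. It assumes the 32-bit "xorl" form (which also zeroes the upper half of the 64-bit register) and the 64-bit "movq $0" form with a 32-bit immediate; the table name ENCODINGS and the helper extra_bytes() are made up for illustration and are not from the patch.

```python
# Hand-assembled x86-64 machine code for the zeroing idioms discussed above.
# Assumption: "xor" means the 32-bit xorl form, "mov $0" means movq with imm32.
ENCODINGS = {
    "xorl %eax, %eax": bytes([0x31, 0xC0]),                    # opcode + ModRM
    "xorl %r8d, %r8d": bytes([0x45, 0x31, 0xC0]),              # REX + opcode + ModRM
    "movq $0, %rax":   bytes([0x48, 0xC7, 0xC0, 0, 0, 0, 0]),  # REX.W + opcode + ModRM + imm32
    "movq $0, %r8":    bytes([0x49, 0xC7, 0xC0, 0, 0, 0, 0]),  # REX.W|B + opcode + ModRM + imm32
}


def extra_bytes(xor_insn, mov_insn):
    """Bytes added to the instruction stream by swapping xor_insn for mov_insn."""
    return len(ENCODINGS[mov_insn]) - len(ENCODINGS[xor_insn])


if __name__ == "__main__":
    for insn, code in ENCODINGS.items():
        print(f"{insn:18s} {len(code)} bytes  {code.hex(' ')}")
    print("low reg  xor -> mov:", extra_bytes("xorl %eax, %eax", "movq $0, %rax"), "extra bytes")
    print("high reg xor -> mov:", extra_bytes("xorl %r8d, %r8d", "movq $0, %r8"), "extra bytes")
```

The high-register case is where the "at least four bytes" figure comes from; the low-register case costs one byte more.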
I suspect that a bigger win would be if you try to interleave those
"xor" instructions with the "pushq" instructions in the entry code.
Because those push instructions tend to be limited by the LSU store
bandwidth, so you can probably put in xor instructions almost for free
in there.
Linus