Message-ID: <20150304084240.GA3233@pd.tnic>
Date:	Wed, 4 Mar 2015 09:42:40 +0100
From:	Borislav Petkov <bp@...en8.de>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	X86 ML <x86@...nel.org>, Andy Lutomirski <luto@...capital.net>,
	LKML <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v2 05/15] x86/alternatives: Use optimized NOPs for padding

On Wed, Mar 04, 2015 at 07:43:03AM +0100, Ingo Molnar wrote:
> But the main question is, do such alignment details ever matter to
> decoder performance?

Right, so I have some ideas about it, but all of this is probably very
uarch-specific, so what follows is mostly speculation.

Both optimization manuals suggest that longer NOPs are better, probably
because the fewer micro-ops you decode from the NOPs, the fewer
instructions you have in the pipe, the fewer resources they occupy, etc.
Thus, making NOPs longer with prefixes is better than having multiple
shorter, unprefixed NOPs.
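
To make the "fewer, longer NOPs" point concrete, here is a minimal
sketch in C. It is not the kernel's actual padding code: the helper
names (pad_with_long_nops, pad_with_short_nops) and the 8-byte cap are
assumptions made for illustration. The byte sequences are the
multi-byte NOP encodings recommended in the Intel SDM.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NOP_LEN 8 /* assumption: cap chosen for this sketch */

/* Recommended multi-byte NOP encodings, indexed by length (1..8). */
static const unsigned char nop_table[MAX_NOP_LEN + 1][MAX_NOP_LEN] = {
    {0},                                              /* unused */
    {0x90},                                           /* nop */
    {0x66, 0x90},                                     /* 66 nop */
    {0x0f, 0x1f, 0x00},                               /* nopl (%rax) */
    {0x0f, 0x1f, 0x40, 0x00},                         /* nopl 0(%rax) */
    {0x0f, 0x1f, 0x44, 0x00, 0x00},                   /* nopl 0(%rax,%rax,1) */
    {0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00},             /* nopw 0(%rax,%rax,1) */
    {0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00},       /* nopl 0L(%rax) */
    {0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00}, /* nopl 0L(%rax,%rax,1) */
};

/*
 * Hypothetical helper: fill buf[0..len) with as few NOPs as possible
 * and return how many NOP instructions were emitted.
 */
static size_t pad_with_long_nops(unsigned char *buf, size_t len)
{
    size_t count = 0;

    while (len > 0) {
        size_t n = len < MAX_NOP_LEN ? len : MAX_NOP_LEN;

        memcpy(buf, nop_table[n], n);
        buf += n;
        len -= n;
        count++;
    }
    return count;
}

/* Naive alternative: one single-byte NOP (0x90) per padding byte. */
static size_t pad_with_short_nops(unsigned char *buf, size_t len)
{
    memset(buf, 0x90, len);
    return len;
}
```

For an 11-byte hole, the first helper emits two instructions (an 8-byte
and a 3-byte NOP) where the naive one emits eleven, which is the
difference in decoded instructions being speculated about above.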

Then there's the dependency on operands: NOPs generally reference rAX.
On some uarches this dependency is broken much earlier, so the micro-op
gets scheduled earlier, thus freeing resources earlier, etc.

If a NOP crosses a cacheline boundary, you obviously can't know it is a
NOP yet, so you have to fetch the next cacheline to finish decoding it.
In the best case that fetch hits in L1 and in the worst case it goes all
the way to memory.
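
As a rough illustration of the cacheline-crossing case, the check below
tells whether an instruction of a given length straddles a line
boundary. It is only a sketch: the helper name is made up and the
64-byte line size is an assumption (it is typical on current x86 parts,
but it is not something this thread states).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64u /* assumption: typical x86 cacheline size */

/*
 * Hypothetical helper: true if the len bytes starting at addr do not
 * all live in the same cacheline.
 */
static bool insn_crosses_cacheline(uintptr_t addr, size_t len)
{
    if (len == 0)
        return false;

    /* Compare the line index of the first and the last byte. */
    return (addr / CACHELINE_SIZE) != ((addr + len - 1) / CACHELINE_SIZE);
}
```

So an 8-byte NOP starting at offset 56 of a line still fits in it, while
the same NOP starting at offset 60 needs the following line before it
can be fully decoded.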

And on modern machines this is probably barely noticeable. I highly
doubt we'll see it even in a hotpath. But it could be a good measuring
exercise :)

Btw, when we pad JMPs with NOPs, we shouldn't be affected because the
unconditional JMP means we never decode the NOPs behind it. Modern,
hungry prefetching beasts would've probably fetched and decoded a bunch
of instructions along with the NOPs, but they should see the JMP and
stop prefetching after it. Who knows, uarch stuff.

See, all speculations :)

Btw #2, and more importantly: this patchset of mine doesn't necessarily
enlarge the patch sites. Before it, you had to add properly sized NOPs
explicitly; now we let the toolchain do it for us, which sometimes even
leads to smaller NOPs, depending on the instruction the toolchain
generates.

With the JMP optimizations, the instructions become smaller too.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.