linux-kernel - Re: [PATCH] x86: Align jump targets to 1 byte boundaries


Date:	Fri, 10 Apr 2015 17:48:51 +0200
From:	Denys Vlasenko <dvlasenk@...hat.com>
To:	Borislav Petkov <bp@...en8.de>
CC:	Ingo Molnar <mingo@...nel.org>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Jason Low <jason.low2@...com>,
	Peter Zijlstra <peterz@...radead.org>,
	Davidlohr Bueso <dave@...olabs.net>,
	Tim Chen <tim.c.chen@...ux.intel.com>,
	Aswin Chandramouleeswaran <aswin@...com>,
	LKML <linux-kernel@...r.kernel.org>,
	Andy Lutomirski <luto@...capital.net>,
	Brian Gerst <brgerst@...il.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 04/10/2015 05:25 PM, Borislav Petkov wrote:
>> measured fetch rate was up to 16 bytes per clock per core when two cores were active, and
>> up to 21 bytes per clock in linear code when only one core was active. The fetch rate is
>> lower than these maximum values when instructions are misaligned.
>> Critical subroutine entries and loop entries should not start near the end of a 32-bytes block.
>> You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in
>> the first four instructions after a critical label.
>> """
> 
> All F15h models are Bulldozer uarch with improvements. For example,
> later F15h models have things like loop buffer and loop predictor
> which can replay loops under certain conditions, thus diminishing the
> importance of the fetch window size wrt loop performance.
> 
> And then there's AMD F16h Software Optimization Guide, that's the Jaguar
> uarch:
> 
> "...The processor can fetch 32 bytes per cycle and can scan two 16-byte
> instruction windows for up to two instruction decodes per cycle.

As you know, manuals are not be-all, end-all documents.
They contain mistakes. They are also written before silicon
is finalized, and sometimes they advertise capabilities
which in the end had to be downscaled. It's hard to check
a 1000+ page document and correct all mistakes, especially
hard-to-quantify ones.

In the same document by Agner Fog, he says that he failed to confirm
32-byte fetch on Fam16h CPUs:

"""
16 AMD Bobcat and Jaguar pipeline
...
...
16.2 Instruction fetch
The instruction fetch rate is stated as "up to 32 bytes per cycle",
but this is not confirmed by my measurements which consistently show
a maximum of 16 bytes per clock cycle on average for both Bobcat and
Jaguar. Some reports say that the Jaguar has a loop buffer,
but I cannot detect any improvement in performance for tiny loops.
"""

