linux-kernel - Re: Odd performance results

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20160712182635.GV7094@linux.vnet.ibm.com>
Date:	Tue, 12 Jul 2016 11:26:35 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	"H. Peter Anvin" <hpa@...or.com>
Cc:	Peter Zijlstra <peterz@...radead.org>, tglx@...utronix.de,
	mingo@...e.hu, ak@...ux.intel.com, linux-kernel@...r.kernel.org
Subject: Re: Odd performance results

On Tue, Jul 12, 2016 at 10:49:58AM -0700, H. Peter Anvin wrote:
> On 07/12/16 08:05, Paul E. McKenney wrote:
> > On Tue, Jul 12, 2016 at 04:55:51PM +0200, Peter Zijlstra wrote:
> >> On Sun, Jul 10, 2016 at 07:43:27AM -0700, Paul E. McKenney wrote:
> >>> On Sun, Jul 10, 2016 at 07:17:19AM +0200, Peter Zijlstra wrote:
> >>>>
> >>>>
> >>>> On 10 July 2016 06:26:39 CEST, "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com> wrote:
> >>>>> Hello!
> >>>>>
> >>>>> So I ran a quick benchmark which showed stair-step results.  I
> >>>>> immediately
> >>>>> thought "Ah, this is due to CPU 0 and 1, 2 and 3, 4 and 5, and 6 and 7
> >>>>> being threads in a core."  Then I thought "Wait, this is an x86!"
> >>>>> Then I dumped out cpu*/topology/thread_siblings_list, getting the
> >>>>> following:
> >>>>>
> >>>>> 	cpu0/topology/thread_siblings_list: 0-1
> >>>>> 	cpu1/topology/thread_siblings_list: 0-1
> >>>>> 	cpu2/topology/thread_siblings_list: 2-3
> >>>>> 	cpu3/topology/thread_siblings_list: 2-3
> >>>>> 	cpu4/topology/thread_siblings_list: 4-5
> >>>>> 	cpu5/topology/thread_siblings_list: 4-5
> >>>>> 	cpu6/topology/thread_siblings_list: 6-7
> >>>>> 	cpu7/topology/thread_siblings_list: 6-7
> >>>>
> >>>>
> >>>> I'm guessing this is an AMD bulldozer like machine?
> >>>
> >>> /proc/cpuinfo thinks otherwise:
> >>>
> >>> processor	: 0
> >>> vendor_id	: GenuineIntel
> >>> cpu family	: 6
> >>> model		: 60
> >>> model name	: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
> >>
> >> Weird, I've never seen an Intel box do that before... hpa, any idea? or
> >> is this just one weird BIOS.
> > 
> > ;-)
> > 
> > It is a Lenovo W541 laptop, for whatever that might be worth.  Roughly
> > on year old.
> 
> Well, the obvious thing here is that CPUs 0-1, 2-3, 4-5, and 6-7 *are*
> indeed threads in a core... Intel x86 products have supported
> multithreading since the Pentium 4.  So the "wait, this is an x86!" bit
> is strange to me.
> 
> The CPU in question (and /proc/cpuinfo should show this) has four cores
> with a total of eight threads.  The "siblings" and "cpu cores" fields in
> /proc/cpuinfo should show the same thing.  So I am utterly confused
> about what is unexpected here?

My prior experience with Intel x86 systems led me to expect that the
hardware-thread pairs would instead be 0 and 4, 1 and 5, 2 and 6, and 3
and 7.  This would result in a graph with a two-segment line, having
higher slope for the lower-numbered CPUs and a lower slope for the
higher-numbered CPUs, and I have in fact seen this behavior on older
Intel x86 systems.  See for example slides 64-67 of:

http://www.rdrop.com/users/paulmck/scalability/paper/Updates.2016.06.05a.TUDresden.pdf

But don't get me wrong, I do very much prefer the CPU-numbering approach
that my laptop uses, where the hardware threads in a given core have
consecutive numbers.

> Also, you mentioned absolutely nothing about what kind of benchmark it
> was, or what the "stairstepping" results imply, so it doesn't really
> make it any easier...

The benchmark was a POSIX-threads multithreaded benchmark with each
thread repeatedly searching a small linked list, which should fit into
the nearest-to-CPU cache.  The "stairstepping" results suggest to me
that a no-cache-miss pointer-following workload allows a single hardware
thread to consume most of a given core's relevant hardware resources,
at least on this particular chip.  Which is fine -- this sort of thing
always has been workload-specific.

If you want to see an example plot, take a look at:

	CodeSamples/defer/perf-rcu-qsbr.eps

within:

	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

							Thanx, Paul