Date:	Mon, 18 Jan 2016 09:01:49 -0800
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Jesper Dangaard Brouer <brouer@...hat.com>
Cc:	Felix Fietkau <nbd@...nwrt.org>,
	David Laight <David.Laight@...LAB.COM>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	David Miller <davem@...emloft.net>,
	Alexander Duyck <alexander.duyck@...il.com>,
	Alexei Starovoitov <alexei.starovoitov@...il.com>,
	Daniel Borkmann <borkmann@...earbox.net>,
	Marek Majkowski <marek@...udflare.com>,
	Hannes Frederic Sowa <hannes@...essinduktion.org>,
	Florian Westphal <fw@...len.de>,
	Paolo Abeni <pabeni@...hat.com>,
	John Fastabend <john.r.fastabend@...el.com>
Subject: Re: Optimizing instruction-cache, more packets at each stage

On Mon, 2016-01-18 at 12:54 +0100, Jesper Dangaard Brouer wrote:

> That is very interesting. This kind of icache optimization will then
> likely benefit lower-end devices more than high-end Intel CPUs :-)
> 
> AFAIK the Intel CPUs mask this icache problem by having an icache
> prefetcher and optimizing how fast the CPU can load/refill from higher
> level caches.  Intel CPUs have a lot of HW logic around this, which
> I assume the smaller CPUs don't.  E.g. a quote from the Intel Optimization
> Reference Manual:
> 
>  "The instruction fetch unit (IFU) can fetch up to 16 bytes of aligned
>   instruction bytes each cycle from the instruction cache to the
>   instruction length decoder (ILD). The instruction queue (IQ) buffers
>   the ILD-processed instructions and can deliver up to four instructions
>   in one cycle to the instruction decoder."
> 

This does not tell how many cores/threads can fetch 16 bytes per cycle.

With more than 36 execution units per socket, the peak performance of a
single unit does not reflect what happens when all units are busy and
contending on a shared resource.

If we want to properly exploit the L1 caches of each execution unit, we
need to split the load into a pipeline. But the number of units depends
on hardware capabilities (like L1 cache size), which is hard to code in
a generic way (in the Linux kernel).
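
As a purely illustrative userspace sketch (not kernel code; the
256-byte per-packet footprint and the "half of L1d" budget are assumed
numbers), here is one way a per-stage batch size could be derived from
the L1 data-cache size the system reports:

/* Illustrative userspace sketch: derive a per-stage batch size from
 * the L1 data-cache size.  PER_PKT_FOOTPRINT is an assumed estimate
 * of the bytes touched per packet in one pipeline stage. */
#include <stdio.h>
#include <unistd.h>

#define PER_PKT_FOOTPRINT 256	/* assumed bytes touched per packet */

int main(void)
{
	long l1d = sysconf(_SC_LEVEL1_DCACHE_SIZE);

	if (l1d <= 0)
		l1d = 32 * 1024;	/* fall back to a common L1d size */

	/* Use roughly half of L1d for the batch, leaving room for
	 * stack and metadata. */
	long batch = (l1d / 2) / PER_PKT_FOOTPRINT;

	printf("L1d=%ld bytes -> batch of %ld packets per stage\n",
	       l1d, batch);
	return 0;
}

In the kernel, the same sizing decision would have to come from the
architecture/cache-topology code, which is exactly the part that is
hard to do generically.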

For example, having the same core handle RX and TX interrupts is not
the best choice, especially when TX interrupts have to call expensive
callbacks into upper layers (TCP Small Queues).
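
For illustration, one way to keep RX and TX completion interrupts on
different cores is to write CPU masks to /proc/irq/<N>/smp_affinity;
the IRQ numbers and masks below are placeholders, the real ones would
come from /proc/interrupts for the NIC queues:

/* Illustrative sketch: pin an IRQ to a CPU set by writing a hex mask
 * to /proc/irq/<irq>/smp_affinity. */
#include <stdio.h>

static int set_irq_affinity(int irq, unsigned int cpu_mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", cpu_mask);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Placeholder IRQ numbers: RX stays on CPU0, TX completions go
	 * to CPU1, so TX-completion work (e.g. TSQ callbacks into TCP)
	 * does not compete with RX for the same L1/L2 caches. */
	set_irq_affinity(120, 0x1);	/* eth0-rx-0 -> CPU0 */
	set_irq_affinity(121, 0x2);	/* eth0-tx-0 -> CPU1 */
	return 0;
}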


