Message-ID: <1453136509.1223.238.camel@edumazet-glaptop2.roam.corp.google.com>
Date: Mon, 18 Jan 2016 09:01:49 -0800
From: Eric Dumazet <eric.dumazet@...il.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Felix Fietkau <nbd@...nwrt.org>,
David Laight <David.Laight@...LAB.COM>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
David Miller <davem@...emloft.net>,
Alexander Duyck <alexander.duyck@...il.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
Daniel Borkmann <borkmann@...earbox.net>,
Marek Majkowski <marek@...udflare.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Florian Westphal <fw@...len.de>,
Paolo Abeni <pabeni@...hat.com>,
John Fastabend <john.r.fastabend@...el.com>
Subject: Re: Optimizing instruction-cache, more packets at each stage
On Mon, 2016-01-18 at 12:54 +0100, Jesper Dangaard Brouer wrote:
> That is very interesting. This kind of icache optimization will then
> likely benefit lower-end devices more than high-end Intel CPUs :-)
>
> AFAIK the Intel CPUs are masking this icache problem by having an icache
> prefetcher and optimizing how fast the CPU can load/refill from higher
> level caches. Intel CPUs have a lot of HW logic around this, which
> I assume the smaller CPUs don't. E.g. quote from Intel Optimization
> Reference Manual:
>
> "The instruction fetch unit (IFU) can fetch up to 16 bytes of aligned
> instruction bytes each cycle from the instruction cache to the
> instruction length decoder (ILD). The instruction queue (IQ) buffers
> the ILD-processed instructions and can deliver up to four instructions
> in one cycle to the instruction decoder."
>
This does not tell how many cores/threads can fetch 16 bytes per cycle.
With more than 36 execution units per socket, the peak performance of a
single unit does not reflect what happens when all units are busy and
contend for shared resources.
If we want to properly exploit the L1 caches of each execution unit, we
need to split the load into a pipeline. But the number of units depends
on hardware capabilities (like L1 cache size), which is hard to code in
a generic way (Linux kernel).
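
To make the dependency on L1 size concrete, here is a rough userspace
sketch, not kernel code. It greedily groups consecutive pipeline stages
so that each group's code footprint fits in one unit's L1 icache. The
function name, the per-stage footprints and the L1i size are all
hypothetical inputs chosen for illustration; measuring real footprints
is the hard part and is not addressed here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: assign consecutive stages to execution units so
 * that the summed code footprint of each unit stays within l1i_bytes.
 * A stage larger than l1i_bytes gets a unit of its own.
 * Writes the unit index of stage i to unit_of[i] and returns the number
 * of units used (0 when there are no stages).
 */
size_t partition_stages(const size_t *footprint, size_t n,
                        size_t l1i_bytes, size_t *unit_of)
{
    size_t unit = 0, used = 0;

    assert(n == 0 || (footprint != NULL && unit_of != NULL));

    for (size_t i = 0; i < n; i++) {
        /* Start a new unit when adding this stage would overflow L1i. */
        if (used != 0 && used + footprint[i] > l1i_bytes) {
            unit++;
            used = 0;
        }
        used += footprint[i];
        unit_of[i] = unit;
    }
    return n ? unit + 1 : 0;
}
```

The same stage list yields a different unit count for a 32KB and a 64KB
L1i, which is the part that is hard to express generically.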
For example, having the same core handle both RX and TX interrupts is
not the best choice, especially when TX interrupts have to call
expensive callbacks to upper layers (TCP Small Queues).
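
As a minimal sketch of what separating them could look like from
userspace: pick distinct CPUs for the RX and TX interrupt of a queue and
format the single-CPU hex mask that /proc/irq/<irq>/smp_affinity
accepts. The even/odd placement policy and both function names are
assumptions for illustration only, and the mask helper is limited to
CPUs 0..31 for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format the hex bitmask selecting one CPU, as written to
 * /proc/irq/<irq>/smp_affinity. Returns 0 on success, -1 if the CPU is
 * out of the supported range (0..31) or the buffer is too small.
 */
int cpu_affinity_mask(unsigned int cpu, char *buf, size_t len)
{
    int n;

    assert(buf != NULL);
    if (cpu >= 32)
        return -1;
    n = snprintf(buf, len, "%x", 1u << cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical policy: queue q gets its RX interrupt on CPU 2q and its
 * TX interrupt on CPU 2q+1 (both modulo ncpus), so the two never land
 * on the same CPU. Returns -1 when fewer than two CPUs are available.
 */
int split_rx_tx(unsigned int queue, unsigned int ncpus,
                unsigned int *rx_cpu, unsigned int *tx_cpu)
{
    if (ncpus < 2)
        return -1;
    *rx_cpu = (2 * queue) % ncpus;
    *tx_cpu = (2 * queue + 1) % ncpus;
    return 0;
}
```

Note that two logical CPUs may still be hyperthreads of one physical
core and share its L1, so a real policy would have to look at the CPU
topology as well.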