Message-ID: <063D6719AE5E284EB5DD2968C1650D6D1CB5C530@AcuExch.aculab.com>
Date: Mon, 6 Jul 2015 08:54:59 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Eric Dumazet' <eric.dumazet@...il.com>
CC: Joe Perches <joe@...ches.com>,
Nicolae Rosia <nicolae.rosia@...tsign.ro>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
'Nicolas Ferre' <nicolas.ferre@...el.com>
Subject: RE: [PATCH net-next] net: macb: replace literal constant with NET_IP_ALIGN
From: Eric Dumazet [mailto:eric.dumazet@...il.com]
> Sent: 03 July 2015 17:39
> On Fri, 2015-07-03 at 16:18 +0000, David Laight wrote:
>
> > Even on x86 aligning the ethernet receive data on a 4n+2
> > boundary is likely to give marginally better performance
> > than aligning on a 4n boundary.
>
> You are coming late to the party.
I've been to many parties at many different times....
Going back many years, Sun's original sbus DMA part generated a lot
of single sbus transfers for 4n+2 aligned buffers - so it was necessary
to do a 'realignment' copy. The later DMA+ (definitely the DMA2) part
did sbus burst transfers even when the buffer was 4n+2 aligned.
So with the later parts you could correctly align the buffer.
> Intel guys decided to change NET_IP_ALIGN to 0 (it was 2 in the past)
...
> x86: Align skb w/ start of cacheline on newer core 2/Xeon Arch
>
> x86 architectures can handle unaligned accesses in hardware, and it has
> been shown that unaligned DMA accesses can be expensive on Nehalem
> architectures. As such we should overwrite NET_IP_ALIGN to resolve
> this issue.
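(Aside, for anyone following along: a minimal sketch of what that
NET_IP_ALIGN value actually controls on the rx path.
netdev_alloc_skb_ip_align()/skb_reserve() are the standard helpers;
'dev', 'len' and the surrounding context are only illustrative.)

	skb = netdev_alloc_skb_ip_align(dev, len);
	/* essentially shorthand for:
	 *	skb = netdev_alloc_skb(dev, len + NET_IP_ALIGN);
	 *	if (skb)
	 *		skb_reserve(skb, NET_IP_ALIGN);
	 * On most arches NET_IP_ALIGN is 2, so the 14-byte ethernet
	 * header starts on a 4n+2 boundary and the IP header on 4n;
	 * with the x86 override (NET_IP_ALIGN == 0) DMA just starts
	 * at the naturally aligned skb->data.
	 */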
My 2 cents:
I'd have thought it would depend on the nature of the 'DMA' requests
generated by the hardware - so it is ethernet-hardware dependent.
The above may be correct for PCI masters - especially those that do
paired 16-bit accesses for every 32-bit word.
If the hardware generated cache line aligned PCI bursts I wouldn't
have thought it would matter.
I doubt it is valid for PCIe transfers - where the ethernet frame
will be split into (probably) 128-byte TLPs. Even if it starts on
a 64n+2 boundary, the splits will be on 64-byte boundaries since the
first and last 32-bit words of the TLP have separate byte enables.
So I'd expect to see a cache-line RMW for the first and last cache
lines - that may, or may not, be slower than the misaligned accesses
for the entire frame (1 clock data delay per access?).
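To put rough numbers on that (purely an illustration - 64-byte cache
lines assumed, lines_touched() is a made-up helper, not driver code):

	#define CL 64	/* assumed cache line size */

	/* full vs. partial cache lines touched by a write of 'len'
	 * bytes starting 'off' bytes into a cache line */
	static void lines_touched(unsigned int off, unsigned int len,
				  unsigned int *full, unsigned int *partial)
	{
		unsigned int end = off + len;
		unsigned int first = (off + CL - 1) / CL;
		unsigned int last = end / CL;

		*full = last > first ? last - first : 0;
		*partial = (off % CL != 0) + (end % CL != 0);
	}

	/* off = 2, len = 1514 (max frame): 22 full lines plus 2
	 * partial ones - it is those 2 partial lines that would need
	 * the read-modify-write. */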
Of course, modern NICs will write 2 bytes of 'crap' before the frame.
Rounding up the transfer to the end of a cache line might also help
(especially if only a few extra words are needed).
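Something along these lines (again only a sketch - CACHE_LINE and
round_dma_len() are made up, the arithmetic is the point):

	#define CACHE_LINE 64

	/* round the DMA length up so the final write fills its cache
	 * line; assumes the extra tail bytes in the buffer may be
	 * clobbered, just like the 2 bytes written before the frame */
	static inline unsigned int round_dma_len(unsigned int off, unsigned int len)
	{
		return ((off + len + CACHE_LINE - 1) & ~(CACHE_LINE - 1)) - off;
	}

	/* e.g. off = 2, len = 1514 rounds up to 1534, so the last
	 * write ends exactly on a cache line boundary */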
David