Date:	Mon, 6 Jul 2015 08:54:59 +0000
From:	David Laight <David.Laight@...LAB.COM>
To:	'Eric Dumazet' <eric.dumazet@...il.com>
CC:	Joe Perches <joe@...ches.com>,
	Nicolae Rosia <nicolae.rosia@...tsign.ro>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	Nicolas Ferre <nicolas.ferre@...el.com>
Subject: RE: [PATCH net-next] net: macb: replace literal constant with
 NET_IP_ALIGN

From: Eric Dumazet [mailto:eric.dumazet@...il.com]
> Sent: 03 July 2015 17:39
> On Fri, 2015-07-03 at 16:18 +0000, David Laight wrote:
> 
> > Even on x86 aligning the ethernet receive data on a 4n+2
> > boundary is likely to give marginally better performance
> > than aligning on a 4n boundary.
> 
> You are coming late to the party.

I've been to many parties at many different times....

Going back many years, Sun's original sbus DMA part generated a lot
of single sbus transfers for 4n+2 aligned buffers - so it was necessary
to do a 'realignment' copy. The later DMA+ (definitely the DMA2) part
did sbus burst transfers even when the buffer was 4n+2 aligned.
So with the later parts you could correctly align the buffer.
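For anyone following along, the reason a 4n+2 buffer start counts as 'correctly aligned' is that the 14-byte ethernet header then leaves the IP header on a 4-byte boundary. A minimal sketch of that arithmetic, assuming an untagged frame; ETH_HEADER_LEN and both helper names are made up for illustration and are not kernel identifiers:

```c
#include <stdbool.h>

#define ETH_HEADER_LEN 14u  /* dst MAC + src MAC + ethertype, no VLAN tag */

/*
 * Offset of the IP header from a 4-byte-aligned buffer start, given the
 * number of pad bytes placed before the ethernet frame (the role that
 * NET_IP_ALIGN plays in a driver).
 */
static unsigned int ip_header_offset(unsigned int pad)
{
    return pad + ETH_HEADER_LEN;
}

/* True when the IP header lands on a 32-bit boundary. */
static bool ip_header_aligned(unsigned int pad)
{
    return ip_header_offset(pad) % 4 == 0;
}
```

With pad = 2 the IP header sits at offset 16; with pad = 0 it sits at offset 14, so every 32-bit field in the IP header is misaligned.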

> Intel guys decided to change NET_IP_ALIGN to 0 (it was 2 in the past)
...
>     x86: Align skb w/ start of cacheline on newer core 2/Xeon Arch
> 
>     x86 architectures can handle unaligned accesses in hardware, and it has
>     been shown that unaligned DMA accesses can be expensive on Nehalem
>     architectures.  As such we should overwrite NET_IP_ALIGN to resolve
>     this issue.

My 2 cents:

I'd have thought it would depend on the nature of the 'DMA' requests
generated by the hardware - so it is ethernet hardware dependent.

The above may be correct for PCI masters - especially those that do
paired 16bit accesses for every 32bit word.
If the hardware generated cache line aligned PCI bursts I wouldn't
have thought it would matter.

I doubt it is valid for PCIe transfers - where the ethernet frame
will be split into (probably) 128-byte TLPs. Even if it starts on
a 64n+2 boundary the splits will be on 64-byte boundaries, since the
first and last 32-bit words of the TLP have separate byte enables.
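To make the byte-enable point concrete: a memory write TLP carries a 32-bit-aligned address, a length in 32-bit words, and 4-bit enables for the first and last word. A rough sketch of how those enables would come out for a write of len bytes at byte address addr. The struct and function names are hypothetical, and this only models the enables, not where a particular NIC chooses to split its TLPs:

```c
#include <stdint.h>

struct dw_byte_enables {
    uint32_t ndw;      /* number of 32-bit words the write touches */
    uint8_t  first_be; /* bit i set => byte i of the first word is written */
    uint8_t  last_be;  /* same for the last word; 0 when ndw == 1 */
};

static struct dw_byte_enables byte_enables(uint64_t addr, uint32_t len)
{
    struct dw_byte_enables be = { 0, 0, 0 };

    if (len == 0)
        return be;

    uint64_t end = addr + len - 1;  /* address of the last byte written */

    be.ndw = (uint32_t)(end / 4 - addr / 4 + 1);
    /* Enable bytes from (addr % 4) up to byte 3 of the first word. */
    be.first_be = (uint8_t)((0xF << (addr % 4)) & 0xF);

    if (be.ndw == 1) {
        /* Single word: also mask off bytes beyond the end of the write. */
        be.first_be &= (uint8_t)(0xF >> (3 - end % 4));
    } else {
        /* Enable bytes 0 up to (end % 4) of the last word. */
        be.last_be = (uint8_t)(0xF >> (3 - end % 4));
    }
    return be;
}
```

So a write that starts at 64n+2 and runs to the next 64-byte boundary needs only a first-word enable of 0xC (bytes 2 and 3); the address and the far end of the TLP stay word aligned.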

So I'd expect to see a cache line RMW for the first and last cache
lines. That may, or may not, be slower than the misaligned accesses
for the entire frame (1 clock data delay per access?).
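As a rough illustration of how few cache lines are affected, the sketch below counts fully and partially written lines for a frame of len bytes placed at byte offset start in a cache-line-aligned buffer. It assumes 64-byte lines; CACHE_LINE, struct line_usage and count_lines() are names invented for this example:

```c
#include <stddef.h>

#define CACHE_LINE 64u  /* assumed cache line size in bytes */

struct line_usage {
    size_t full;     /* lines written in their entirety */
    size_t partial;  /* lines only partly written (candidate RMWs) */
};

static struct line_usage count_lines(size_t start, size_t len)
{
    struct line_usage u = { 0, 0 };

    if (len == 0)
        return u;

    size_t end = start + len;  /* one past the last byte written */
    size_t total = (end - 1) / CACHE_LINE - start / CACHE_LINE + 1;
    size_t head_partial = (start % CACHE_LINE) != 0;
    size_t tail_partial = (end % CACHE_LINE) != 0;

    if (total == 1)
        u.partial = (head_partial || tail_partial) ? 1 : 0;
    else
        u.partial = head_partial + tail_partial;

    u.full = total - u.partial;
    return u;
}
```

For a 1514-byte frame, count_lines(2, 1514) gives 22 full lines and 2 partial ones, while count_lines(0, 1514) gives 23 full and 1 partial. Whichever alignment is used, the tail line is partial unless the transfer is rounded up to the end of the line.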

Of course, modern nics will write 2 bytes of 'crap' before the frame.
Rounding up the transfer to the end of a cache line might also help
(especially if only a few extra words are needed).

	David
