lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 5 Jan 2024 23:52:45 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Linus Torvalds' <torvalds@...ux-foundation.org>
CC: Noah Goldstein <goldstein.w.n@...il.com>, kernel test robot
	<lkp@...el.com>, "x86@...nel.org" <x86@...nel.org>,
	"oe-kbuild-all@...ts.linux.dev" <oe-kbuild-all@...ts.linux.dev>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"edumazet@...gle.com" <edumazet@...gle.com>, "tglx@...utronix.de"
	<tglx@...utronix.de>, "mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de"
	<bp@...en8.de>, "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
	"hpa@...or.com" <hpa@...or.com>
Subject: RE: x86/csum: Remove unnecessary odd handling

From: Linus Torvalds
> Sent: 05 January 2024 18:06
> 
> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@...lab.com> wrote:
> >
> > Interesting, I'm pretty sure trying to get two blocks of
> >  'adc' scheduled in parallel like that doesn't work.
> 
> You should check out the benchmark at
> 
>        https://github.com/fenrus75/csum_partial
> 
> and see if you can improve on it. I'm including the patch (on top of
> that code by Arjan) to implement the actual current kernel version as
> "New version".

I'd have to fix his benchmark code first :-)
You can't use the TSC unless you lock the cpu frequency.
The longer the test runs for the faster the cpu will run.
I've had to use the performance counters to get accurate cycle counts.

I'm also sure I tried two separate 'adc' chains and failed
to get any overlapped instrictions.
But I'll try his loop in my framework.

The 'lfence' version also really does matter.
In some sense it doesn't matter if you add 10 clocks to the
IP header checksum - they'll be dwarfed by everything else.
But if you do have to checksum an 8k NFS UDP datagram the
loop count matters.

The OOO engine is very good a just piling up instructions
that are waiting for previous instructions in a big queue
and executing instruction for which the data is available.
So the control flow for the checksum code can finish well
before the checksum is available.

On a related point, do you remember what the 'killer app'
was for doing the checksum in copy_to/from_user?

The change pre-dates git and I can't find the commit message.
The only place it is done in the receive path is for UDP.
At a guess that helped 8k UDP datagrams going into a userspace nfsd?
I bet nothing really depends on RX UDP performance to userspace
on systems that don't do hw checksum?

The tx side is done copying data into an skb, I'm not sure if
that is all copies - you really don't want to do it if the
hardware supports tx checksum offload.
Using 'rep movsb' for copy_from_user() has  to be faster than
the I-cache busting 'copy and checksum' code - even if you
end up doing a checksum not much later on.

IIRC (I looked a couple of weeks ago) only x86, sparc32 and m68k
actually have 'copy and checksum' (and the m68k didn't seem to
have the markers for faults!).
(All of them are I-cache killers)
The default iovec[] version checksums every fragment - which
gives grief trying to report a 32bit checksum and EFAULT.
OTOH it could sum the linear kernel buffer after the copy.
That would let it fold the checksum to 17bits and report -1
for EFAULT.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ