Message-ID: <20250318215926.0a7fd34e@pumpkin>
Date: Tue, 18 Mar 2025 21:59:26 +0000
From: David Laight <david.laight.linux@...il.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: Herton Krzesinski <hkrzesin@...hat.com>, Linus Torvalds
 <torvalds@...ux-foundation.org>, x86@...nel.org, tglx@...utronix.de,
 mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
 linux-kernel@...r.kernel.org, olichtne@...hat.com, atomasov@...hat.com,
 aokuliar@...hat.com
Subject: Re: [PATCH] x86: add back the alignment of the destination to 8
 bytes in copy_user_generic()

On Sun, 16 Mar 2025 12:09:47 +0100
Ingo Molnar <mingo@...nel.org> wrote:

> * Ingo Molnar <mingo@...nel.org> wrote:
> 
> > > It does look good in my testing here, I built same kernel I was 
> > > using for testing the original patch (based on 6.14-rc6), this is 
> > > one of the results I got in one of the runs testing on the same 
> > > machine:
> > > 
> > >              CPU      RATE          SYS          TIME     sender-receiver
> > > Server bind   19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2%
> > > Server bind   21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7%
> > > Server bind   23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7%
> > > Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1%
> > > 
> > > There are still some variations between runs, which is expected and 
> > > matches what I saw when testing my patch and the unaligned case, but 
> > > the results are consistently better/higher than in the no-align case. 
> > > It really looks like it's sufficient to align only for copies of 64 
> > > bytes or more.  
> > 
> > Mind sending a v2 patch with a changelog and these benchmark numbers 
> > added in, and perhaps a Co-developed-by tag with Linus or so?  
> 
> BTW., if you have a test system available, it would be nice to test a 
> server CPU in the Intel spectrum as well. (For completeness mostly, I'd 
> not expect there to be as much alignment sensitivity.)
> 
> The CPU you tested, the AMD Epyc 7742, was launched ~6 years ago, so it's 
> still within the window of microarchitectures we care about. An Intel 
> test from a similar timeframe would be nice as well. Older is probably 
> better in this case, but not too old. :-)

Is that loop doing an aligned 'rep movsq'?

Pretty much all the Intel (non-atom) CPUs have some variant of FSRM.
With FSRM you get double the throughput if the destination is 32-byte aligned.
No other alignment makes any difference.
The cycle cost is per 16/32-byte block, and different families have
different costs for the first few blocks; after that you get 1 block/clock.
That goes all the way back to Sandy Bridge and Ivy Bridge.
I don't think anyone has tried doing that alignment.

I'm sure I've measured misaligned 64-bit writes and got no significant cost.
It might be one extra clock for writes that cross cache-line boundaries.
Misaligned reads are pretty much 'cost free' - just about measurable
on the IP-checksum code loop (and IIRC even when running a
three-reads-every-two-clocks algorithm).

I don't have access to a similar range of AMD chips.

	David

> 
> ( Note that the Intel test is not required to apply the fix IMO - we 
>   did change alignment patterns ~2 years ago in a5624566431d which 
>   regressed. )
> 
> Thanks,
> 
> 	Ingo
> 

