Message-ID: <20250320213145.6d016e21@pumpkin>
Date: Thu, 20 Mar 2025 21:31:45 +0000
From: David Laight <david.laight.linux@...il.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Herton Krzesinski <hkrzesin@...hat.com>, x86@...nel.org,
 tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, hpa@...or.com, linux-kernel@...r.kernel.org,
 torvalds@...ux-foundation.org, olichtne@...hat.com, atomasov@...hat.com,
 aokuliar@...hat.com
Subject: Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic
 (when without FSRM/ERMS)

On Thu, 20 Mar 2025 19:02:21 +0100
Mateusz Guzik <mjguzik@...il.com> wrote:

> On Thu, Mar 20, 2025 at 6:51 PM Herton Krzesinski <hkrzesin@...hat.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik <mjguzik@...il.com> wrote:  
...
> > > That said, have you experimented with aligning the target to 16 bytes
> > > or more bytes?  
> >
> > Yes I tried to do 32-byte write aligned on an old Xeon (Sandy Bridge based)
> > and got no improvement at least in the specific benchmark I'm doing here.
> > Also after your question here I tried 16-byte/32-byte on the AMD cpu as
> > well and got no difference from the 8-byte alignment, same bench as well.
> > I tried to do 8-byte alignment for the ERMS case on Intel and got no
> > difference on the systems I tested. I'm not saying it may not improve in
> > some other case, just that in my specific testing I couldn't tell/measure
> > any improvement.
> >  
> 
> oof, I would not go as far back as Sandy Bridge. ;)

It is a boundary point.
Agner's tables (fairly reliable) have:

Sandy Bridge (page 222):
  MOVS      5      4
  REP MOVS  2n     1.5n    worst case
  REP MOVS  3/16B  1/16B   best case

which is the same as Ivy Bridge - which you'd sort of expect, since
Ivy Bridge is a minor update; Agner's tables list the same values for it.
Haswell jumps to 1/32B (32 bytes/clock best case instead of 16).

I didn't test Sandy Bridge (I've got one, but it's powered off), but I did
test Ivy Bridge. Neither source nor destination alignment made any
difference at all.

As I said earlier, the only alignment that made any difference was 32-byte
aligning the destination on Haswell (and later).
That is needed to get 32 bytes/clock rather than 16 bytes/clock.
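
For illustration, a minimal user-space sketch of the "align the destination"
idea (not the real copy_user_generic asm, which has fault handling;
copy_align32 is just an illustrative name): copy a short unaligned head
until the destination reaches a 32-byte boundary, then do the bulk copy
with the stores aligned.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *copy_align32(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head = (32 - ((uintptr_t)d & 31)) & 31;

	if (head > len)
		head = len;

	/* Unaligned head: bring the destination up to a 32-byte boundary. */
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	/* Aligned bulk: the stores now start on a 32-byte boundary,
	 * which is what lets Haswell+ sustain 32 bytes/clock. */
	while (len >= 32) {
		memcpy(d, s, 32);	/* the compiler can turn this into wide moves */
		d += 32;
		s += 32;
		len -= 32;
	}

	/* Tail. */
	memcpy(d, s, len);
	return dst;
}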

> 
> I think Skylake is the oldest yeller to worry about, if one insists on it.
> 
> That said, if memory serves right these bufs like to be misaligned to
> weird extents, it very well may be in your tests aligning to 8 had a
> side effect of aligning it to 16 even.
> 
> > >
> > > Moreover, I have some recollection that there were uarchs with ERMS
> > > which also liked the target to be aligned -- as in perhaps this should
> > > be done regardless of FSRM?  

Dunno, the only report is of some AMD CPU being slow with misaligned writes.
But that is in the copy loop, not 'rep movsq'.
I don't have one to test.

> >
> > Where I tested I didn't see improvements but may be there is some case,
> > but I didn't have any.
> >  
> > >
> > > And most importantly memset, memcpy and clear_user would all use a
> > > revamp and they are missing rep handling for bigger sizes (I verified
> > > they *do* show up). Not only that, but memcpy uses overlapping stores
> > > while memset just loops over stuff.
> > >
> > > I intended to sort it out long time ago and maybe will find some time
> > > now that I got reminded of it, but I would be delighted if it got
> > > picked up.
> > >
> > > Hacking this up is just some screwing around, the real time consuming
> > > part is the benchmarking so I completely understand if you are not
> > > interested.  
> >
> > Yes, the most time you spend is on benchmarking. May be later I could
> > try to take a look but will not put any promises on it.

I found I needed to use a performance counter to get a proper cycle count,
but then read the register directly to avoid all the 'library' overhead,
with lfence/mfence on both sides of the cycle-count read.
After subtracting the overhead of a 'null function' I could measure the
number of clocks each operation took, so I could tell when I was actually
getting 32 bytes copied per clock.

(Or, when testing the IP checksum code, the number of bytes/clock - which can get to 12.)
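
Roughly along these lines - just a user-space sketch, assuming the counter
is already programmed to count core clock cycles and user-space rdpmc is
enabled (measure_clocks/null_function are illustrative names, not the code
I actually ran):

#include <stdint.h>
#include <x86intrin.h>		/* __rdpmc(), _mm_lfence() */

/* Fenced read of a performance counter so neighbouring instructions
 * can't drift into the measured window. */
static inline uint64_t fenced_rdpmc(int ctr)
{
	uint64_t v;

	_mm_lfence();
	v = __rdpmc(ctr);
	_mm_lfence();
	return v;
}

/* Empty function used to measure the harness overhead itself. */
__attribute__((noinline)) static void null_function(void) { }

/* Clocks taken by fn(), with the measurement overhead subtracted. */
uint64_t measure_clocks(void (*fn)(void), int ctr)
{
	uint64_t t0, t1, overhead;

	t0 = fenced_rdpmc(ctr);
	null_function();
	t1 = fenced_rdpmc(ctr);
	overhead = t1 - t0;

	t0 = fenced_rdpmc(ctr);
	fn();
	t1 = fenced_rdpmc(ctr);
	return (t1 - t0) - overhead;
}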

	David

> >  
> 
> Now I'm curious enough what's up here. If I don't run out of steam,
> I'm gonna cover memset and memcpy myself.
> 

