Message-ID: <20250322120213.1420450a@pumpkin>
Date: Sat, 22 Mar 2025 12:02:13 +0000
From: David Laight <david.laight.linux@...il.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, x86@...nel.org,
 hkrzesin@...hat.com, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, hpa@...or.com, olichtne@...hat.com,
 atomasov@...hat.com, aokuliar@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an
 overlapping store

On Thu, 20 Mar 2025 21:24:38 +0100
Mateusz Guzik <mjguzik@...il.com> wrote:

> On Thu, Mar 20, 2025 at 8:33 PM Mateusz Guzik <mjguzik@...il.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 8:23 PM Linus Torvalds
> > <torvalds@...ux-foundation.org> wrote:  
> > >
> > > On Thu, 20 Mar 2025 at 12:06, Mateusz Guzik <mjguzik@...il.com> wrote:  
> > > >
> > > > Sizes in the range <8,64> are copied 8 bytes at a time, with a
> > > > jump out to a byte-at-a-time loop to handle the tail.
> > >
> > > I definitely do not mind this patch, but I think it doesn't go far enough.
> > >
> > > It gets rid of the byte-at-a-time loop at the end, but only for the
> > > short-copy case of 8-63 bytes.
> > >  
> >
> > This bit I can vouch for.
> >  
> > > The .Llarge_movsq ends up still doing
> > >
> > >         testl %ecx,%ecx
> > >         jne .Lcopy_user_tail
> > >         RET
> > >
> > > and while that is only triggered by the non-ERMS case, that's what
> > > most older AMD CPUs will trigger, afaik.
> > >  
> >
> > This bit I can't.
> >
> > Per my other e-mail, it has been several years (around 7 by now, I
> > think) since I was seriously digging in this area, and the details
> > are rather fuzzy.
> >
> > I have a recollection that handling the tail after rep movsq with an
> > overlapping store suffered a penalty big enough to warrant a
> > "normal" copy instead, avoiding the just-written area. I see my old
> > routine $elsewhere makes sure to do it. I don't have sensible hw to
> > bench this on at the moment either.
> >  
> 
> So I did some testing on Sapphire Rapids vs movsq (it is an
> FSRM-enabled sucker so I had to force it to use movsq) and the
> penalty I remembered is still there.
> 
> I patched read1 from will-it-scale to do 128- and 129-byte reads.
> 
> On the stock kernel I see about a 5% throughput drop when adding
> just the one byte.
> 
> On a kernel patched to overlap these with prior movsq stores I get a 10% drop.
> 
> While a CPU without ERMS/FSRM could have a different penalty, this
> very much lines up with my prior experiments back in the day, so I'm
> gonna have to assume this is still a problem for others.

A different penalty for ERMS/FSRM wouldn't surprise me.
If you tested on an Intel CPU it probably supported ERMS/FSRM
(you have to go back to 'core-2' to not have the support).
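
The feature bits are easy enough to check: ERMS is
CPUID.(EAX=7,ECX=0):EBX bit 9 and FSRM is EDX bit 4 of the same leaf,
and Linux reports them as 'erms' and 'fsrm' in /proc/cpuinfo.
A rough sketch, not production code:

	movl	$7,%eax
	xorl	%ecx,%ecx
	cpuid			# leaf 7, subleaf 0
	btl	$9,%ebx		# CF=1 -> ERMS (enhanced rep movsb/stosb)
	setc	%al
	btl	$4,%edx		# CF=1 -> FSRM (fast short rep movsb)
	setc	%dl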

So I suspect you'd need to have tested an AMD CPU for it to matter.
And the report of slow misaligned writes was for a fairly recent
AMD CPU - I think one that didn't report ERMS/FSRM support.
I've only got a zen-5, although I might 'borrow' a piledriver from work.
Definitely no access to anything AMD in between.

I do wonder if the 'medium length' copy loop is needed at all.
The non-ERMS/FSRM case needs 'short copy' and aligned 'rep movsq'
copies, but is the setup cost for 'rep movsq' significant enough
for the 'copy loop' to be worthwhile?

I also suspect the copy loop is 'over-unrolled'.
The max throughput is 8 bytes/clock and Intel CPUs will execute
2-clock loops, so if you can reduce the loop control (etc.)
instructions to ones that will run in two clocks you can do 16 bytes
per iteration. That should be achievable using negative offsets from
the end of the buffer - the loop control is 'add $16,%rcx; js 10b'.
(I've mentioned that before ...)
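
Something along these lines (just a sketch, not the actual kernel
loop; assumes the length in %rcx is a non-zero multiple of 16 and
ignores the fault handling the real code needs):

	addq	%rcx,%rsi	# point src just past the end
	addq	%rcx,%rdi	# point dst just past the end
	negq	%rcx		# offset runs from -len up to 0
10:	movq	0(%rsi,%rcx),%rax
	movq	8(%rsi,%rcx),%rdx
	movq	%rax,0(%rdi,%rcx)
	movq	%rdx,8(%rdi,%rcx)
	addq	$16,%rcx	# loop control: two instructions,
	js	10b		# 16 bytes per iteration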

I've just done some more searching and found a comment that FSRM is
very slow on zen3 for misaligned destinations.
(Possibly to the level of dropping back to byte copies.)
That means that the 'rep movsq' also needs to be aligned.
Which is what started all this.
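
Aligning it is cheap if you borrow the same overlap trick on the way
in; a sketch (assumes at least 16 bytes to copy and, again, none of
the user-copy fault handling the real code needs):

	movq	(%rsi),%rax
	movq	%rax,(%rdi)	# first 8 bytes; dst may be misaligned
	movl	$8,%edx
	movl	%edi,%eax
	andl	$7,%eax		# dst & 7
	subl	%eax,%edx	# advance = 8 - (dst & 7)
	addq	%rdx,%rsi
	addq	%rdx,%rdi	# dst is now 8-byte aligned
	subq	%rdx,%rcx	# bytes still to copy
	movq	%rcx,%rdx
	shrq	$3,%rcx
	rep movsq		# aligned 8-byte copies
	andl	$7,%edx		# 0-7 tail bytes left (handled separately)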

I'm going to have to see if my Sandy Bridge system still works.

	David

> 
> So I stand by only patching the short range for the time being.
> 
> Note that should someone(tm) rework this to use overlapping stores
> with bigger ranges (like memcpy), this will become less of an issue as
> there will be no per-byte copying.
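
(For anyone following along: the "overlapping stores" in question
boil down to finishing the copy with one unaligned 8-byte access that
overlaps whatever was already written, instead of a byte loop.)
A sketch, with %rdx holding the total length, assumed to be >= 8:

	movq	-8(%rsi,%rdx),%rax	# last 8 bytes of src
	movq	%rax,-8(%rdi,%rdx)	# may overlap the previous store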

