Message-ID: <CAGudoHFqGfiGPf2ZkVeAqco+0BD2G72_TSGCz29dP_tvwQN0NQ@mail.gmail.com>
Date: Thu, 20 Mar 2025 21:24:38 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: x86@...nel.org, hkrzesin@...hat.com, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, olichtne@...hat.com,
atomasov@...hat.com, aokuliar@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an
overlapping store
On Thu, Mar 20, 2025 at 8:33 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> On Thu, Mar 20, 2025 at 8:23 PM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > On Thu, 20 Mar 2025 at 12:06, Mateusz Guzik <mjguzik@...il.com> wrote:
> > >
> > > Sizes in the <8,64> range are copied 8 bytes at a time, with a jump
> > > out to a byte-at-a-time loop to handle the tail.
> >
> > I definitely do not mind this patch, but I think it doesn't go far enough.
> >
> > It gets rid of the byte-at-a-time loop at the end, but only for the
> > short-copy case of 8-63 bytes.
> >
>
> This bit I can vouch for.
>
> > The .Llarge_movsq ends up still doing
> >
> > testl %ecx,%ecx
> > jne .Lcopy_user_tail
> > RET
> >
> > and while that is only triggered by the non-ERMS case, that's what
> > most older AMD CPUs will trigger, afaik.
> >
>
> This bit I can't.
>
> Per my other e-mail it has been several years since I was seriously
> digging in this area (around 7 by now, I think) and the details are
> rather fuzzy.
>
> I have a recollection that handling the tail after rep movsq with an
> overlapping store suffered a penalty big enough to warrant a "normal"
> copy instead, one avoiding the just-written area. I see my old
> routine $elsewhere makes sure to do exactly that. I don't have
> sensible hw to bench this on at the moment either.
>
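For reference, here is a hedged C sketch of the overlapping-store tail
trick the patch applies to the <8,64> range (the real routine is x86
assembly in rep_movs_alternative; the function name and layout below
are made up purely for illustration):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: copy len bytes, 8 <= len < 64, with no byte loop.
 * The final 8-byte copy is anchored to the *end* of the buffer, so it
 * may overlap up to 7 bytes already written by the main loop. */
static void copy_8_to_63(void *dst, const void *src, size_t len)
{
        char *d = dst;
        const char *s = src;
        uint64_t v;
        size_t i;

        /* main loop: 8 bytes at a time, may leave a 1..7 byte tail */
        for (i = 0; i + 8 <= len; i += 8) {
                memcpy(&v, s + i, 8);   /* unaligned 8-byte load */
                memcpy(d + i, &v, 8);   /* unaligned 8-byte store */
        }

        if (i == len)
                return;

        /* tail: one 8-byte copy ending at the last byte, overlapping
         * whatever the loop above already wrote */
        memcpy(&v, s + len - 8, 8);
        memcpy(d + len - 8, &v, 8);
}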
So I did some testing on Sapphire Rapids vs movsq (it is an
FSRM-enabled sucker, so I had to force it to use movsq) and the
penalty I remembered is still there.
I patched read1 from will-it-scale to do 128- and 129-byte reads.
On the stock kernel I see about a 5% drop in throughput when adding
just the one byte.
On a kernel patched to overlap the tail with the prior movsq stores I
get a 10% drop.
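(For the curious: the patched testcase more or less boils down to
hammering read(2) with a 128- vs 129-byte buffer so that the kernel's
copy-to-user tail handling dominates. A hypothetical standalone
equivalent, not the actual will-it-scale source:)

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define BUFLEN 129      /* compare against 128 */
#define ITERS  (1 << 24)

int main(void)
{
        char buf[BUFLEN];
        char page[4096] = { 0 };
        struct timespec t0, t1;
        long i;
        int fd;

        /* back the reads with a regular page-cache page so each
         * pread() goes through copy_to_user */
        fd = open("/tmp/read1-bench", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0 || write(fd, page, sizeof(page)) != sizeof(page))
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERS; i++) {
                if (pread(fd, buf, BUFLEN, 0) != BUFLEN)
                        return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.0f reads/sec\n", ITERS /
               ((t1.tv_sec - t0.tv_sec) +
                (t1.tv_nsec - t0.tv_nsec) / 1e9));
        close(fd);
        return 0;
}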
While a CPU without ERMS/FSRM could have a different penalty, this
very much lines up with my prior experiments back in the day, so I'm
going to have to assume this is still a problem for others.
So I stand by only patching the short range for the time being.
Note that should someone(tm) rework this to use overlapping stores
for bigger ranges (like memcpy does), this will become less of an
issue, as there will be no per-byte copying.
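To illustrate, a hedged sketch of the idea (hypothetical C, not actual
kernel code): for, say, 16 < len <= 32, two overlapping 16-byte copies
cover the whole range and a byte-granular tail never exists:

#include <stddef.h>
#include <string.h>

/* Illustrative only: memcpy-style overlapping copy for 16 < len <= 32.
 * The head and tail chunks overlap somewhere in the middle, so every
 * byte is covered by one of two fixed-size copies. */
static void copy_17_to_32(void *dst, const void *src, size_t len)
{
        char head[16], tail[16];

        memcpy(head, src, 16);
        memcpy(tail, (const char *)src + len - 16, 16);
        memcpy(dst, head, 16);
        memcpy((char *)dst + len - 16, tail, 16);
}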
--
Mateusz Guzik <mjguzik gmail.com>