Message-ID: <CAGudoHGNFT+LC24842ZKLWBxD3vvvddBqDKa6gkixN4Esor+RQ@mail.gmail.com>
Date: Thu, 20 Mar 2025 20:33:35 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: x86@...nel.org, hkrzesin@...hat.com, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, olichtne@...hat.com,
atomasov@...hat.com, aokuliar@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an
overlapping store
On Thu, Mar 20, 2025 at 8:23 PM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Thu, 20 Mar 2025 at 12:06, Mateusz Guzik <mjguzik@...il.com> wrote:
> >
> > Sizes ranged <8,64> are copied 8 bytes at a time with a jump out to a
> > 1 byte at a time loop to handle the tail.
>
> I definitely do not mind this patch, but I think it doesn't go far enough.
>
> It gets rid of the byte-at-a-time loop at the end, but only for the
> short-copy case of 8-63 bytes.
>
This bit I can vouch for.
> The .Llarge_movsq ends up still doing
>
> testl %ecx,%ecx
> jne .Lcopy_user_tail
> RET
>
> and while that is only triggered by the non-ERMS case, that's what
> most older AMD CPU's will trigger, afaik.
>
This bit I can't.

Per my other e-mail, it has been several years (around 7 by now, I
think) since I was seriously digging in this area, and the details are
rather fuzzy.

I recall that handling the tail after rep movsq with an overlapping
store suffered a penalty big enough to warrant a "normal" copy instead,
one that avoids the just-written area. I see my old routine $elsewhere
makes sure to do it. I don't have sensible hw to bench this on at the
moment either.

That said, if you insist on it, I'll repost v2 with the change (I'm
going to *test* it of course, just not bench. :>)
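
For readers unfamiliar with the trick under discussion, here is a
minimal C sketch of handling a copy tail with one overlapping store
instead of a byte-at-a-time loop. It is not the kernel's
rep_movs_alternative() and says nothing about the penalty after rep
movsq mentioned above. The helper name copy_overlapping_tail is
hypothetical, and the sketch assumes len >= 8 and that the source and
destination buffers do not overlap each other.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration only: copy len bytes (assumed len >= 8,
 * buffers assumed not to overlap each other) in 8-byte words, then
 * finish with one 8-byte store that ends exactly at dst + len.
 */
static void copy_overlapping_tail(unsigned char *dst,
				  const unsigned char *src, size_t len)
{
	uint64_t word;
	size_t off;

	/* Main loop: whole 8-byte words. */
	for (off = 0; off + 8 <= len; off += 8) {
		memcpy(&word, src + off, 8);
		memcpy(dst + off, &word, 8);
	}

	/*
	 * Tail of 1..7 bytes: redo the last 8 bytes of the buffer. This
	 * rewrites some bytes the loop already stored, with the same
	 * values, so the result is unchanged.
	 */
	if (off != len) {
		memcpy(&word, src + len - 8, 8);
		memcpy(dst + len - 8, &word, 8);
	}
}
```

For len = 13, the loop copies bytes 0-7 and the tail store covers bytes
5-12, so bytes 5-7 are written twice.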
--
Mateusz Guzik <mjguzik gmail.com>