Message-ID: <CAGudoHHjk6PXU33CqAX6oitbEED-jzs2p=Dfm5-q_xMy5Cy_Uw@mail.gmail.com>
Date: Fri, 21 Mar 2025 21:10:39 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: x86@...nel.org, hkrzesin@...hat.com, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, olichtne@...hat.com,
atomasov@...hat.com, aokuliar@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an
overlapping store
On Fri, Mar 21, 2025 at 12:53 AM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Thu, 20 Mar 2025 at 14:17, Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > On Thu, 20 Mar 2025 at 12:33, Mateusz Guzik <mjguzik@...il.com> wrote:
> > >
> > > I have a recollection that handling the tail after rep movsq with an
> > > overlapping store was suffering a penalty big enough to warrant a
> > > "normal" copy instead, avoiding the just written to area.
> >
> > Ahh. Good point. The rep movsq might indeed end up having odd effects
> > with subsequent aliasing memory operations.
> >
> > Consider myself convinced.
>
> Actually, I think there's a solution for this.
>
> Do not do the last 0-7 bytes as a word that overlaps with the tail of
> the 'rep movs'
>
> Do the last 8-15 bytes *non-overlapping* (well, they overlap each
> other, but not the 'rep movs')
>
> Something UNTESTED like the appended, in other words. The large case
> then ends up without any conditionals, looking something like this:
>
> mov %rcx,%rax
> shr $0x3,%rcx
> dec %rcx
> and $0x7,%eax
> rep movsq %ds:(%rsi),%es:(%rdi)
> mov (%rsi),%rcx
> mov %rcx,(%rdi)
> mov (%rsi,%rax,1),%rcx
> mov %rcx,(%rdi,%rax,1)
> xor %ecx,%ecx
> ret
>
> with some added complexity - but not a lot - in the exception fixup cases.
>
> This is once again intentionally whitespace-damaged, because I don't
> want people applying this mindlessly. Somebody needs to double-check
> my logic, and verify that this also avoids the cost from the aliasing
> with the rep movs.
>
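For reference, the sequence above comes out to roughly the following in
C (my sketch only; the function name, the #includes and the memcpy()
stand-ins for rep movsq and the 8-byte moves are mine, not the actual
kernel code):

#include <stddef.h>
#include <string.h>

/*
 * Assumes len >= 8 (the real routine only takes this path for much
 * larger sizes).
 */
static void copy_large(void *dst, const void *src, size_t len)
{
        size_t tail = len & 7;          /* 0..7 leftover bytes */
        size_t quads = (len >> 3) - 1;  /* quadword count for rep movsq */
        char *d = dst;
        const char *s = src;
        unsigned long long tmp;

        memcpy(d, s, quads * 8);        /* stand-in for rep movsq */
        d += quads * 8;
        s += quads * 8;

        memcpy(&tmp, s, 8);             /* first tail word: starts right */
        memcpy(d, &tmp, 8);             /* after the rep movsq area */

        memcpy(&tmp, s + tail, 8);      /* last tail word: ends at dst + len; */
        memcpy(d + tail, &tmp, 8);      /* it overlaps the previous store by
                                           8 - tail bytes, but never the area
                                           written by rep movsq */
}
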
I think the code works, but I find the idea questionable.
I'm worried about what happens when the consumer has a nicely aligned
target buffer with a nice size (say, a multiple of a cache line) which
this now unconditionally messes with -- are some uarchs going to
suffer?

That aside, this executes for any size >= 64 bytes. The startup
latency of rep movsq is pretty high and the 64-byte threshold is
probably too low even without the patch. For size 64 in particular
there will be only 48 bytes handled by rep movsq followed by 16 bytes
of regular stores, largely defeating the point of avoiding the regular
mov loop. If going this way, I would bump the threshold for punting to
rep movsq by at least 16 bytes. I would bench it myself, but I don't
have appropriate hw to test this sucker on.

All in all, for the time being I stand by not messing with this bit
(i.e., sticking to just my patch). A serious benchmarking effort should
instead accompany rewriting the routine to begin with (along with
sorting out memset and memcpy to both use overlapping stores and rep
as needed).
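
To spell out what I mean by overlapping stores there, a small-size
class would be handled roughly like this (again my sketch with a
made-up helper name, not existing kernel code; memcpy() of a constant
8 is just the C spelling of an unaligned 8-byte load/store):

#include <stddef.h>
#include <string.h>

/*
 * Copy 8..16 bytes as two 8-byte words.  The second store may overlap
 * the first by up to 8 bytes, so all len bytes are covered without a
 * per-byte tail loop or a branch on the exact size.
 */
static void copy_8_to_16(void *dst, const void *src, size_t len)
{
        unsigned long long head, tail;

        memcpy(&head, src, 8);
        memcpy(&tail, (const char *)src + len - 8, 8);
        memcpy(dst, &head, 8);
        memcpy((char *)dst + len - 8, &tail, 8);
}

The same trick scales up with wider size classes before punting to rep.
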
However, if you are looking to commit this thing anyway, feel free to
roll my change into it without credit -- it is pretty stock standard
for this kind of work.

--
Mateusz Guzik <mjguzik gmail.com>