[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJmZWFFK43eS4oELFsg9OZu9UJr0Eiqy3KHLM75mv0bsXcrHwA@mail.gmail.com>
Date: Thu, 20 Mar 2025 15:37:39 -0300
From: Herton Krzesinski <hkrzesin@...hat.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: x86@...nel.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, linux-kernel@...r.kernel.org,
torvalds@...ux-foundation.org, olichtne@...hat.com, atomasov@...hat.com,
aokuliar@...hat.com
Subject: Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when
without FSRM/ERMS)
On Thu, Mar 20, 2025 at 3:02 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> On Thu, Mar 20, 2025 at 6:51 PM Herton Krzesinski <hkrzesin@...hat.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik <mjguzik@...il.com> wrote:
> > >
> > > On Thu, Mar 20, 2025 at 3:22 PM Herton R. Krzesinski <herton@...hat.com> wrote:
> > > > diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> > > > index fc9fb5d06174..b8f74d80f35c 100644
> > > > --- a/arch/x86/lib/copy_user_64.S
> > > > +++ b/arch/x86/lib/copy_user_64.S
> > > > @@ -74,6 +74,24 @@ SYM_FUNC_START(rep_movs_alternative)
> > > > _ASM_EXTABLE_UA( 0b, 1b)
> > > >
> > > > .Llarge_movsq:
> > > > + /* Do the first possibly unaligned word */
> > > > +0: movq (%rsi),%rax
> > > > +1: movq %rax,(%rdi)
> > > > +
> > > > + _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
> > > > + _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
> > > > +
> > > > + /* What would be the offset to the aligned destination? */
> > > > + leaq 8(%rdi),%rax
> > > > + andq $-8,%rax
> > > > + subq %rdi,%rax
> > > > +
> > > > + /* .. and update pointers and count to match */
> > > > + addq %rax,%rdi
> > > > + addq %rax,%rsi
> > > > + subq %rax,%rcx
> > > > +
> > > > + /* make %rcx contain the number of words, %rax the remainder */
> > > > movq %rcx,%rax
> > > > shrq $3,%rcx
> > > > andl $7,%eax
> > >
> > > The patch looks fine to me, but there is more to do if you are up for it.
> > >
> > > It was quite some time since I last seriously played with the area and
> > > I don't remember all the details, on top of that realities of uarchs
> > > probably improved.
> > >
> > > That said, have you experimented with aligning the target to 16 bytes
> > > or more bytes?
> >
> > Yes I tried to do 32-byte write aligned on an old Xeon (Sandy Bridge based)
> > and got no improvement at least in the specific benchmark I'm doing here.
> > Also after your question here I tried 16-byte/32-byte on the AMD cpu as
> > well and got no difference from the 8-byte alignment, same bench as well.
> > I tried to do 8-byte alignment for the ERMS case on Intel and got no
> > difference on the systems I tested. I'm not saying it may not improve in
> > some other case, just that in my specific testing I couldn't tell/measure
> > any improvement.
> >
>
> oof, I would not got as far back as Sandy Bridge. ;)
>
> I think Skylake is the oldest yeller to worry about, if one insists on it.
>
> That said, if memory serves right these bufs like to be misaligned to
> weird extents, it very well may be in your tests aligning to 8 had a
> side effect of aligning it to 16 even.
I went back to Sandy Bridge because it was the first one I found
from newer to oldest on the Intel line that did not have erms (and
of course no fsrm). I did change the patch for a 32-byte alignment
before rep movsb in the erms case and tested with a xeon 8280
(cascade lake which is enhanced skylake, so hope is valid here),
but I did not get improvements in my testing. You may be right that
in aligning to 8-byte it might be 16-byte aligned as well, but in practice
it made no difference in my testing.
>
> > >
> > > Moreover, I have some recollection that there were uarchs with ERMS
> > > which also liked the target to be aligned -- as in perhaps this should
> > > be done regardless of FSRM?
> >
> > Where I tested I didn't see improvements but may be there is some case,
> > but I didn't have any.
> >
> > >
> > > And most importantly memset, memcpy and clear_user would all use a
> > > revamp and they are missing rep handling for bigger sizes (I verified
> > > they *do* show up). Not only that, but memcpy uses overlapping stores
> > > while memset just loops over stuff.
> > >
> > > I intended to sort it out long time ago and maybe will find some time
> > > now that I got reminded of it, but I would be deligthed if it got
> > > picked up.
> > >
> > > Hacking this up is just some screwing around, the real time consuming
> > > part is the benchmarking so I completely understand if you are not
> > > interested.
> >
> > Yes, the most time you spend is on benchmarking. May be later I could
> > try to take a look but will not put any promises on it.
> >
>
> Now I'm curious enough what's up here. If I don't run out of steam,
> I'm gonna cover memset and memcpy myself.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>
Powered by blists - more mailing lists