Message-ID: <20250320213145.6d016e21@pumpkin>
Date: Thu, 20 Mar 2025 21:31:45 +0000
From: David Laight <david.laight.linux@...il.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Herton Krzesinski <hkrzesin@...hat.com>, x86@...nel.org,
tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, linux-kernel@...r.kernel.org,
torvalds@...ux-foundation.org, olichtne@...hat.com, atomasov@...hat.com,
aokuliar@...hat.com
Subject: Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic
(when without FSRM/ERMS)
On Thu, 20 Mar 2025 19:02:21 +0100
Mateusz Guzik <mjguzik@...il.com> wrote:
> On Thu, Mar 20, 2025 at 6:51 PM Herton Krzesinski <hkrzesin@...hat.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik <mjguzik@...il.com> wrote:
...
> > > That said, have you experimented with aligning the target to
> > > 16 bytes or more?
> >
> > Yes, I tried a 32-byte-aligned write on an old Xeon (Sandy Bridge based)
> > and got no improvement, at least in the specific benchmark I'm running here.
> > Also, after your question here, I tried 16-byte/32-byte on the AMD CPU as
> > well and got no difference from the 8-byte alignment, same bench as well.
> > I tried 8-byte alignment for the ERMS case on Intel and got no
> > difference on the systems I tested. I'm not saying it may not improve in
> > some other case, just that in my specific testing I couldn't tell/measure
> > any improvement.
> >
>
> oof, I would not go as far back as Sandy Bridge. ;)
It is a boundary point.
Agner's tables (fairly reliable) have, for Sandy Bridge (page 222):

	MOVS      5      4
	REP MOVS  2n     1.5n   worst case
	REP MOVS  3/16B  1/16B  best case

Ivy Bridge has the same values - which you'd sort of expect, since
Ivy Bridge is only a minor update.
Haswell jumps to 1/32B.
I didn't test Sandy Bridge (I've got one, powered off), but did test Ivy Bridge.
Neither the source nor destination alignment made any difference at all.
As I said earlier, the only alignment that made any difference was 32-byte
aligning the destination on Haswell (and later); that is needed to get
32 bytes/clock rather than 16 bytes/clock.
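For reference, the destination fixup itself is just a short head copy
before the bulk loop. A minimal C sketch of the idea (untested, names
hypothetical; the real kernel code is asm and has to handle user-access
faults):

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/* Copy the few head bytes needed to bring the destination up to
	 * an 'align' boundary (a power of two: 8 for the copy loop, 32
	 * for the Haswell 'rep movs' case), then run the bulk copy fully
	 * aligned.  memcpy() stands in for the bulk loop / 'rep movsq'.
	 */
	static void copy_dst_aligned(void *dst, const void *src, size_t len,
				     size_t align)
	{
		unsigned char *d = dst;
		const unsigned char *s = src;
		size_t head = -(uintptr_t)d & (align - 1);

		if (head > len)
			head = len;
		for (len -= head; head; head--)
			*d++ = *s++;
		memcpy(d, s, len);
	}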
>
> I think Skylake is the oldest yeller to worry about, if one insists on it.
>
> That said, if memory serves right these bufs like to be misaligned to
> weird extents; it may very well be that in your tests aligning to 8 had
> the side effect of aligning to 16 even.
>
> > >
> > > Moreover, I have some recollection that there were uarchs with ERMS
> > > which also liked the target to be aligned -- as in perhaps this should
> > > be done regardless of FSRM?
Dunno, the only report is some AMD CPU being slow with misaligned writes.
But that is in the copy loop, not 'rep movsq'.
I don't have one to test.
> >
> > Where I tested I didn't see improvements; maybe there is some case,
> > but I didn't hit one.
> >
> > >
> > > And most importantly, memset, memcpy and clear_user could all use a
> > > revamp: they are missing rep handling for bigger sizes (I verified
> > > such sizes *do* show up). Not only that, but memcpy uses overlapping
> > > stores while memset just loops over stuff.
> > >
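(On the overlapping-store trick Mateusz mentions: load both ends first,
then store both ends, so a copy never needs a byte-at-a-time tail loop.
An illustrative sketch for the 8..16 byte case - not the actual kernel
code, which is asm:)

	#include <stdint.h>
	#include <string.h>

	/* Copy len bytes, 8 <= len <= 16, using two 8-byte accesses
	 * that may overlap in the middle.  Both loads happen before
	 * either store, so the overlap is harmless.
	 */
	static void copy_8_to_16(void *dst, const void *src, size_t len)
	{
		uint64_t head, tail;

		memcpy(&head, src, 8);
		memcpy(&tail, (const char *)src + len - 8, 8);
		memcpy(dst, &head, 8);
		memcpy((char *)dst + len - 8, &tail, 8);
	}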
> > > I intended to sort this out a long time ago and maybe will find some
> > > time now that I got reminded of it, but I would be delighted if it got
> > > picked up.
> > >
> > > Hacking this up is just some screwing around; the real time-consuming
> > > part is the benchmarking, so I completely understand if you are not
> > > interested.
> >
> > Yes, most of the time you spend is on benchmarking. Maybe later I could
> > try to take a look, but I won't make any promises.
I found I needed to use the performance counter to get a proper cycle count,
but then read the register directly to avoid all the 'library' overhead,
and put lfence/mfence on both sides of the cycle count read.
After subtracting the overhead of a 'null function' I could measure the
number of clocks each operation took, so I could tell when I was actually
getting 32 bytes copied per clock.
(Or, testing the IP checksum code, the bytes/clock - that can get to 12.)
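The fenced counter read was roughly this (a sketch; it assumes performance
counter 0 has been programmed to count core cycles, e.g. via
perf_event_open(), and that user-space rdpmc is enabled):

	#include <stdint.h>

	/* Read general-purpose performance counter 0 directly with
	 * rdpmc, fenced on both sides so neither the measured code nor
	 * the counter read can be reordered.  The counter index
	 * (ecx = 0) and its cycle-counting setup are assumptions.
	 */
	static inline uint64_t cycle_count(void)
	{
		uint32_t lo, hi;

		asm volatile("lfence; mfence" ::: "memory");
		asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (0));
		asm volatile("lfence; mfence" ::: "memory");
		return ((uint64_t)hi << 32) | lo;
	}

Call it around the code under test, measure an empty function the same
way, and subtract to get the clocks the operation itself took.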
David
> >
>
> Now I'm curious enough about what's up here. If I don't run out of steam,
> I'm gonna cover memset and memcpy myself.
>