Message-ID: <CAHk-=wjjF7tQ4ycPiA4gbYqF-dpTQx+VVHCDqjWR=ogqNUR51g@mail.gmail.com>
Date: Fri, 17 Nov 2023 08:36:03 -0500
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: David Laight <David.Laight@...lab.com>
Cc: Borislav Petkov <bp@...en8.de>,
David Howells <dhowells@...hat.com>,
kernel test robot <oliver.sang@...el.com>,
"oe-lkp@...ts.linux.dev" <oe-lkp@...ts.linux.dev>,
"lkp@...el.com" <lkp@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Christian Brauner <brauner@...nel.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
Christian Brauner <christian@...uner.io>,
Matthew Wilcox <willy@...radead.org>,
"ying.huang@...el.com" <ying.huang@...el.com>,
"feng.tang@...el.com" <feng.tang@...el.com>,
"fengwei.yin@...el.com" <fengwei.yin@...el.com>,
linux-toolchains ML <linux-toolchains@...r.kernel.org>
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression
On Fri, 17 Nov 2023 at 08:09, David Laight <David.Laight@...lab.com> wrote:
>
> Zero length copies are different, they always take ~60 clocks.
That zero-length thing is some odd microcode implementation issue, and
I think Intel actually made an FZRM CPUID bit available for it ("Fast
Zero-length REP MOVSB").
I don't think we care in the kernel, but somebody else did (or maybe
Intel added a flag for "we fixed it" just because they noticed).
I at some point did some profiling, and we do have zero-length memcpy
cases occasionally (at least for user copies, which was what I was
looking at), but they aren't common enough to worry about some small
extra strange overhead.
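
Zero-length copies of this kind typically come from code that copies a fixed-size header first and a variable-length remainder second, where the remainder is often empty. The following is a minimal userspace sketch of that pattern, not kernel code: `struct demo_ioctl_hdr`, `demo_copy_ioctl_arg()` and the `zero_len_tail_copies` counter are hypothetical names, and plain `memcpy()` stands in for a user copy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size header at the start of an ioctl argument. */
struct demo_ioctl_hdr {
    unsigned int cmd_flags;
    unsigned int total_len;   /* header plus optional payload, in bytes */
};

/* Illustration only: how many second-step copies had length zero. */
static unsigned long zero_len_tail_copies;

/*
 * Two-step copy: base part first, then "the rest".
 * memcpy() stands in for a user copy here (assumption).
 * Returns 0 on success, -1 if the declared length is invalid.
 */
int demo_copy_ioctl_arg(void *dst, size_t dst_size, const void *src)
{
    struct demo_ioctl_hdr hdr;
    size_t rest;

    /* Step 1: copy the base part so total_len can be inspected. */
    memcpy(&hdr, src, sizeof(hdr));
    if (hdr.total_len < sizeof(hdr) || hdr.total_len > dst_size)
        return -1;

    memcpy(dst, &hdr, sizeof(hdr));

    /* Step 2: copy the rest, which is frequently zero bytes long. */
    rest = hdr.total_len - sizeof(hdr);
    if (rest == 0)
        zero_len_tail_copies++;
    memcpy((char *)dst + sizeof(hdr), (const char *)src + sizeof(hdr), rest);

    return 0;
}
```

Skipping the second copy when `rest` is zero would avoid the zero-length case entirely, at the cost of an extra branch on every call.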
(In case you care, it was for things like an ioctl doing "copy the
base part of the ioctl data, then copy the rest separately". Where
"the rest" was then often nothing at all).
> My current guess for the 5000 clocks is that the logic to
> decode 'rep movsb' is loaded into a buffer that is also used
> to decode some other instructions.
Unlikely.
I would guess it's the "power up the AVX2 side". The memory copy uses
those same resources internally.
You could try to see if "first AVX memory access" (or similar) has the
same extra initial cpu cycle issue.
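
One way to probe that guess is to time several back-to-back runs of each operation and compare the first run against the later ones. The sketch below assumes GCC or Clang on x86-64; `rep_movsb_copy()`, `avx_touch()` and `time_probe()` are hypothetical helper names. It reads the TSC without serialization, so the numbers are rough TSC ticks rather than exact core clocks, and earlier AVX use elsewhere in the process (for example inside libc) could hide a first-use cost.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <x86intrin.h>
#define DEMO_X86_64 1
#endif

/* Copy n bytes with 'rep movsb' on x86-64; plain memcpy elsewhere. */
static void rep_movsb_copy(void *dst, const void *src, size_t n)
{
#ifdef DEMO_X86_64
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}

/* Unserialized TSC read; returns 0 on other architectures. */
static uint64_t ticks_now(void)
{
#ifdef DEMO_X86_64
    return __rdtsc();
#else
    return 0;
#endif
}

#ifdef DEMO_X86_64
/* One 32-byte AVX load and store, compiled with AVX enabled for this function only. */
__attribute__((target("avx")))
static void avx_touch(const void *src, void *dst)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, v);
}
#endif

enum probe { PROBE_REP_MOVSB, PROBE_AVX };

/* Record the tick count of each of `rounds` consecutive runs in out[]. */
static void time_probe(enum probe which, uint64_t *out, size_t rounds)
{
    static unsigned char src[256], dst[256];

    for (size_t i = 0; i < rounds; i++) {
        uint64_t t0 = ticks_now();

        if (which == PROBE_REP_MOVSB) {
            rep_movsb_copy(dst, src, sizeof(src));
        } else {
#ifdef DEMO_X86_64
            /* Does nothing when the CPU lacks AVX. */
            if (__builtin_cpu_supports("avx"))
                avx_touch(src, dst);
#endif
        }
        out[i] = ticks_now() - t0;
    }
}
```

If `out[0]` for `PROBE_AVX` in a fresh process shows a spike similar to the first `rep movsb`, and running the AVX probe first removes the spike from a later `PROBE_REP_MOVSB` run, that would be consistent with a shared warm-up cost. It would not prove the cause.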
Anyway, the CPU you are testing is new enough to have ERMS - that's
the "we do pretty well on string instructions" flag. It does indeed do
pretty well on string instructions, but has a few oddities in addition
to the zero-sized thing.
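
Both flags mentioned here can be read from CPUID leaf 7: ERMS is EBX bit 9 of subleaf 0, and FZRM is EAX bit 10 of subleaf 1. A minimal sketch follows, assuming GCC or Clang with `<cpuid.h>`; the struct and function names are hypothetical, and the decode step is split out so it can be checked without the hardware.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define DEMO_HAVE_CPUID 1
#endif

#define ERMS_BIT  9u    /* CPUID.(EAX=7,ECX=0):EBX */
#define FZRM_BIT 10u    /* CPUID.(EAX=7,ECX=1):EAX */

struct rep_movs_features {
    bool erms;   /* enhanced REP MOVSB/STOSB */
    bool fzrm;   /* fast zero-length REP MOVSB */
};

/* Pure decode of the raw register values. */
struct rep_movs_features decode_rep_movs_features(unsigned ebx_7_0,
                                                  unsigned eax_7_1)
{
    struct rep_movs_features f;

    f.erms = (ebx_7_0 >> ERMS_BIT) & 1u;
    f.fzrm = (eax_7_1 >> FZRM_BIT) & 1u;
    return f;
}

/*
 * Query the running CPU. Returns false when this is not an x86 build
 * or CPUID leaf 7 is not available.
 */
bool query_rep_movs_features(struct rep_movs_features *out)
{
#ifdef DEMO_HAVE_CPUID
    unsigned eax, ebx, ecx, edx;
    unsigned eax1 = 0, ebx1, ecx1, edx1;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;

    /* For leaf 7, subleaf 0, EAX reports the highest valid subleaf. */
    if (eax >= 1)
        __get_cpuid_count(7, 1, &eax1, &ebx1, &ecx1, &edx1);

    *out = decode_rep_movs_features(ebx, eax1);
    return true;
#else
    (void)out;
    return false;
#endif
}
```

On Linux the same information is usually visible as the `erms` and `fzrm` entries in the flags line of `/proc/cpuinfo`, when the kernel is new enough to report them.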
The other bad cases tend to be along the lines of "it falls flat on its
face when the source and destination address are not mutually aligned,
but they are the same virtual address modulo 4096".
Or something like that. I forget the exact details. The details do
exist, but I forget where (I suspect either Agner Fog or some footnote
in some Intel architecture manual).
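
When hunting for such cases in a benchmark, it helps to classify each source/destination pair by how the two addresses relate. The helpers below are a sketch with hypothetical names and an assumed 4096-byte page size. They only describe the address relationship and make no claim about which relationship is the slow one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u   /* assumed page size */

/* True when both addresses have the same offset within a 4096-byte page. */
bool same_page_offset(const void *a, const void *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & (DEMO_PAGE_SIZE - 1)) == 0;
}

/*
 * True when the two addresses differ by a multiple of `align`
 * (a power of two), so both reach an `align` boundary together.
 */
bool mutually_aligned(const void *a, const void *b, size_t align)
{
    return (((uintptr_t)a - (uintptr_t)b) & (align - 1)) == 0;
}

/* Distance from b to a, reduced modulo the page size. */
unsigned page_offset_delta(const void *a, const void *b)
{
    return (unsigned)(((uintptr_t)a - (uintptr_t)b) & (DEMO_PAGE_SIZE - 1));
}
```

By these definitions, equal page offsets imply mutual alignment for every power-of-two `align` up to 4096, so the quoted description cannot hold literally, in keeping with the "or something like that" caveat. Sweeping `page_offset_delta()` over a range of values in a timing loop is one way to find where a given CPU actually slows down.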
So it's very much not as simple as "fixed initial cost and then a
fairly fixed cost per 32B", even if that is *one* pattern.
Linus