Message-ID: <4cfd4808cc694f169aa8b83547ebc74d@AcuMS.aculab.com>
Date: Thu, 16 Nov 2023 16:55:33 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Linus Torvalds' <torvalds@...ux-foundation.org>,
David Howells <dhowells@...hat.com>
CC: Borislav Petkov <bp@...en8.de>,
kernel test robot <oliver.sang@...el.com>,
"oe-lkp@...ts.linux.dev" <oe-lkp@...ts.linux.dev>,
"lkp@...el.com" <lkp@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Christian Brauner <brauner@...nel.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
Christian Brauner <christian@...uner.io>,
Matthew Wilcox <willy@...radead.org>,
"ying.huang@...el.com" <ying.huang@...el.com>,
"feng.tang@...el.com" <feng.tang@...el.com>,
"fengwei.yin@...el.com" <fengwei.yin@...el.com>
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput
-16.9% regression
From: Linus Torvalds
> Sent: 16 November 2023 03:27
>
> On Wed, 15 Nov 2023 at 18:00, David Howells <dhowells@...hat.com> wrote:
...
> > A disassembly of _copy_from_iter() for the latter is attached. Note that the
> > UBUF/IOVEC still uses "rep movsb"
>
> Well, yes and no.
>
> User copies do that X86_FEATURE_FSRM alternatives dance, so the code
> gets generated with "rep movs", but you'll note that there are several
> 'nops' after it.
>
> Some of the nops are because we'll be inserting STAC/CLAC (three bytes
> each, I think) instructions around user accesses for SMAP-capable
> CPU's.
>
> But some of the nops are because we'll be rewriting that "rep stosb"
> (two bytes, iirc) as "call rep_stos_alternative" (5 bytes) on CPU's
> that don't do FSRM like yours. So your CPU won't actually be executing
> that 'rep stosb' sequence.
I presume lack of coffee is responsible for the s/movs/stos/ :-)
How much difference does FSRM actually make?
Especially when compared to the cost of a function call (even
without the horrid return thunk).
For small %cx I think modern non-FSRM CPUs are ~2 clocks/byte
(no fixed overhead).
Which means 'rep movsb' wins for both short and long copies.
I wonder at what sizes the function call (with all its size-based
compares at the top) is actually a win.
There has to be some mileage in getting the compiler to generate
'call memcpy' (for non-constant sizes) and then run-time patching
the 5 byte 'call offset' into 'mov %edx,%ecx; rep movsb'.
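As a rough illustration of the byte budget for that patch: 'mov %edx,%ecx' encodes as 89 d1 and 'rep movsb' as f3 a4, so the pair is 4 bytes and needs one nop to fill the 5-byte 'call rel32' slot. The sketch below does the rewrite on a plain byte buffer. The helper name patch_call_to_rep_movsb() is hypothetical and this is not the kernel's alternatives machinery; it assumes the x86-64 calling convention, where memcpy's destination, source and length arrive in %rdi, %rsi and %rdx, so only the length has to be moved into the count register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CALL_LEN 5	/* 'call rel32' is e8 followed by a 32-bit offset */

/*
 * Hypothetical helper (not a kernel API): overwrite a 5-byte
 * 'call rel32' held in site[] with 'mov %edx,%ecx; rep movsb; nop'.
 * Returns 0 on success, -1 if site[] does not start with opcode e8.
 */
static int patch_call_to_rep_movsb(uint8_t site[CALL_LEN])
{
	static const uint8_t repl[] = {
		0x89, 0xd1,	/* mov %edx,%ecx: length into the rep count */
		0xf3, 0xa4,	/* rep movsb */
		0x90,		/* nop: pad 4 bytes of code to the call's 5 */
	};

	/* The replacement must fit the call site exactly. */
	static_assert(sizeof(repl) == CALL_LEN, "replacement must be 5 bytes");

	if (site[0] != 0xe8)
		return -1;	/* not a 'call rel32'; leave it untouched */

	memcpy(site, repl, CALL_LEN);
	return 0;
}
```

Calling it on a buffer holding e8 followed by any four offset bytes leaves 89 d1 f3 a4 90 in place. Two caveats of the inline sequence itself: the 32-bit mov zero-extends, so it only covers lengths below 4GB, and unlike a real memcpy call it does not leave the destination pointer in %rax.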
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)