Message-ID: <4cfd4808cc694f169aa8b83547ebc74d@AcuMS.aculab.com>
Date:   Thu, 16 Nov 2023 16:55:33 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Linus Torvalds' <torvalds@...ux-foundation.org>,
        David Howells <dhowells@...hat.com>
CC:     Borislav Petkov <bp@...en8.de>,
        kernel test robot <oliver.sang@...el.com>,
        "oe-lkp@...ts.linux.dev" <oe-lkp@...ts.linux.dev>,
        "lkp@...el.com" <lkp@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Christian Brauner <brauner@...nel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
        Christian Brauner <christian@...uner.io>,
        Matthew Wilcox <willy@...radead.org>,
        "ying.huang@...el.com" <ying.huang@...el.com>,
        "feng.tang@...el.com" <feng.tang@...el.com>,
        "fengwei.yin@...el.com" <fengwei.yin@...el.com>
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput
 -16.9% regression
From: Linus Torvalds
> Sent: 16 November 2023 03:27
> 
> On Wed, 15 Nov 2023 at 18:00, David Howells <dhowells@...hat.com> wrote:
...
> > A disassembly of _copy_from_iter() for the latter is attached.  Note that the
> > UBUF/IOVEC still uses "rep movsb"
> 
> Well, yes and no.
> 
> User copies do that X86_FEATURE_FSRM alternatives dance, so the code
> gets generated with "rep movs", but you'll note that there are several
> 'nops' after it.
> 
> Some of the nops are because we'll be inserting STAC/CLAC (three bytes
> each, I think) instructions around user accesses for SMAP-capable
> CPU's.
> 
> But some of the nops are because we'll be rewriting that "rep stosb"
> (two bytes, iirc) as "call rep_stos_alternative" (5 bytes) on CPU's
> that don't do FSRM like yours. So your CPU won't actually be executing
> that 'rep stosb' sequence.
I presume lack of coffee is responsible for the s/movs/stos/ :-)

How much difference does FSRM actually make?
Especially when compared to the cost of a function call (even
without the horrid return thunk).
For small %cx I think modern non-FSRM CPUs are ~2 clocks/byte
(no fixed overhead).
Which means 'rep movsb' wins for both short and long copies.
I wonder for what sizes the function call (with all its size-based
compares at the top) is actually a win.
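
One way to put numbers on that question would be to time a bare
'rep movsb' against the out-of-line copy over a range of sizes.
A minimal user-space sketch of the inline half follows. The name
copy_rep_movsb() is hypothetical, and it assumes GCC/Clang extended
asm on x86-64 with the direction flag clear (as the ABI requires);
on any other target it just falls back to memcpy().

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical wrapper, not kernel code: copy n bytes with a bare
 * 'rep movsb'.  %rdi = destination, %rsi = source, %rcx = count;
 * all three are modified by the instruction, hence the "+" constraints.
 */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;    /* keep the original pointer, like memcpy() */

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);    /* fallback for non-x86-64 builds */
#endif
    return ret;
}
```

Timing this in a loop against a plain memcpy() call for lengths of,
say, 1 to 256 bytes would show where (if anywhere) the call wins on
a given CPU.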
There has to be some mileage in getting the compiler to generate
'call memcpy' (for non-constant sizes) and then run-time patching
the 5 byte 'call offset' into 'mov %edx,%ecx; rep movsb'.
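
As a rough illustration of why that fits: 'mov %edx,%ecx' encodes
as 2 bytes and 'rep movsb' as 2 bytes, so the pair plus one nop
exactly fills the 5-byte 'call rel32' slot. The sketch below only
manipulates a byte buffer in user space. patch_call_to_rep_movsb()
is a hypothetical name, not an existing kernel helper, and it assumes
the usual x86-64 argument registers (%rdi, %rsi, %rdx).

```c
#include <stdint.h>
#include <string.h>

enum { CALL_SITE_LEN = 5 };    /* 0xe8 opcode + 32-bit displacement */

/*
 * Hypothetical helper: overwrite a 5-byte 'call rel32' with
 * 'mov %edx,%ecx; rep movsb; nop'.
 * Returns 0 on success, -1 if the site is not a call.
 */
static int patch_call_to_rep_movsb(uint8_t site[CALL_SITE_LEN])
{
    static const uint8_t inline_copy[CALL_SITE_LEN] = {
        0x89, 0xd1,    /* mov %edx,%ecx : length argument -> count */
        0xf3, 0xa4,    /* rep movsb     : copy %rcx bytes          */
        0x90,          /* nop           : pad to 5 bytes           */
    };

    if (site[0] != 0xe8)    /* only rewrite a 'call rel32' */
        return -1;
    memcpy(site, inline_copy, sizeof(inline_copy));
    return 0;
}
```

Two things the sketch glosses over: the inline sequence does not
set %rax, so it only works where memcpy()'s return value is unused,
and the 32-bit mov truncates lengths of 4GB and above.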
	David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)