Date:   Fri, 17 Nov 2023 08:36:03 -0500
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     David Laight <David.Laight@...lab.com>
Cc:     Borislav Petkov <bp@...en8.de>,
        David Howells <dhowells@...hat.com>,
        kernel test robot <oliver.sang@...el.com>,
        "oe-lkp@...ts.linux.dev" <oe-lkp@...ts.linux.dev>,
        "lkp@...el.com" <lkp@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Christian Brauner <brauner@...nel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
        Christian Brauner <christian@...uner.io>,
        Matthew Wilcox <willy@...radead.org>,
        "ying.huang@...el.com" <ying.huang@...el.com>,
        "feng.tang@...el.com" <feng.tang@...el.com>,
        "fengwei.yin@...el.com" <fengwei.yin@...el.com>,
        linux-toolchains ML <linux-toolchains@...r.kernel.org>
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput
 -16.9% regression

On Fri, 17 Nov 2023 at 08:09, David Laight <David.Laight@...lab.com> wrote:
>
> Zero length copies are different, they always take ~60 clocks.

That zero-length thing is some odd microcode implementation issue, and
I think Intel actually made an FZRM cpuid bit available for it ("Fast
Zero-length REP MOVSB").

I don't think we care in the kernel, but somebody else did (or maybe
Intel added a flag for "we fixed it" just because they noticed).

I at some point did some profiling, and we do have zero-length memcpy
cases occasionally (at least for user copies, which was what I was
looking at), but they aren't common enough to worry about some small
extra strange overhead.

(In case you care, it was for things like an ioctl doing "copy the
base part of the ioctl data, then copy the rest separately".  Where
"the rest" was then often nothing at all).
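The pattern described above can be sketched in user-space C. This is only an illustration: `struct req_base`, its fields, and `copy_request()` are hypothetical names, and plain `memcpy()` stands in for the kernel's user-copy helpers. The point is that the second copy is issued unconditionally, so its length is frequently zero.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size header that starts every request. */
struct req_base {
    unsigned int cmd;
    unsigned int extra_len;   /* bytes of optional data after the header */
};

/*
 * Copy the base part first, then "the rest" separately.
 * memcpy() is a stand-in for a user copy; this is not kernel code.
 * Returns the number of bytes copied, or -1 if a buffer is too small.
 */
static long copy_request(void *dst, size_t dst_size,
                         const void *src, size_t src_size)
{
    struct req_base base;
    size_t extra;

    if (src_size < sizeof(base) || dst_size < sizeof(base))
        return -1;

    /* Step 1: the base part, always sizeof(base) bytes. */
    memcpy(&base, src, sizeof(base));

    extra = base.extra_len;
    if (extra > src_size - sizeof(base) || extra > dst_size - sizeof(base))
        return -1;

    memcpy(dst, &base, sizeof(base));

    /* Step 2: "the rest". When extra == 0 this is a zero-length copy. */
    memcpy((char *)dst + sizeof(base),
           (const char *)src + sizeof(base), extra);

    return (long)(sizeof(base) + extra);
}
```

A caller that sets `extra_len` to 0 gets back `sizeof(struct req_base)`, and the second `memcpy()` copies nothing.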

> My current guess for the 5000 clocks is that the logic to
> decode 'rep movsb' is loaded into a buffer that is also used
> to decode some other instructions.

Unlikely.

I would guess it's the "power up the AVX2 side". The memory copy uses
those same resources internally.

You could try to see if "first AVX memory access" (or similar) has the
same extra initial cpu cycle issue.

Anyway, the CPU you are testing is new enough to have ERMS - that's
the "we do pretty well on string instructions" flag. It does indeed do
pretty well on string instructions, but has a few oddities in addition
to the zero-sized thing.
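For reference, a minimal sketch of how user space could query the two flags mentioned so far, assuming GCC or Clang and their `<cpuid.h>` helper `__get_cpuid_count()`. The bit positions used here (ERMS in leaf 7, subleaf 0, EBX bit 9; FZRM in leaf 7, subleaf 1, EAX bit 10) follow the layout in Linux's cpufeatures.h and should be checked against the Intel SDM. `cpuid_bit()`, `struct string_features` and `query_string_features()` are names made up for this example.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Return 1 if bit `bit` (0..31) is set in a CPUID output register. */
static int cpuid_bit(unsigned int reg, unsigned int bit)
{
    return (int)((reg >> bit) & 1u);
}

/* Hypothetical result type for this sketch. */
struct string_features {
    int erms;   /* Enhanced REP MOVSB/STOSB */
    int fzrm;   /* Fast zero-length REP MOVSB */
};

/* Reports both flags as 0 on non-x86 builds or when leaf 7 is missing. */
static struct string_features query_string_features(void)
{
    struct string_features f = { 0, 0 };
#if HAVE_X86_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    unsigned int max_subleaf;

    /* Leaf 7, subleaf 0: EBX bit 9 is ERMS, EAX is the max subleaf. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return f;
    f.erms = cpuid_bit(ebx, 9);
    max_subleaf = eax;

    /* Leaf 7, subleaf 1: EAX bit 10 is FZRM (only if the subleaf exists). */
    if (max_subleaf >= 1 &&
        __get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        f.fzrm = cpuid_bit(eax, 10);
#endif
    return f;
}
```

On Linux the same information is usually visible as `erms` and `fzrm` in the flags line of /proc/cpuinfo, which is the easier check when no code is needed.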

The other bad cases tend to be along the lines of "it falls flat on its
face when the source and destination address are not mutually aligned,
but they are the same virtual address modulo 4096".

Or something like that. I forget the exact details. The details do
exist, but I forget where (I suspect either Agner Fog or some footnote
in some Intel architecture manual).

So it's very much not as simple as "fixed initial cost and then a
fairly fixed cost per 32B", even if that is *one* pattern.
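To make that one pattern concrete, here is a sketch of the simple model: a fixed setup cost plus a fixed cost per 32-byte chunk, rounded up. `model_copy_cycles()` and its parameters are hypothetical, and both costs are supplied by the caller, so no measured numbers are implied. The model deliberately ignores everything else discussed in this message (the zero-length case, any one-time warm-up, and the alignment-related bad cases), which is why it should not be trusted on its own.

```c
#include <stddef.h>

/*
 * Naive cost model: setup_cycles + ceil(len / 32) * cycles_per_32b.
 * Both cost parameters are assumptions provided by the caller.
 */
static unsigned long model_copy_cycles(size_t len,
                                       unsigned long setup_cycles,
                                       unsigned long cycles_per_32b)
{
    /* Number of 32-byte chunks, rounding a partial chunk up to a full one. */
    size_t chunks = len / 32 + (len % 32 != 0);

    return setup_cycles + (unsigned long)chunks * cycles_per_32b;
}
```

Comparing measured timings against this model for a range of lengths and alignments is one way to see where real hardware departs from it.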

                Linus
