Date:   Tue, 12 Sep 2023 11:48:55 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     David Laight <David.Laight@...lab.com>
Cc:     Mateusz Guzik <mjguzik@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>
Subject: Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs
 without ERMS

On Mon, 11 Sept 2023 at 03:38, David Laight <David.Laight@...lab.com> wrote:
>
> The overhead of 'rep movsb' is about 36 clocks, 'rep movsq' only 16.

Note that the hard case for 'rep movsq' is when the stores cross a
cacheline (or worse yet, a page) boundary.

That is what makes 'rep movsb' fundamentally simpler in theory. The
natural reaction is "but movsq does things 8 bytes at a time", but
once you start doing any kind of optimizations that are based on
bigger areas, the byte counts are actually simpler. You can always
do them as masked writes up to whatever boundary you like, and just
restart. There are never any "what about the straddling bytes" issues.
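
To make the "masked writes up to whatever boundary" point concrete, here
is a rough user-space sketch (my own toy code, not the kernel's copy
routines and certainly not what the microcode does): with a byte count
you can always cut the copy at a cache-line boundary and restart, and no
individual store ever straddles a line:

#include <stddef.h>
#include <string.h>

/* Toy illustration only: split a byte-granular copy at 64-byte
 * cache-line boundaries so no chunk ever straddles a line. */
static void copy_split_at_line(char *dst, const char *src, size_t len)
{
        while (len) {
                /* bytes until 'dst' hits the next 64-byte boundary */
                size_t chunk = 64 - ((unsigned long)dst & 63);

                if (chunk > len)
                        chunk = len;

                memcpy(dst, src, chunk);
                dst += chunk;
                src += chunk;
                len -= chunk;
        }
}

Try to impose a hard 8-byte store granularity on that loop and the store
that crosses the boundary has to be special-cased - which is exactly the
"straddling bytes" problem.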

That's one of the dangers with benchmarking. Do you benchmark the
unaligned cases? How much do they matter in real life? Do they even
happen?

And that's entirely ignoring any "cold vs hot caches" etc issues, or
the "what is the cost of access _after_ the memcpy/memsert".

Or, in the case of the kernel, our issues with "function calls can now
be surprisingly expensive, and if we can inline things it can win back
20 cycles from a forced mispredict".

(And yes, I mean _any_ function calls. The indirect function calls are
even worse and more wildly horrific, but sadly, with the return
prediction issues, even a perfectly regular function call is no longer
"a cycle or two")

So beware microbenchmarks. That's true in general, but it's
_particularly_ true of memset/memcpy.

                  Linus
