Message-ID: <CAHk-=wiXw+NSW6usWH31Y6n4CnF5LiOs_vJREb8_U290W9w3KQ@mail.gmail.com>
Date:   Tue, 12 Sep 2023 11:48:55 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     David Laight <David.Laight@...lab.com>
Cc:     Mateusz Guzik <mjguzik@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>
Subject: Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs
 without ERMS

On Mon, 11 Sept 2023 at 03:38, David Laight <David.Laight@...lab.com> wrote:
>
> The overhead of 'rep movsb' is about 36 clocks, 'rep movsq' only 16.

Note that the hard case for 'rep movsq' is when the stores cross a
cacheline (or worse yet, a page) boundary.

That is what makes 'rep movsb' fundamentally simpler in theory. The
natural reaction is "but movsq does things 8 bytes at a time", but
once you start doing any kind of optimization that actually works on
bigger areas, a plain byte count is simpler: you can always do the
stores as masked writes up to whatever boundary you like, and just
restart. There are never any "what about the straddling bytes" issues.
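
As a rough illustration (a user-space sketch, not the kernel's actual
copy_to/from_user routines, and skipping all the fault handling those
need): with 'rep movsq' the byte count always has to be split into
whole quadwords plus a 0-7 byte tail that gets its own handling, while
'rep movsb' just takes the byte count as-is.

  #include <stddef.h>

  /* 'rep movsq' style: split the count, then deal with the 0-7
   * leftover bytes separately (here with a trailing 'rep movsb'). */
  void copy_movsq(void *dst, const void *src, size_t len)
  {
          size_t qwords = len >> 3;       /* whole 8-byte chunks */
          size_t tail   = len & 7;        /* the straddling 0-7 bytes */

          asm volatile("rep movsq"
                       : "+D" (dst), "+S" (src), "+c" (qwords)
                       : : "memory");
          asm volatile("rep movsb"
                       : "+D" (dst), "+S" (src), "+c" (tail)
                       : : "memory");
  }

  /* 'rep movsb' style: one instruction, one byte count, no tail case. */
  void copy_movsb(void *dst, const void *src, size_t len)
  {
          asm volatile("rep movsb"
                       : "+D" (dst), "+S" (src), "+c" (len)
                       : : "memory");
  }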

That's one of the dangers with benchmarking. Do you benchmark the
unaligned cases? How much do they matter in real life? Do they even
happen?
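
A minimal sketch of what covering those cases could look like (a
user-space toy, not anything from this thread; the buffer, the 4096-byte
copy size, the rdtsc timing and the best-of-N choice are all just
assumptions for illustration):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <x86intrin.h>          /* __rdtsc() */

  /* Non-static and 64-byte aligned so the copies can't be optimized
   * away and the misalignment offsets below mean what they say. */
  unsigned char src[1 << 20] __attribute__((aligned(64)));
  unsigned char dst[(1 << 20) + 64] __attribute__((aligned(64)));

  int main(void)
  {
          /* Sweep destination misalignment instead of timing only
           * the nicely aligned case. */
          for (int misalign = 0; misalign < 16; misalign++) {
                  uint64_t best = UINT64_MAX;

                  for (int run = 0; run < 100; run++) {
                          uint64_t t0 = __rdtsc();
                          memcpy(dst + misalign, src, 4096);
                          uint64_t t1 = __rdtsc();
                          if (t1 - t0 < best)
                                  best = t1 - t0;
                  }
                  printf("dst misalign %2d: %llu cycles (hot cache, best of 100)\n",
                         misalign, (unsigned long long)best);
          }
          return 0;
  }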

And that's entirely ignoring any "cold vs hot caches" etc issues, or
the "what is the cost of access _after_ the memcpy/memset" question.

Or, in the case of the kernel, our issues with "function calls can now
be surprisingly expensive, and if we can inline things it can win back
20 cycles from a forced mispredict".

(And yes, I mean _any_ function call. Indirect function calls are
even worse and more widely horrific, but sadly, with the return
prediction issues, even a perfectly regular function call is no longer
"a cycle or two".)

So beware microbenchmarks. That's true in general, but it's
_particularly_ true of memset/memcpy.

                  Linus
