Message-ID: <CAGudoHGR9nbSDTUXt02vi5VmNn4eOFbgNiTfvTThA7Kecz4SWQ@mail.gmail.com>
Date: Fri, 10 Jan 2025 20:14:45 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Kees Cook <kees@...nel.org>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, lkp@...el.com, 
	linux-kernel@...r.kernel.org, Thomas Weißschuh <linux@...ssschuh.net>, 
	Nilay Shroff <nilay@...ux.ibm.com>, Yury Norov <yury.norov@...il.com>, 
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>, linux-hardening@...r.kernel.org
Subject: Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput
 17.3% improvement

On Fri, Jan 10, 2025 at 5:58 PM Kees Cook <kees@...nel.org> wrote:
>
> On Thu, Jan 09, 2025 at 11:01:47PM +0100, Mateusz Guzik wrote:
> > That is to say, contrary to the report above, I believe the change is
> > in fact a regression which just so happened to make things faster for
> > a specific case. The unintended speed up can be achieved without
> > regressing anything else by taming the craziness.
>
> How do we best make sense of the perf report? Even in the iter case
> above, it looks like a perf improvement?
>

Without your change, the gcc-compiled kernel leaves performance on the
table in select cases, namely when gcc elects to use rep movsq for
sizes below a magic threshold (which depends on the uarch).
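To make that concrete, here is a standalone userspace sketch (not
kernel code, the function name is made up). Depending on the gcc
version and -mtune, the bounded-range memcpy below may get expanded
inline as rep movsq:

/* Standalone sketch, not kernel code; the function name is made up.
 * With -O2 on x86-64, gcc may use the value range established by the
 * clamp below to expand the memcpy inline, and depending on gcc
 * version and -mtune that expansion can be rep movsq, which loses to
 * a plain call or a few movs for small sizes on some uarchs. */
#include <string.h>

void copy_bounded(void *dst, const void *src, unsigned long len)
{
	if (len > 256)
		len = 256;	/* gcc now knows len is in [0, 256] */
	memcpy(dst, src, len);
}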

Your change has the unintended side effect of changing
copy_page_from_iter_atomic to use a plain memcpy call, which just
happens to be the right thing to do for this particular consumer.
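As I understand it, the change works by hiding the copy size from
gcc's value range tracking; below is a standalone sketch of why that
ends up as a plain call (HIDE_VAR just mimics the kernel's
OPTIMIZER_HIDE_VAR, everything else is made up):

/* Standalone sketch of the mechanism, assuming the fortify change
 * hides the copy size from gcc's value range tracking.  HIDE_VAR is
 * an empty asm that launders the value through a register, discarding
 * whatever range gcc had inferred.  With the range gone, gcc gives up
 * on the inline expansion and emits an out-of-line memcpy call. */
#include <string.h>

#define HIDE_VAR(var) __asm__("" : "=r"(var) : "0"(var))

void copy_hidden(void *dst, const void *src, unsigned long len)
{
	if (len > 256)
		len = 256;	/* without HIDE_VAR gcc could expand this inline */
	HIDE_VAR(len);
	memcpy(dst, src, len);
}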

However, it also has the side effect of forcing a memcpy call in
places which were already optimized just fine -- for example, where
there is a variable number of bytes to copy but the range is small and
the upper limit is also small, gcc will elect to emit a few movs and
be done with it, which is faster than calling memcpy. That is to say,
for spots like that your change is a regression.
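For illustration, a made-up standalone example of such a spot -- the
length varies, but gcc can see it never exceeds a few bytes:

/* Standalone sketch of the "already fine" case described above: the
 * copy length is variable, but gcc can see the upper bound is tiny,
 * so it can expand the memcpy into a short inline sequence instead of
 * emitting a call.  Forcing an out-of-line memcpy here only adds call
 * overhead. */
#include <string.h>

struct small_buf {
	unsigned char len;	/* always small by construction */
	unsigned char data[8];
};

void copy_small(void *dst, const struct small_buf *b)
{
	unsigned long len = b->len & 7;	/* gcc sees len is in [0, 7] */
	memcpy(dst, b->data, len);
}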

In terms of optimizing all of this, the thing to do is to convince gcc
not to emit rep movsq for the known problematic cases, while not
messing with the places which are already optimized fine.
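Purely as a sketch of that direction (not a concrete proposal -- the
wrapper and the 16-byte cutoff are made up), something along these
lines would leave the provably-tiny copies alone and only force the
out-of-line call elsewhere:

/* Purely illustrative, not a concrete proposal; the wrapper name and
 * the 16-byte cutoff are made up.  The idea: leave gcc alone where it
 * already does the right thing (tiny, provably-bounded copies), and
 * only launder the size -- forcing an out-of-line memcpy call --
 * where the inline expansion could otherwise be rep movsq. */
#include <string.h>

#define HIDE_VAR(var) __asm__("" : "=r"(var) : "0"(var))

static inline void copy_tuned(void *dst, const void *src, unsigned long len)
{
	if (__builtin_constant_p(len <= 16) && len <= 16) {
		memcpy(dst, src, len);	/* provably small: a few movs */
	} else {
		HIDE_VAR(len);		/* kill the range -> plain memcpy call */
		memcpy(dst, src, len);
	}
}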

-- 
Mateusz Guzik <mjguzik gmail.com>
