Message-ID: <CAGudoHGR9nbSDTUXt02vi5VmNn4eOFbgNiTfvTThA7Kecz4SWQ@mail.gmail.com>
Date: Fri, 10 Jan 2025 20:14:45 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Kees Cook <kees@...nel.org>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, lkp@...el.com, 
	linux-kernel@...r.kernel.org, Thomas Weißschuh <linux@...ssschuh.net>, 
	Nilay Shroff <nilay@...ux.ibm.com>, Yury Norov <yury.norov@...il.com>, 
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>, linux-hardening@...r.kernel.org
Subject: Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput
 17.3% improvement

On Fri, Jan 10, 2025 at 5:58 PM Kees Cook <kees@...nel.org> wrote:
>
> On Thu, Jan 09, 2025 at 11:01:47PM +0100, Mateusz Guzik wrote:
> > That is to say, contrary to the report above, I believe the change is
> > in fact a regression which just so happened to make things faster for
> > a specific case. The unintended speed up can be achieved without
> > regressing anything else by taming the craziness.
>
> How do we best make sense of the perf report? Even in the iter case
> above, it looks like a perf improvement?
>

Without your change, the gcc-compiled kernel leaves performance on the
table in select cases, namely when gcc elects to use rep movsq for
sizes below a magic threshold (which depends on the uarch).
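To make that concrete, here is a standalone userspace sketch (not
kernel code, the function name is made up). Depending on the gcc
version and -mtune, the bounded-range memcpy below may get expanded
inline as rep movsq:

/* Standalone sketch, not kernel code; the function name is made up.
 * With -O2 on x86-64, gcc may use the value range established by the
 * clamp below to expand the memcpy inline, and depending on gcc
 * version and -mtune that expansion can be rep movsq, which loses to
 * a plain call or a few movs for small sizes on some uarchs. */
#include <string.h>

void copy_bounded(void *dst, const void *src, unsigned long len)
{
	if (len > 256)
		len = 256;	/* gcc now knows len is in [0, 256] */
	memcpy(dst, src, len);
}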

Your change has the unintended side effect of changing
copy_page_from_iter_atomic to use a plain memcpy call, which just
happens to be the right thing to do for this particular consumer.
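As I understand it, the change works by hiding the copy size from
gcc's value range tracking; below is a standalone sketch of why that
ends up as a plain call (HIDE_VAR just mimics the kernel's
OPTIMIZER_HIDE_VAR, everything else is made up):

/* Standalone sketch of the mechanism, assuming the fortify change
 * hides the copy size from gcc's value range tracking.  HIDE_VAR is
 * an empty asm that launders the value through a register, discarding
 * whatever range gcc had inferred.  With the range gone, gcc gives up
 * on the inline expansion and emits an out-of-line memcpy call. */
#include <string.h>

#define HIDE_VAR(var) __asm__("" : "=r"(var) : "0"(var))

void copy_hidden(void *dst, const void *src, unsigned long len)
{
	if (len > 256)
		len = 256;	/* without HIDE_VAR gcc could expand this inline */
	HIDE_VAR(len);
	memcpy(dst, src, len);
}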

However, it also has the side effect of forcing a memcpy call in
places which were already optimized just fine -- for example, where
there is a variable number of bytes to copy but the range is small and
the upper limit is also small, gcc will elect to emit a few movs and
be done with it, which is faster than calling memcpy. That is to say,
for spots like that your change is a regression.
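For illustration, a made-up standalone example of such a spot -- the
length varies, but gcc can see it never exceeds a few bytes:

/* Standalone sketch of the "already fine" case described above: the
 * copy length is variable, but gcc can see the upper bound is tiny,
 * so it can expand the memcpy into a short inline sequence instead of
 * emitting a call.  Forcing an out-of-line memcpy here only adds call
 * overhead. */
#include <string.h>

struct small_buf {
	unsigned char len;	/* always small by construction */
	unsigned char data[8];
};

void copy_small(void *dst, const struct small_buf *b)
{
	unsigned long len = b->len & 7;	/* gcc sees len is in [0, 7] */
	memcpy(dst, b->data, len);
}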

In terms of optimizing all of this, the thing to do is to convince gcc
not to emit rep movsq for the known problematic cases, while not
messing with the places which are already optimized fine.
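Purely as a sketch of that direction (not a concrete proposal -- the
wrapper and the 16-byte cutoff are made up), something along these
lines would leave the provably-tiny copies alone and only force the
out-of-line call elsewhere:

/* Purely illustrative, not a concrete proposal; the wrapper name and
 * the 16-byte cutoff are made up.  The idea: leave gcc alone where it
 * already does the right thing (tiny, provably-bounded copies), and
 * only launder the size -- forcing an out-of-line memcpy call --
 * where the inline expansion could otherwise be rep movsq. */
#include <string.h>

#define HIDE_VAR(var) __asm__("" : "=r"(var) : "0"(var))

static inline void copy_tuned(void *dst, const void *src, unsigned long len)
{
	if (__builtin_constant_p(len <= 16) && len <= 16) {
		memcpy(dst, src, len);	/* provably small: a few movs */
	} else {
		HIDE_VAR(len);		/* kill the range -> plain memcpy call */
		memcpy(dst, src, len);
	}
}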

-- 
Mateusz Guzik <mjguzik gmail.com>
