linux-kernel - Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement

Message-ID: <pep7tppcmd77ejaa47bhajc3uoy2q2n3cladgc4btdri4mth65@dqjulq2hx4l2>
Date: Thu, 9 Jan 2025 21:52:31 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Kees Cook <kees@...nel.org>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, 
	lkp@...el.com, linux-kernel@...r.kernel.org, 
	Thomas Weißschuh <linux@...ssschuh.net>, Nilay Shroff <nilay@...ux.ibm.com>, 
	Yury Norov <yury.norov@...il.com>, Greg Kroah-Hartman <gregkh@...uxfoundation.org>, 
	linux-hardening@...r.kernel.org
Subject: Re: [linus:master] [fortify]  239d87327d:  vm-scalability.throughput
 17.3% improvement

On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > 
> > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > 
> > Well that is unexpected. There should be no binary output difference
> > with that patch. I will investigate...
> 
> It looks like hiding the size value from GCC has the side-effect of
> breaking memcpy inlining in many places. I would expect this to make
> things _slower_, though. O_o
> 

This depends on what was emitted in place and what CPU is executing it.

Notably, if gcc elected to emit rep movs{q,b}, the CPU at hand does not
have FSRM, and the size is small enough, then such code can indeed be
slower than paying for a call to memcpy (which does not issue rep movs).
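
To make the three conditions concrete, here is a rough userspace sketch,
not kernel code. It checks the FSRM bit (CPUID leaf 7, subleaf 0, EDX
bit 4) and falls back to a memcpy call when the bit is absent. The names
cpu_has_fsrm, copy_rep_movsb and copy_small are made up for illustration,
and the policy in copy_small is an assumption, not what gcc or the kernel
actually does.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns 1 if the CPU advertises FSRM (fast short rep movsb), else 0.
 * FSRM is reported in CPUID.(EAX=7,ECX=0):EDX bit 4. */
int cpu_has_fsrm(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 7 not supported */
    return (edx >> 4) & 1;
#else
    return 0;                   /* not x86: no rep movsb at all */
#endif
}

/* Copy n bytes with an inline "rep movsb", roughly the kind of sequence
 * a compiler may emit in place of a memcpy call. */
void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    void *ret = dst;

    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
#else
    return memcpy(dst, src, n);
#endif
}

/* Hypothetical policy: use rep movsb only when the CPU claims it is
 * fast for short copies, otherwise pay for the memcpy call. */
void *copy_small(void *dst, const void *src, size_t n)
{
    if (cpu_has_fsrm())
        return copy_rep_movsb(dst, src, n);
    return memcpy(dst, src, n);
}
```

Timing copy_rep_movsb against memcpy for sizes of a few dozen bytes on
a machine without FSRM would be one way to see the effect described
above; the actual numbers depend on the uarch.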

For example, I have seen gcc go to great pains to align a buffer for
rep movsq even when the alignment was guaranteed to be unnecessary.

Can you disasm an example affected spot?

gcc has a bunch of magic switches to tell it what to emit inline; the
thing to do is to convince it to roll with a bunch of plain mov
instructions (not rep movs) for sizes small enough(tm). What constitutes
small enough depends on the uarch.
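
A small test case for experimenting with this could look like the sketch
below. The struct and function names are hypothetical. The first copy
has a compile-time-constant size, which compilers typically expand into
a handful of plain loads and stores; the second has a run-time size, so
the compiler has to pick between an inline sequence and a libcall. On
x86, GCC documents -mstringop-strategy= and -mmemcpy-strategy= for
steering that choice; that these are the switches meant above is my
assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 32-byte record (on LP64), small enough that a few plain
 * mov instructions can copy it. */
struct small_rec {
    unsigned long a, b, c, d;
};

/* Constant-size copy: the size is visible to the compiler, so it is
 * free to expand this inline instead of calling memcpy. */
void copy_rec(struct small_rec *dst, const struct small_rec *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Run-time-size copy: without range information about n, the compiler
 * decides between inline code (possibly rep movs) and a libcall. */
void *copy_var(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}
```

Building with something like "gcc -O2 -S" and then again with
"-mstringop-strategy=unrolled_loop" or "-mstringop-strategy=libcall"
added, and diffing the assembly for copy_var, shows what each setting
emits. -mmemcpy-strategy= takes alg:max_size:dest_align triplets for
finer control by size.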