linux-kernel - Re: [PATCH 1/3] x86: fix access_ok() and valid_user_address() using wrong USER_PTR

Message-ID: <CAHk-=wj9yyNH7Xj3r_zO2vOtwfB8+vBt03Z7XRpJE9qCo-K6vg@mail.gmail.com>
Date: Thu, 6 Nov 2025 11:49:21 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: David Laight <david.laight.linux@...il.com>
Cc: Mateusz Guzik <mjguzik@...il.com>, Borislav Petkov <bp@...en8.de>, 
	"the arch/x86 maintainers" <x86@...nel.org>, brauner@...nel.org, viro@...iv.linux.org.uk, jack@...e.cz, 
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	tglx@...utronix.de, pfalcato@...e.de
Subject: Re: [PATCH 1/3] x86: fix access_ok() and valid_user_address() using
 wrong USER_PTR_MAX in modules

On Thu, 6 Nov 2025 at 11:26, David Laight <david.laight.linux@...il.com> wrote:
>
> IIRC it was a definite performance improvement for a specific workload
> (compiling kernels) on a system where the relatively small d-cache
> caused significant overhead reading the value from memory.

Some background:

  https://lore.kernel.org/lkml/20240610204821.230388-1-torvalds@linux-foundation.org/
  https://lore.kernel.org/lkml/CAHk-=whHvMbfL2ov1MRbT9QfebO2d6-xXi1ynznCCi-k_m6Q0w@mail.gmail.com/

where that "load address from memory" was particularly noticeable on
my 128-core Altra box in profiles.

That machine really has fairly weak cores and caches (it's what I call
a "flock of chickens" design: individual cores are not particularly
interesting, and the only point of that machine is "reasonable
performance on multithreaded loads thanks to many cores").

I did have numbers, but never posted them, because as mentioned in one
of the emails:

  For example, making d_hash() avoid indirection just means that now
  pretty much _all_ the cost of __d_lookup_rcu() is in the cache misses
  on the hash table itself. Which was always the bulk of it. And on my
  arm64 machine, it turns out that the best optimization for the load I
  tested would be to make that hash table smaller to actually be a bit
  denser in the cache. But that's such a load-dependent optimization
  that I'm not doing this.
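
A minimal userspace sketch of what "avoid indirection" means for a hash
bucket lookup. All names and sizes here (sketch_table, SKETCH_HASH_BITS,
bucket_indirect, bucket_direct) are hypothetical and are not the kernel's
code. The sketch also uses compile-time constants as a stand-in: the
kernel cannot know these values at build time, so it has to get them into
the instruction stream some other way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size, chosen only for illustration. */
#define SKETCH_HASH_BITS 10u
#define SKETCH_BUCKETS   (1u << SKETCH_HASH_BITS)

static_assert(SKETCH_HASH_BITS > 0 && SKETCH_HASH_BITS < 32,
              "shift count must stay within a 32-bit hash");

struct bucket {
    void *first;    /* head of the chain for this bucket */
};

struct bucket sketch_table[SKETCH_BUCKETS];

/*
 * Variant A: the table base and the shift are ordinary globals with
 * external linkage, so each lookup typically loads both from memory
 * before it can even compute the bucket address.
 */
struct bucket *table_ptr = sketch_table;
unsigned int table_shift = 32u - SKETCH_HASH_BITS;

struct bucket *bucket_indirect(uint32_t hash)
{
    return table_ptr + (hash >> table_shift);
}

/*
 * Variant B: base and shift are constants, so the compiler can encode
 * them in the instructions and the two extra loads disappear.
 */
struct bucket *bucket_direct(uint32_t hash)
{
    return sketch_table + (hash >> (32u - SKETCH_HASH_BITS));
}
```

Both variants pick the same bucket; only the number of memory loads on
the way there differs. The loads from the table itself remain either way.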

IOW, the actual biggest impact on that machine was when I hacked the
dcache hash tables to be smaller, so that they fit better in the L2.
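
As a rough illustration of the size arithmetic only, here is a sketch
that assumes a chained hash table with one pointer-sized head per bucket.
The helper names (table_bytes, max_bits_for_cache) are hypothetical, and
the sketch counts only the bucket array, not the entries hanging off it,
so it says nothing about which size is actually best for a given load.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes used by the bucket array alone: one pointer per bucket. */
size_t table_bytes(unsigned int hash_bits)
{
    return ((size_t)1 << hash_bits) * sizeof(void *);
}

/*
 * Largest number of hash bits whose bucket array still fits in
 * cache_bytes. The caller must pass at least room for one bucket.
 */
unsigned int max_bits_for_cache(size_t cache_bytes)
{
    unsigned int bits = 0;

    assert(cache_bytes >= sizeof(void *));

    /* The upper bound on bits keeps table_bytes() from overflowing. */
    while (bits + 1 < sizeof(size_t) * 8 - 4 &&
           table_bytes(bits + 1) <= cache_bytes)
        bits++;

    return bits;
}
```

Calling max_bits_for_cache() with a cache size in bytes gives the hash
width at which the bucket array stops fitting; every bit beyond that
doubles the array's footprint.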

But that's one of those "tune for the benchmark and the particular
machine" things that I despise, so I never did that except locally for
testing.

The patches that actually got committed are "these improve performance
a bit by just making the code do the same thing, just being less
stupid".  Much less noticeable than the "tune for the machine".

               Linus