Date: Wed, 19 Jun 2024 13:25:02 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Christian Brauner <brauner@...nel.org>, Al Viro <viro@...iv.linux.org.uk>
Cc: linux-fsdevel <linux-fsdevel@...r.kernel.org>, 
	"the arch/x86 maintainers" <x86@...nel.org>, Linux ARM <linux-arm-kernel@...ts.infradead.org>, 
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: FYI: path walking optimizations pending for 6.11

I already mentioned these to Al, so he has seen most of them, because
I wanted to make sure he was ok with the link_path_walk updates. But
since he was ok (with a few comments), I cleaned things up and
separated things into branches, and here's a heads-up for a wider
audience in case anybody cares.

This all started from me doing profiling on arm64, and just being
annoyed by the code generation and some - admittedly mostly pretty
darn minor - performance issues.

It started with the arm64 user access code, moved on to
__d_lookup_rcu(), and then extended into link_path_walk(), which
together end up being the most noticeable parts of path lookup.

The user access code is mostly for strncpy_from_user() - which is the
main way the vfs layer gets the pathnames. vfs people probably don't
really care - arm people cc'd, although they've seen most of this in
earlier iterations (the minor word-at-a-time tweak is new). Same goes
for x86 people for the minor changes on that side.

I've pushed out four branches based on 6.10-rc4, because I think it's
pretty ready. But I'll rebase them if people have commentary that
needs addressing, so don't treat them as some kind of stable base yet.
My plan is to merge them during the next merge window unless somebody
screams.

The branches are:

arm64-uaccess:
    arm64: access_ok() optimization
    arm64: start using 'asm goto' for put_user()
    arm64: start using 'asm goto' for get_user() when available

link_path_walk:
    vfs: link_path_walk: move more of the name hashing into hash_name()
    vfs: link_path_walk: improve may_lookup() code generation
    vfs: link_path_walk: do '.' and '..' detection while hashing
    vfs: link_path_walk: clarify and improve name hashing interface
    vfs: link_path_walk: simplify name hash flow

runtime-constants:
    arm64: add 'runtime constant' support
    runtime constants: add x86 architecture support
    runtime constants: add default dummy infrastructure
    vfs: dcache: move hashlen_hash() from callers into d_hash()

word-at-a-time:
    arm64: word-at-a-time: improve byte count calculations for LE
    x86-64: word-at-a-time: improve byte count calculations

The arm64-uaccess branch is just what it says, and makes a big
difference in strncpy_from_user(). The "access_ok()" change is
certainly debatable, but I think it needs to be done for sanity. I think
it's one of those "let's do it, and if it causes problems we'll have
to fix things up" things.
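
For anyone who hasn't followed the earlier iterations, here's a
stand-alone userspace sketch of the control flow that 'asm goto'
enables (made-up read_word() helper, and a NULL check standing in for
the fault since there's no exception table in userspace - so it's just
an illustration of the shape, not the actual arm64 code):

    #include <stdio.h>

    /*
     * read_word() is a stand-in for get_user(): fill *out and return 0
     * on success, or branch - from inside the asm - to the error label
     * on "fault".  Needs a compiler that supports 'asm goto' with
     * output operands, which is presumably what "when available" in
     * the get_user() commit above is about.
     */
    static int read_word(const unsigned long *p, unsigned long *out)
    {
            unsigned long v;

            asm goto("cbz   %[ptr], %l[efault]\n\t"
                     "ldr   %[val], [%[ptr]]"
                     : [val] "=r" (v)
                     : [ptr] "r" (p)
                     :
                     : efault);
            *out = v;
            return 0;               /* fast path: no error-code test at all */
    efault:
            return -14;             /* -EFAULT */
    }

    int main(void)
    {
            unsigned long x = 42, got = 0;

            printf("ok:    %d (val %lu)\n", read_word(&x, &got), got);
            printf("fault: %d\n", read_word(NULL, &got));
            return 0;
    }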

The link_path_walk branch is the one that changes the vfs layer the
most, but it's really mostly just a series of "fix calling conventions
of 'hash_name()' to be better".
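
To make the calling-convention point concrete, here's a toy
byte-at-a-time version (userspace, made-up names, toy hash - the real
hash_name() is word-at-a-time and uses the hashlen/fold_hash
machinery): the point is that one pass over the component now produces
the hash, the length and the '.'/'..' classification together, instead
of the callers re-deriving bits of it afterwards:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Length in the high 32 bits, hash in the low 32 - same packing as
     * the kernel's hashlen_create().  LAST_NORM/LAST_DOT/LAST_DOTDOT
     * mirror the kernel's component classification; everything else
     * here is made up for the sketch.
     */
    #define HASHLEN(hash, len)      ((uint64_t)(len) << 32 | (uint32_t)(hash))

    enum { LAST_NORM, LAST_DOT, LAST_DOTDOT };

    static uint64_t hash_component(const char *name, int *type)
    {
            uint32_t hash = 0;
            unsigned int len = 0;
            unsigned char c;

            while ((c = name[len]) != '\0' && c != '/') {
                    hash = (hash + c) * 0x9e3779b9;     /* toy mixing step */
                    len++;
            }

            /* '.' and '..' detection folded into the same pass */
            if (len == 1 && name[0] == '.')
                    *type = LAST_DOT;
            else if (len == 2 && name[0] == '.' && name[1] == '.')
                    *type = LAST_DOTDOT;
            else
                    *type = LAST_NORM;

            return HASHLEN(hash, len);
    }

    int main(void)
    {
            int type;
            uint64_t hl = hash_component("usr/bin", &type);

            printf("len=%u hash=%#x type=%d\n",
                   (unsigned)(hl >> 32), (unsigned)hl, type);
            return 0;
    }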

The runtime-constants thing most people have already seen; it just
makes d_hash() avoid all indirect memory accesses.
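
Concretely: d_hash() used to load both the table pointer and the shift
count from memory on every call.  Stand-alone illustration below
(made-up names, and it fakes the runtime constant with a compile-time
one, since the interesting part - patching the real values into the
instruction stream at boot - is exactly what the branch adds the
machinery for):

    #include <stdio.h>

    struct bucket { void *first; };         /* stand-in for hlist_bl_head */

    static struct bucket table[1 << 8];
    static struct bucket *table_ptr = table;        /* set up at "boot" */
    static unsigned int table_shift = 24;           /* set up at "boot" */

    /* Before: two data loads (pointer and shift) on every call. */
    static struct bucket *d_hash_before(unsigned int hash)
    {
            return table_ptr + (hash >> table_shift);
    }

    /*
     * What the runtime-constant version boils down to, shown here with
     * compile-time constants: the pointer and the shift are immediates
     * in the code, so computing the bucket does no data loads at all.
     * The kernel can't literally do this because the values are only
     * known at boot - hence the patching infrastructure in the branch.
     */
    static struct bucket *d_hash_after(unsigned int hash)
    {
            return table + (hash >> 24);
    }

    int main(void)
    {
            unsigned int h = 0xdeadbeef;

            printf("%td %td\n", d_hash_before(h) - table, d_hash_after(h) - table);
            return 0;
    }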

And word-at-a-time just fixes code generation for both arm64 and
x86-64 to use better sequences.
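
For reference, the underlying word-at-a-time trick itself, as a
little-endian userspace sketch with made-up names (the kernel's
per-arch versions live in <asm/word-at-a-time.h>; the branch only
improves the instruction sequences generated for this kind of
calculation, it doesn't change the algorithm):

    #include <stdint.h>
    #include <stdio.h>

    #define ONES    0x0101010101010101ULL
    #define HIGHS   0x8080808080808080ULL

    /*
     * Nonzero iff 'word' contains a zero byte; on little-endian the
     * lowest set bit belongs to the first zero byte.
     */
    static uint64_t has_zero(uint64_t word)
    {
            return (word - ONES) & ~word & HIGHS;
    }

    /* Number of bytes before the first zero byte. */
    static unsigned int find_zero(uint64_t mask)
    {
            return __builtin_ctzll(mask) >> 3;      /* bit index -> byte index */
    }

    int main(void)
    {
            uint64_t w = (uint64_t)'c' << 16 | 'b' << 8 | 'a';  /* "abc\0..." */
            uint64_t m = has_zero(w);

            printf("first zero byte at offset %u\n", m ? find_zero(m) : 8);
            return 0;
    }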

None of this should be a huge deal, but together they make the
profiles for __d_lookup_rcu(), link_path_walk() and
strncpy_from_user() look pretty much optimal.

And by "optimal" I mean "within the confines of what they do".

For example, making d_hash() avoid indirection just means that now
pretty much _all_ the cost of __d_lookup_rcu() is in the cache misses
on the hash table itself. Which was always the bulk of it. And on my
arm64 machine, it turns out that the best optimization for the load I
tested would be to make that hash table smaller to actually be a bit
denser in the cache. But that's such a load-dependent optimization
that I'm not doing this.
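
(To put purely illustrative numbers on that - this is sizing, not a
measurement: the dentry hash table is scaled off total memory at boot,
so a big machine easily ends up with millions of buckets.  2^23
hlist_bl_head buckets at 8 bytes each is already 64MB of bucket heads
alone, far beyond any cache, so a random lookup pretty much always
takes a miss there; a smaller table trades slightly longer hash chains
for bucket heads that can actually stay cache-hot.)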

Tuning the hash table size or data structure cacheline layouts might
be worthwhile - and likely a bigger deal - but is _not_ what these
patches are about.

           Linus
