Message-ID: <CAGudoHFLYVXTJg5332fOSBVT+zgzhU3s-nvwzZHPCpaOY6gR-g@mail.gmail.com>
Date: Wed, 26 Nov 2025 12:31:21 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Christian Brauner <brauner@...nel.org>
Cc: viro@...iv.linux.org.uk, jack@...e.cz, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH] fs: mark lookup_slow() as noinline
On Wed, Nov 26, 2025 at 11:08 AM Christian Brauner <brauner@...nel.org> wrote:
>
> On Tue, Nov 25, 2025 at 10:54:25AM +0100, Mateusz Guzik wrote:
> > I'm going to save a rant about benchmarking changes like these in the
> > current kernel for another day.
>
> Without knocking any tooling that we currently have but I don't think we
> have _meaningful_ performance testing - especially not automated.
so *this* is the day? ;)
Even if one were to pretend for a minute that an excellent benchmark
suite for the vfs existed and was being used here, it would still fail
to spot numerous pessimizations.
To give you an example, legitimize_mnt contains an smp_mb fence which
makes it visible when profiling things like access(2) or stat(2)
(something around 2% in my profiles). However, if one were to whack the
fence just to check whether it is worth writing a real patch, access(2)
perf would increase a little bit while stat(2) would remain virtually
the same. I am not speculating here, I did it. stat for me is just shy
of 4 million ops/s. Patching the kernel with a tunable to optionally
skip the smp_mb fence pushes legitimize_mnt way down the profile while
*not* increasing performance -- the win is eaten by stalls elsewhere
(perf *does* increase for access(2), which is less shafted). This is
why the path walking benches I posted are all lifted from access()
usage as opposed to stat, btw.
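For reference, the tunable hack was conceptually along these lines -- a
minimal sketch only, with a made-up static key name, not the actual
patch:

  /* hypothetical debug knob, purely to measure the cost of the fence */
  static DEFINE_STATIC_KEY_FALSE(skip_legitimize_mb);

  static inline void legitimize_mnt_fence(void)
  {
          /* skip the barrier when the knob is flipped */
          if (!static_branch_unlikely(&skip_legitimize_mb))
                  smp_mb();
  }

... with the smp_mb() call in __legitimize_mnt replaced by the helper.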
Or to put it differently, stat(2) is already gimped and you can keep
adding slowdowns without measurably altering anything, but that's only
because the CPU is already stalled big time while executing the
codepath.
Part of the systemic problem is the pesky 'rep movsq/rep stosq' usage
by gcc, notably emitted for stat (see vfs_getattr_nosec). It is my
understanding that future versions of the compiler will fix it, but
that's still years of damage which will stay around even if someone
updates the kernel in their distro, so that's "nice". The good news is
that clang does not do it, but it also optimizes things differently in
other respects, so it may not even be representative of what people
will see with gcc.
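To illustrate -- a toy userspace reproducer, not kernel code; in my
understanding gcc at -O2 with kernel-ish flags (no SSE/AVX) tends to
turn copies of this size into rep movsq, while clang emits plain mov
sequences:

  /* compare: gcc -O2 -mno-sse -mno-avx -S copy.c
     vs:      clang -O2 -mno-sse -mno-avx -S copy.c */
  struct big {
          unsigned long f[18];    /* roughly kstat-sized */
  };

  void copy_big(struct big *dst, const struct big *src)
  {
          *dst = *src;
  }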
Rant on that front aside, I don't know what would constitute a good
test suite.
I am however confident it would include real-life usage lifted from
actual workloads for microbenchmarking purposes, like for example what
I did with access() vs gcc. A better quality bench for path lookup
would involve all the syscalls invoked by gcc which do lookups, but per
the above the current state of the kernel would downplay improvements
to next to nothing.
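As a sketch of what "lifted from actual workloads" means here -- a toy
with made-up placeholder paths; the real thing replays a trace captured
from an actual gcc invocation (e.g. with strace -e trace=access):

  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  static const char *paths[] = {
          "/usr/include/stdio.h",         /* placeholders for the */
          "/usr/lib/libc.so.6",           /* captured trace */
  };

  int main(void)
  {
          struct timespec a, b;
          long n = 0;

          clock_gettime(CLOCK_MONOTONIC, &a);
          for (int i = 0; i < 1000000; i++)
                  for (unsigned j = 0; j < sizeof(paths) / sizeof(paths[0]); j++, n++)
                          access(paths[j], R_OK);
          clock_gettime(CLOCK_MONOTONIC, &b);

          printf("%.0f ops/s\n", n / ((b.tv_sec - a.tv_sec) +
                 (b.tv_nsec - a.tv_nsec) / 1e9));
          return 0;
  }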
Inspired by this little thing: https://lkml.org/lkml/2015/5/19/1009
... I was screwing around with going through *all* vfs syscalls,
ordered in a way which provides the biggest data and instruction cache
busting potential. Non-vfs code is specifically not called, so as not
to be shafted by slowdowns elsewhere. It's not ready, but definitely
worth exploring; a rough sketch of the shape is below.
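Something along these lines -- a toy sketch; the interesting part is
picking the iteration order for maximum cache busting, which is not
shown:

  #include <fcntl.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* one exerciser per vfs syscall; walk the table in an order chosen
   * to maximize i-cache/d-cache misses between consecutive entries */
  static void do_open(void)   { int fd = open("/tmp", O_RDONLY); if (fd >= 0) close(fd); }
  static void do_stat(void)   { struct stat st; stat("/tmp", &st); }
  static void do_access(void) { access("/tmp", R_OK); }

  static void (*table[])(void) = { do_open, do_stat, do_access /* , ... */ };

  int main(void)
  {
          for (long i = 0; i < 10000000; i++)
                  table[i % (sizeof(table) / sizeof(table[0]))]();
          return 0;
  }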
I know there are some big bench suites out there (AIM?) but they look
weirdly unmaintained and I never verified whether they do what they
claim. The microbenchmarks like will-it-scale are missing syscall
coverage (for example: no readlink or symlink), the syscalls which are
covered have spotty usage (for example there is a bench for parallel rw
open of a file, while opening *ro* is more common and has different
scalability), and even ignoring that, all the lookups are done against
/tmp/willitscale.XXXXXX. That's not representative of most real lookups
in that there are few path components *and* one of them is unusually
long.
and so on.
That rant also aside:
1. concerning legitimize_mnt: I strongly suspect the fence can be
avoided by guaranteeing that clearing ->mnt_ns waits for the rcu grace
period before issuing mntput. The question is how painful it would be
to implement (see the first sketch after this list).
2. concerning stat: the current code boils down to going to statx and
telling it not to fill some of the fields, fetching some fields stat is
not going to look at anyway, and finally converting the result to the
userspace-compatible layout. The last bit is universal across unix
kernels afaics, curious how that happened. Anyway, my idea here is to
instead implement a ->stat inode op which would fill in 'struct stat'
(not kstat!), avoiding most of the current work. There is the obvious
concern of code duplication, which I think I can cover in an acceptable
manner by implementing generic helpers for fields the filesystem does
not want to mess with on its own (see the second sketch below).
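For point 1, the idea in pseudo-form -- a sketch only; whether the
teardown path can tolerate a grace period wait there is exactly the
painful part:

  /* teardown side: make lockless walkers observe ->mnt_ns == NULL
   * a full grace period before the ref is dropped, so legitimize_mnt
   * could rely on rcu instead of the smp_mb pairing */
  WRITE_ONCE(mnt->mnt_ns, NULL);
  synchronize_rcu();      /* or a call_rcu-deferred mntput */
  mntput(&mnt->mnt);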
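For point 2, the rough shape -- the op name and helper are made up for
illustration, not an actual proposal of naming:

  /* proposed op: fill the userspace layout directly, skipping kstat */
  struct inode_operations {
          /* ... existing ops ... */
          int (*stat)(const struct path *path, struct stat *buf);
  };

  /* generic helper for fields the fs does not want to handle itself */
  int generic_fillstat(struct inode *inode, struct stat *buf);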
That legitimize_mnt thing has been annoying me for a long time now. I
did not post any patches as the namespace code is barely readable to me
and I'm trying not to dig into it.