Message-ID: <CAGudoHGZVBw3h_pHDaaSMeDgf3q_qn4wmkfOoG6y-CKN9sZLVQ@mail.gmail.com>
Date: Tue, 6 Aug 2024 22:03:46 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Jeff Layton <jlayton@...nel.org>
Cc: Andi Kleen <ak@...ux.intel.com>, Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>, Josef Bacik <josef@...icpanda.com>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
On Tue, Aug 6, 2024 at 9:26 PM Jeff Layton <jlayton@...nel.org> wrote:
>
> On Tue, 2024-08-06 at 12:11 -0700, Andi Kleen wrote:
> > Mateusz Guzik <mjguzik@...il.com> writes:
> > >
> > > I would bench with that myself, but I temporarily don't have handy
> > > access to bigger hw. Even so, the below is completely optional and
> > > perhaps more of a suggestion for the future :)
> > >
> > > I hacked up the test case based on tests/open1.c.
> >
> > Don't you need two test cases? One where the file exists and one
> > where it doesn't. The "doesn't exist" case will likely be slower
> > than before because it will do the lookups twice,
> > and it will likely even be slower single-threaded.
> >
> > I assume the penalty will also depend on the number of entries
> > in the path.
> >
> > Those all seem to be important considerations in judging the benefits
> > of the patch.
> >
>
> Definitely.
>
> FWIW, I did test a single-threaded (bespoke) test case that did a bunch
> of O_CREAT opens, closes, and then unlinks. I didn't measure any
> discernible difference with this patch. My conclusion from that was
> that the extra lockless lookup should be cheap.
>
> That said, this could show a difference if you have rather long hash
> chains that need to be walked completely, and you have to actually do
> the create every time. In practice though, time spent under the
> inode_lock and doing the create tends to dominate in that case, so I
> *think* this should still be worthwhile.
>
> I'll plan to add a test like that to will_it_scale unless Mateusz beats
> me to it. A long soak in linux-next is probably also justified with
> this patch.

Well, if we are really going there, I have to point out a few things
concerning single-threaded performance from entering the kernel to
getting back with a file descriptor.

The headline is that there is a lot of single-threaded perf left on
the table, and unless a change costs a significant number of cycles it
is going to be hard to measure a meaningful result.

Before I get to the vfs layer: there is a significant loss in the
memory allocator because of memcg -- it takes several irq off/on trips
for every alloc (needed to grab a struct file). I have a plan for what
to do about it (handle the common case with a cpu-local cmpxchg; note
no lock prefix), which I'm trying to get around to; see the sketch
below. Apart from that, you may note the allocator fast path performs
a 16-byte cmpxchg, which is again dog slow and executes twice (once
for the file object, another time for the namei buffer). Someone(tm)
should patch it up; I have some vague ideas, but no idea when I can
take a serious stab at it.
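
A minimal sketch of the local cmpxchg idea (names are made up, this is
not the actual memcg code):

#include <linux/percpu.h>
#include <linux/types.h>

/*
 * Hypothetical per-cpu fast path: consume pre-charged bytes with a
 * cpu-local cmpxchg. this_cpu_cmpxchg() needs no lock prefix and no
 * irq off/on trip since only this cpu updates the counter; remote
 * draining is hand-waved away here and would need extra care.
 */
static DEFINE_PER_CPU(unsigned long, memcg_precharged_bytes);

static bool memcg_charge_fast(unsigned long bytes)
{
	unsigned long old, new;

	do {
		old = this_cpu_read(memcg_precharged_bytes);
		if (old < bytes)
			return false;	/* take the slow path */
		new = old - bytes;
	} while (this_cpu_cmpxchg(memcg_precharged_bytes,
				  old, new) != old);

	return true;
}
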
As for vfs, just from the atomics perspective:
- opening a file results in a spurious ref/unref cycle on the dentry.
I posted a patch [1] to sort it out (maybe Al will write his own
instead); single-threaded it gives about +5% to the open/close rate
- there is an avoidable smp_mb() in do_dentry_open(); I posted a patch
[2] and it got acked. I did not bother benchmarking it
- someone sneaked atomic refcount management into "struct filename" to
appease io_uring vs audit. This adds another atomic to every single
path lookup; I have an idea for what to do there
- opening for write (which you do with O_CREAT) calls
mnt_get_write_access(), which contains another smp_mb(). This can
probably be sorted out the same way percpu semaphores were taken care
of
- __legitimize_mnt() has yet another smp_mb() to synchronize against
unmount. This can *probably* be sorted out with synchronize_rcu(), but
some care is needed to not induce massive delays on unmount
- credential management is also done with atomics (so that's a get +
put for the open/close cycle). Elsewhere I implemented a scheme where
the reference count is cached in the thread itself and only rolled up
on credential change; this whacks the problem and is something I'm
planning on proposing at some point (rough shape sketched after this
list)
...and more.
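
The cred scheme, schematically (struct layout and helper names are
invented for illustration, not actual kernel API):

#include <linux/atomic.h>

/*
 * Sketch: each thread keeps a local reference count on its own cred.
 * get/put from the owning thread are plain non-atomic ops; only a
 * credential change folds the cached count into the shared atomic
 * refcount.
 */
struct cred_cache {
	struct cred *cred;
	long local_refs;	/* touched by the owning thread only */
};

static struct cred *get_cred_cached(struct cred_cache *cc)
{
	cc->local_refs++;	/* no lock prefix, no cacheline bouncing */
	return cc->cred;
}

static void put_cred_cached(struct cred_cache *cc)
{
	cc->local_refs--;
}

/* on credential change: roll the cache up into the shared count */
static void cred_cache_rollup(struct cred_cache *cc, atomic_long_t *usage)
{
	if (cc->local_refs)
		atomic_long_add(cc->local_refs, usage);
	cc->local_refs = 0;
}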

That is to say, I completely agree this does add extra work, but the
kernel is not in a shape where the true single-threaded impact can be
sensibly measured. I do find it prudent to validate that nothing
happened to somehow floor scalability of the case which does need to
create things, though.
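
Something along these lines (standalone variant; a will_it_scale
testcase would wrap the same loop in its harness):

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hammer the O_CREAT path where the file never pre-exists: the unlink
 * guarantees the opportunistic lookup misses and every iteration takes
 * the actual create path.
 */
int main(void)
{
	const char *name = "/tmp/o_creat_bench";

	for (unsigned long i = 0; i < (1UL << 24); i++) {
		int fd = open(name, O_CREAT | O_RDWR, 0600);

		if (fd < 0)
			err(1, "open");
		close(fd);
		if (unlink(name))
			err(1, "unlink");
	}
	return 0;
}
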
[1] https://lore.kernel.org/linux-fsdevel/20240806163256.882140-1-mjguzik@gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20240806172846.886570-1-mjguzik@gmail.com/
--
Mateusz Guzik <mjguzik gmail.com>