[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com>
Date: Sun, 21 Jan 2024 21:10:09 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Shakeel Butt <shakeelb@...gle.com>
Cc: Roman Gushchin <roman.gushchin@...ux.dev>, Josh Poimboeuf <jpoimboe@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>, Jeff Layton <jlayton@...nel.org>,
Chuck Lever <chuck.lever@...cle.com>, Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>, linux-kernel@...r.kernel.org,
Jens Axboe <axboe@...nel.dk>, Tejun Heo <tj@...nel.org>, Vasily Averin <vasily.averin@...ux.dev>,
Michal Koutny <mkoutny@...e.com>, Waiman Long <longman@...hat.com>,
Muchun Song <muchun.song@...ux.dev>, Jiri Kosina <jikos@...nel.org>, cgroups@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH RFC 1/4] fs/locks: Fix file lock cache accounting, again
On Wed, 17 Jan 2024 at 14:56, Shakeel Butt <shakeelb@...gle.com> wrote:
> >
> > So I don't see how we can make it really cheap (say, less than 5% overhead)
> > without caching pre-accounted objects.
>
> Maybe this is what we want. Now we are down to just SLUB, maybe such
> caching of pre-accounted objects can be in SLUB layer and we can
> decide to keep this caching per-kmem-cache opt-in or always on.
So it turns out that we have another case of SLAB_ACCOUNT being quite
a big expense, and it's actually the normal - but failed - open() or
execve() case.
See the thread at
https://lore.kernel.org/all/CAHk-=whw936qzDLBQdUz-He5WK_0fRSWwKAjtbVsMGfX70Nf_Q@mail.gmail.com/
and to see the effect in profiles, you can use this EXTREMELY stupid
test program:
#include <fcntl.h>
int main(int argc, char **argv)
{
for (int i = 0; i < 10000000; i++)
open("nonexistent", O_RDONLY);
}
where the point of course is that the "nonexistent" pathname doesn't
actually exist (so don't create a file called that for the test).
What happens is that open() allocates a 'struct file *' early from the
filp kmem_cache, which has SLAB_ACCOUNT set. So we'll do accounting
for it, failt the pathname open, and free it again, which uncharges
the accounting.
Now, in this case, I actually have a suggestion: could we please just
make SLAB_ACCOUNT be something that we do *after* the allocation, kind
of the same way the zeroing works?
IOW, I'd love to get rid of slab_pre_alloc_hook() entirely, and make
slab_post_alloc_hook() do all the "charge the memcg if required".
Obviously that means that now a failure to charge the memcg would have
to then de-allocate things, but that's an uncommon path and would be
marked unlikely and not be in the hot path at all.
Now, the reason I would prefer that is that the *second* step would be to
(a) expose a "kmem_cache_charge()" function that takes a
*non*-accounted slab allocation, and turns it into an accounted one
(and obviously this is why you want to do everything in the post-alloc
hook: just try to share this code)
(b) remote the SLAB_ACCOUNT from the filp_cachep, making all file
allocations start out unaccounted.
(c) when we have *actually* looked up the pathname and open the file
successfully, at *that* point we'd do a
error = kmem_cache_charge(filp_cachep, file);
in do_dentry_open() to turn the unaccounted file pointer into an
accounted one (and if that fails, we do the cleanup and free it, of
course, exactly like we do when file_get_write_access() fails)
which means that now the failure case doesn't unnecessarily charge the
allocation that never ends up being finalized.
NOTE! I think this would clean up mm/slub.c too, simply because it
would get rid of that memcg_slab_pre_alloc_hook() entirely, and get
rid of the need to carry the "struct obj_cgroup **objcgp" pointer
along until the post-alloc hook: everything would be done post-alloc.
The actual kmem_cache_free() code already deals with "this slab hasn't
been accounted" because it obviously has to deal with allocations that
were done without __GFP_ACCOUNT anyway. So there's no change needed on
the freeing path, it already has to handle this all gracefully.
I may be missing something, but it would seem to have very little
downside, and fix a case that actually is visible right now.
Linus
Powered by blists - more mailing lists