Message-ID: <20240710091631.GT27299@noisy.programming.kicks-ass.net>
Date: Wed, 10 Jul 2024 11:16:31 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Matthew Wilcox <willy@...radead.org>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
Andrii Nakryiko <andrii.nakryiko@...il.com>,
Masami Hiramatsu <mhiramat@...nel.org>, mingo@...nel.org,
andrii@...nel.org, linux-kernel@...r.kernel.org,
rostedt@...dmis.org, oleg@...hat.com, jolsa@...nel.org,
clm@...a.com, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH 00/10] perf/uprobe: Optimize uprobes
On Tue, Jul 09, 2024 at 05:10:44PM +0100, Matthew Wilcox wrote:
> Probably best to start with lock_vma_under_rcu() in mm/memory.c.
So close and yet so far :/
> > > > Specifically, how feasible would it be to get a simple RCU based
> > > > find_vma() version sorted these days?
> > >
> > > Liam's and Willy's Maple Tree work, combined with Suren's per-VMA locking
> > > combined with some of Vlastimil's slab work is pushing in that direction.
> > > I believe that things are getting pretty close.
> >
> > So I fundamentally do not believe in per-VMA locking. Specifically for
> > this case that would be trading one hot line for another. I tried
> > telling people that, but it doesn't seem to stick :/
>
> SRCU also had its own performance problems, so we've got problems one
> way or the other. The per-VMA lock probably doesn't work quite the way
> you think it does, but it absolutely can be a hot cacheline.
>
> I did propose a store-free variant at LSFMM 2022 and again at 2023,
> but was voted down. https://lwn.net/Articles/932298/
That seems to be lacking much detail. Anyway, page-tables are sorta RCU
freed -- per GUP-fast requirements. Making them actually RCU freed isn't
hard, it just increases the delay.
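For reference, the GUP-fast trick is roughly the below (hand-waved
sketch, not the actual mm/gup.c code; walk_pte() is made up):

	unsigned long flags;
	pte_t pte;
	int ret;

	/*
	 * Lockless walk: with IRQs disabled we either block the
	 * TLB-flush IPI that precedes freeing a page-table page, or
	 * (with MMU_GATHER_RCU_TABLE_FREE) we sit in an implicit RCU
	 * read-side section the freeing must wait out. Either way
	 * the tables stay around; no mmap_lock anywhere.
	 */
	local_irq_save(flags);
	ret = walk_pte(mm, vaddr, &pte);	/* made-up helper */
	local_irq_restore(flags);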
> I don't think the door is completely closed to a migration to that,
> but it's a harder sell than what we've got. Of course, data helps ...
Current 'problem' is a heavily multi-threaded application using uprobes.
The executable is typically one VMA (one per DSO) and all the probes
live in that one VMA. Then all the CPUs concurrently hit probes, and
they all hammer that same VMA in order to translate
{mm,vaddr}->{inode,offset}.
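That translation is more or less the below (simplified; the actual
uprobe code does more):

	struct vm_area_struct *vma;
	struct inode *inode = NULL;
	loff_t offset;

	mmap_read_lock(mm);		/* <-- the hot line */
	vma = find_vma(mm, vaddr);
	if (vma && vma->vm_file && vaddr >= vma->vm_start) {
		inode  = file_inode(vma->vm_file);
		offset = ((loff_t)vma->vm_pgoff << PAGE_SHIFT) +
			 (vaddr - vma->vm_start);
	}
	mmap_read_unlock(mm);

Every probe hit does that, from every CPU, against the same mm.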
After fixing a ton of uprobe things, Andrii is seeing 45% of CPU time
spent in mmap_rwsem:
https://lkml.kernel.org/r/CAEf4BzY6tXrDGkW6mkxCY551pZa1G+Sgxeuex==nvHUEp9ynpg@mail.gmail.com
Given it's all typically one VMA, anything per VMA is not going to help
much.
Anyway, back to this PER_VMA_LOCK code; it took me a fair while to
digest because it lacks a coherent comment. It took me extra time
because I assumed (silly me) that _seq would mean an actual sequence
count.
If it were an actual sequence count, I could make it work, but sadly
it's not. Also, vma_end_write() seems to be missing :-( If anything, it
could be used to lockdep-annotate the thing.
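With a real seqcount, the read side could be the classic retry loop,
something like (invented vm_seqcount name; also assumes the VMA itself
is RCU freed, which it isn't quite yet):

	unsigned int seq;

	rcu_read_lock();
	do {
		/* spins while odd, i.e. while a writer is inside */
		seq = read_seqcount_begin(&vma->vm_seqcount);
		inode  = file_inode(vma->vm_file);
		offset = ((loff_t)vma->vm_pgoff << PAGE_SHIFT) +
			 (vaddr - vma->vm_start);
	} while (read_seqcount_retry(&vma->vm_seqcount, seq));
	rcu_read_unlock();

with vma_start_write()/vma_end_write() bumping the count around every
modification.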
Mooo.. I need to stare more at this to see if perhaps it can be made to
work, but so far, no joy :/