Message-ID: <CAJuCfpFqEjG7HCx1F=Q3fScYAhaAou0Un2SFpibimkxZr7Jsbw@mail.gmail.com>
Date: Tue, 6 Aug 2024 18:36:06 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, rostedt@...dmis.org,
Matthew Wilcox <willy@...radead.org>, "Paul E. McKenney" <paulmck@...nel.org>,
Masami Hiramatsu <mhiramat@...nel.org>, mingo@...nel.org, andrii@...nel.org,
linux-kernel@...r.kernel.org, oleg@...hat.com, jolsa@...nel.org, clm@...a.com,
bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH 00/10] perf/uprobe: Optimize uprobes
On Mon, Aug 5, 2024 at 9:08 PM Andrii Nakryiko
<andrii.nakryiko@...il.com> wrote:
>
> On Sun, Aug 4, 2024 at 4:22 PM Andrii Nakryiko
> <andrii.nakryiko@...il.com> wrote:
> >
> > On Sat, Aug 3, 2024 at 1:53 AM Peter Zijlstra <peterz@...radead.org> wrote:
> > >
> > > On Fri, Aug 02, 2024 at 10:47:15PM -0700, Andrii Nakryiko wrote:
> > >
> > > > Is there any reason why the approach below won't work?
> > >
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index 8be9e34e786a..e21b68a39f13 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -2251,6 +2251,52 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> > > > struct uprobe *uprobe = NULL;
> > > > struct vm_area_struct *vma;
> > > >
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > + vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE, vm_flags;
> > > > + struct file *vm_file;
> > > > + struct inode *vm_inode;
> > > > + unsigned long vm_pgoff, vm_start, vm_end;
> > > > + int vm_lock_seq;
> > > > + loff_t offset;
> > > > +
> > > > + rcu_read_lock();
> > > > +
> > > > + vma = vma_lookup(mm, bp_vaddr);
> > > > + if (!vma)
> > > > + goto retry_with_lock;
> > > > +
> > > > + vm_lock_seq = READ_ONCE(vma->vm_lock_seq);
> > >
> > > So vma->vm_lock_seq is only updated on vma_start_write()
> >
> > yep, I've looked a bit more at the implementation now
> >
> > >
> > > > +
> > > > + vm_file = READ_ONCE(vma->vm_file);
> > > > + vm_flags = READ_ONCE(vma->vm_flags);
> > > > + if (!vm_file || (vm_flags & flags) != VM_MAYEXEC)
> > > > + goto retry_with_lock;
> > > > +
> > > > + vm_inode = READ_ONCE(vm_file->f_inode);
> > > > + vm_pgoff = READ_ONCE(vma->vm_pgoff);
> > > > + vm_start = READ_ONCE(vma->vm_start);
> > > > + vm_end = READ_ONCE(vma->vm_end);
> > >
> > > None of those are written with WRITE_ONCE(), so this buys you nothing.
> > > Compiler could be updating them one byte at a time while you load some
> > > franken-update.
> > >
> > > Also, if you're in the middle of split_vma() you might not get a
> > > consistent set.
> >
> > I used READ_ONCE() only to prevent the compiler from re-reading those
> > values. We assume those values are garbage anyways and double-check
> > everything, so lack of WRITE_ONCE doesn't matter. Same for
> > inconsistency if we are in the middle of split_vma().
> >
> > We use the result of all this speculative calculation only if we find
> > a valid uprobe (which could be a false positive) *and* if we detect
> > that nothing about VMA changed (which is what I got wrong, but
> > honestly I was actually betting on others to help me get this right
> > anyways).
> >
> > >
> > > > + if (bp_vaddr < vm_start || bp_vaddr >= vm_end)
> > > > + goto retry_with_lock;
> > > > +
> > > > + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > + uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > + if (!uprobe)
> > > > + goto retry_with_lock;
> > > > +
> > > > + /* now double check that nothing about VMA changed */
> > > > + if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq))
> > > > + goto retry_with_lock;
> > >
> > > Since vma->vm_lock_seq is only ever updated at vma_start_write() you're
> > > checking you're in or after the same modification cycle.
> > >
> > > The point of sequence locks is to check you're *IN* a modification cycle
> > > and retry if you are. You're now explicitly continuing if you're in a
> > > modification.
> > >
> > > You really need:
> > >
> > >	seq++;
> > >	wmb();
> > >
> > >	... do modification
> > >
> > >	wmb();
> > >	seq++;
> > >
> > > vs
> > >
> > >	do {
> > >		s = READ_ONCE(seq) & ~1;
> > >		rmb();
> > >
> > >		... read stuff
> > >
> > >	} while (rmb(), seq != s);
> > >
> > >
> > > The thing to note is that seq will be odd while inside a modification
> > > and even outside, further if the pre and post seq are both even but not
> > > identical, you've crossed a modification and also need to retry.
> > >
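For reference, this odd/even protocol is what the generic seqcount API in
<linux/seqlock.h> implements; a minimal sketch of the same pattern, for
illustration only:

	seqcount_t seq;

	/* writer: count becomes odd while the modification is in flight */
	write_seqcount_begin(&seq);
	/* ... do modification ... */
	write_seqcount_end(&seq);

	/* reader: retry if a modification was in flight or crossed us */
	unsigned int s;
	do {
		s = read_seqcount_begin(&seq); /* waits for an even count, implies rmb */
		/* ... read stuff ... */
	} while (read_seqcount_retry(&seq, s)); /* re-checks the count after the reads */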
> >
> > Ok, I don't think I got everything you have written above, sorry. But
> > let me explain what I think I need to do and please correct what I
> > (still) got wrong.
> >
> > a) before starting speculation,
> > a.1) read and remember current->mm->mm_lock_seq (using
> > smp_load_acquire(), right?)
> > a.2) read vma->vm_lock_seq (using smp_load_acquire() I presume)
> > a.3) if vm_lock_seq is odd, we are already modifying VMA, so bail
> > out, try with proper mmap_lock
> > b) proceed with the inode pointer fetch and offset calculation as I've coded it
> > c) lookup uprobe by inode+offset, if failed -- bail out (if succeeded,
> > this could still be wrong)
> > d) re-read vma->vm_lock_seq, if it changed, we started modifying/have
> > already modified VMA, bail out
> > e) re-read mm->mm_lock_seq, if that changed -- presume VMA got
> > modified, bail out
> >
> > At this point we should have a guarantee that nothing about mm
> > changed, nor that VMA started being modified during our speculative
> > calculation+uprobe lookup. So if we found a valid uprobe, it must be a
> > correct one that we need.
> >
> > Is that enough? Any holes in the approach? And thanks for thoroughly
> > thinking about this, btw!
>
> Ok, with slight modifications to the details of the above (e.g., there
> is actually no "odd means VMA is being modified" thing with
> vm_lock_seq),
Correct. Instead, (vma->vm_lock_seq == mm->mm_lock_seq) means your VMA
is write-locked and is being modified.
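For context, that is what vma_start_write() does; simplified from
include/linux/mm.h (the early "already write-locked" check and lockdep
details are elided):

	static inline void vma_start_write(struct vm_area_struct *vma)
	{
		/* mmap_lock is held for write here */
		int mm_lock_seq = vma->vm_mm->mm_lock_seq;

		down_write(&vma->vm_lock->lock);
		/* marks the VMA write-locked until the next vma_end_write_all() */
		WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
		up_write(&vma->vm_lock->lock);
	}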
> I ended up with the implementation below. Basically we
> validate that mm->mm_lock_seq didn't change and that vm_lock_seq !=
> mm_lock_seq (which otherwise would mean "VMA is being modified").
Validating that mm->mm_lock_seq did not change does not provide you
with much useful information. A change there only means that between
the point where you recorded mm->mm_lock_seq and the point where you
are checking it, there was an mmap_write_unlock() or
mmap_write_downgrade() call. Your VMA might not even have been part of
the modification for which mmap_lock was taken.
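For completeness, the increment happens in vma_end_write_all(), which
unlocks every write-locked VMA of the mm at once; roughly, from
include/linux/mmap_lock.h (tracepoints elided):

	static inline void vma_end_write_all(struct mm_struct *mm)
	{
		mmap_assert_write_locked(mm);
		/* after this, no VMA's vm_lock_seq matches mm_lock_seq anymore */
		smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
	}

	static inline void mmap_write_unlock(struct mm_struct *mm)
	{
		vma_end_write_all(mm);
		up_write(&mm->mmap_lock);
	}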
In theory what you need is simpler (simplified code for explanation only):

	int vm_lock_seq = vma->vm_lock_seq;

	if (vm_lock_seq == mm->mm_lock_seq)
		goto bail_out; /* VMA is write-locked */

	/* copy required VMA attributes */

	if (vm_lock_seq != vma->vm_lock_seq)
		goto bail_out; /* VMA got write-locked */
But this would require proper ACQUIRE/RELEASE semantics for
vma->vm_lock_seq, which are currently not there: all reads/writes to
vma->vm_lock_seq that matter are done under vma->vm_lock->lock
protection, so no additional ordering is required today. If you decide
to add such semantics for vma->vm_lock_seq, please make sure that
pagefault path performance does not regress.
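To make that concrete, with such ACQUIRE/RELEASE semantics in place the
check above could look something like the sketch below. This is written
under that assumption only; it is not something the current vm_lock_seq
implementation guarantees:

	int vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);

	if (vm_lock_seq == READ_ONCE(mm->mm_lock_seq))
		goto bail_out; /* VMA is write-locked */

	/* copy required VMA attributes */

	smp_rmb(); /* order the attribute reads before the re-check */
	if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq))
		goto bail_out; /* VMA got write-locked meanwhile */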
> There is a possibility that vm_lock_seq == mm_lock_seq just by
> accident, which is not a correctness problem; we'll just fall back to
> the locked implementation until something about the VMA or mm_struct
> itself changes. Which is fine, and if mm folks ever change this
> locking schema, this might go away.
>
> If this seems on the right track, I think we can just move
> mm_start_vma_specuation()/mm_end_vma_speculation() into
> include/linux/mm.h.
>
> And after thinking a bit more about READ_ONCE() usage, I changed them
> to data_race() to not trigger KCSAN warnings. Initially I kept
> READ_ONCE() only around vma->vm_file access, but given we never change
> it until vma is freed and reused (which would be prevented by
> guard(rcu)), I dropped READ_ONCE() and only added data_race(). And
> even data_race() is probably not necessary.
>
> Anyways, please see the patch below. Would be nice if mm folks
> (Suren?) could confirm that this is not broken.
>
>
>
> Author: Andrii Nakryiko <andrii@...nel.org>
> Date: Fri Aug 2 22:16:40 2024 -0700
>
> uprobes: add speculative lockless VMA to inode resolution
>
> Signed-off-by: Andrii Nakryiko <andrii@...nel.org>
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 3de311c56d47..bee7a929ff02 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2244,6 +2244,70 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> return is_trap_insn(&opcode);
> }
>
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void mm_start_vma_speculation(struct mm_struct *mm, int *mm_lock_seq)
> +{
> + *mm_lock_seq = smp_load_acquire(&mm->mm_lock_seq);
> +}
> +
> +/* returns true if speculation was safe (no mm and vma modification happened) */
> +static inline bool mm_end_vma_speculation(struct vm_area_struct *vma, int mm_lock_seq)
> +{
> + int mm_seq, vma_seq;
> +
> + mm_seq = smp_load_acquire(&vma->vm_mm->mm_lock_seq);
> + vma_seq = READ_ONCE(vma->vm_lock_seq);
> +
> + return mm_seq == mm_lock_seq && vma_seq != mm_seq;
After spending some time on this I think what you do here is
semantically correct but sub-optimal.
This check means that there was no call to
mmap_write_unlock()/mmap_write_downgrade() since
mm_start_vma_speculation() and that the vma is not currently
write-locked. To unlock a write-locked VMA you do need to call
mmap_write_unlock()/mmap_write_downgrade(), so I think this check does
guarantee that your vma was not locked and modified from under you.
However, it will also trigger false positives if
mmap_write_unlock()/mmap_write_downgrade() was called but the vma you
are using was never locked. So, it will bail out more often than
necessary. Maybe that's ok?
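To illustrate the false positive with hypothetical sequence numbers
(vma A is never touched, yet we bail out):

	/*
	 * CPU 0 (speculating on vma A)    CPU 1 (unrelated op on vma B)
	 *
	 * mm_start_vma_speculation():
	 *   mm_lock_seq == 5
	 * ... read vma A attributes ...   mmap_write_lock(mm)
	 *                                 vma_start_write(vma B)
	 *                                 ... modify vma B ...
	 *                                 mmap_write_unlock(mm)
	 *                                   -> mm_lock_seq becomes 6
	 * mm_end_vma_speculation():
	 *   5 != 6 -> bail out
	 */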
> +}
> +
> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> + struct mm_struct *mm = current->mm;
> + struct uprobe *uprobe;
> + struct vm_area_struct *vma;
> + struct file *vm_file;
> + struct inode *vm_inode;
> + unsigned long vm_pgoff, vm_start;
> + int mm_lock_seq;
> + loff_t offset;
> +
> + guard(rcu)();
> +
> + mm_start_vma_speculation(mm, &mm_lock_seq);
> +
> + vma = vma_lookup(mm, bp_vaddr);
> + if (!vma)
> + return NULL;
> +
> + vm_file = data_race(vma->vm_file);
> + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> + return NULL;
> +
> + vm_inode = data_race(vm_file->f_inode);
> + vm_pgoff = data_race(vma->vm_pgoff);
> + vm_start = data_race(vma->vm_start);
> +
> + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> + uprobe = find_uprobe_rcu(vm_inode, offset);
> + if (!uprobe)
> + return NULL;
> +
> + /* now double check that nothing about MM and VMA changed */
> + if (!mm_end_vma_speculation(vma, mm_lock_seq))
> + return NULL;
> +
> + /* happy case, we speculated successfully */
> + return uprobe;
> +}
> +#else /* !CONFIG_PER_VMA_LOCK */
> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> + return NULL;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
> /* assumes being inside RCU protected region */
> static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
> {
> @@ -2251,6 +2315,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> struct uprobe *uprobe = NULL;
> struct vm_area_struct *vma;
>
> + uprobe = find_active_uprobe_speculative(bp_vaddr);
> + if (uprobe)
> + return uprobe;
> +
> mmap_read_lock(mm);
> vma = vma_lookup(mm, bp_vaddr);
> if (vma) {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cc760491f201..211a84ee92b4 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
> NULL);
> files_cachep = kmem_cache_create("files_cache",
> sizeof(struct files_struct), 0,
> - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
> NULL);
> fs_cachep = kmem_cache_create("fs_cache",
> sizeof(struct fs_struct), 0,
>
>
> >
> > P.S. This is basically the last big blocker towards linear uprobes
> > scalability with the number of active CPUs. I have
> > uretprobe+SRCU+timeout implemented and it seems to work fine, will
> > post soon-ish.
> >
> > P.P.S. Also, funny enough, below was another big scalability limiter
> > (and the last one) :) I'm not sure if we can just drop it, or whether
> > I should use a per-CPU counter, but with the below change and the
> > speculative VMA lookup (however buggy, it works ok for benchmarking),
> > I finally get linear scaling of uprobe triggering throughput with the
> > number of CPUs. We are very close.
> >
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index f7443e996b1b..64c2bc316a08 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1508,7 +1508,7 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> > int ret = 0;
> >
> > tu = container_of(con, struct trace_uprobe, consumer);
> > - tu->nhit++;
> > + //tu->nhit++;
> >
> > udd.tu = tu;
> > udd.bp_addr = instruction_pointer(regs);
> >
> >
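If the stat is worth keeping, a per-CPU counter would avoid the cross-CPU
cache-line bouncing. A sketch, assuming a hypothetical 'u64 __percpu *nhits'
field is added to struct trace_uprobe (allocated with alloc_percpu(u64) at
creation time and freed with free_percpu()):

	/* hot path: contention-free increment on the local CPU */
	this_cpu_inc(*tu->nhits);

	/* slow path: sum the counter only when the stat is actually read */
	static u64 trace_uprobe_nhit(struct trace_uprobe *tu)
	{
		u64 sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += *per_cpu_ptr(tu->nhits, cpu);
		return sum;
	}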
> > > > +
> > > > + /* happy case, we speculated successfully */
> > > > + rcu_read_unlock();
> > > > + return uprobe;
> > > > +
> > > > +retry_with_lock:
> > > > + rcu_read_unlock();
> > > > + uprobe = NULL;
> > > > +#endif
> > > > +
> > > > mmap_read_lock(mm);
> > > > vma = vma_lookup(mm, bp_vaddr);
> > > > if (vma) {
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index cc760491f201..211a84ee92b4 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
> > > > NULL);
> > > > files_cachep = kmem_cache_create("files_cache",
> > > > sizeof(struct files_struct), 0,
> > > > - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > > > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
> > > > NULL);
> > > > fs_cachep = kmem_cache_create("fs_cache",
> > > > sizeof(struct fs_struct), 0,