[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzavWOgCLQoNdmPyyqHcm7gY5USKU5f1JWfyaCbuc_zVAA@mail.gmail.com>
Date: Fri, 2 Aug 2024 22:47:15 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Matthew Wilcox <willy@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>, Masami Hiramatsu <mhiramat@...nel.org>, mingo@...nel.org,
andrii@...nel.org, linux-kernel@...r.kernel.org, rostedt@...dmis.org,
oleg@...hat.com, jolsa@...nel.org, clm@...a.com, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH 00/10] perf/uprobe: Optimize uprobes
On Tue, Jul 30, 2024 at 11:10 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
>
> On Tue, Jul 30, 2024 at 6:11 AM Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > On Sat, Jul 27, 2024 at 04:45:53AM +0100, Matthew Wilcox wrote:
> >
> > > Hum. What if we added SLAB_TYPESAFE_BY_RCU to files_cachep? That way
> > > we could do:
> > >
> > > inode = NULL;
> > > rcu_read_lock();
> > > vma = find_vma(mm, address);
> > > if (!vma)
> > > goto unlock;
> > > file = READ_ONCE(vma->vm_file);
> > > if (!file)
> > > goto unlock;
> > > inode = file->f_inode;
> > > if (file != READ_ONCE(vma->vm_file))
> > > inode = NULL;
> >
> > remove_vma() does not clear vm_file, nor do I think we ever re-assign
> > this field after it is set on creation.
>
> Quite correct and even if we clear vm_file in remove_vma() and/or
> reset it on creation I don't think that would be enough. IIUC the
> warning about SLAB_TYPESAFE_BY_RCU here:
> https://elixir.bootlin.com/linux/v6.10.2/source/include/linux/slab.h#L98
> means that the vma object can be reused in the same RCU grace period.
>
> >
> > That is, I'm struggling to see what this would do. AFAICT this can still
> > happen:
> >
> > rcu_read_lock();
> > vma = find_vma();
> > remove_vma()
> > fput(vma->vm_file);
> > dup_fd)
> > newf = kmem_cache_alloc(...)
> > newf->f_inode = blah
> >
>
> Imagine that the vma got freed and reused at this point. Then
> vma->vm_file might be pointing to a valid but a completely different
> file.
>
> > file = READ_ONCE(vma->vm_file);
> > inode = file->f_inode; // blah
> > if (file != READ_ONCE(vma->vm_file)) // still match
>
> I think we should also check that the VMA represents the same area
> after we obtained the inode.
>
> >
> >
> > > unlock:
> > > rcu_read_unlock();
> > >
> > > if (inode)
> > > return inode;
> > > mmap_read_lock();
> > > vma = find_vma(mm, address);
> > > ...
> > >
> > > I think this would be safe because 'vma' will not be reused while we
> > > hold the read lock, and while 'file' might be reused, whatever f_inode
> > > points to won't be used if vm_file is no longer what it once was.
> >
> >
> > Also, we need vaddr_to_offset() which needs additional serialization
> > against vma_lock.
Is there any reason why the approach below won't work? I basically
open-coded the logic in find_active_uprobe(), doing find_vma() under
RCU lock, then fetching vma->vm_file->f_inode. If at any point any
check fails, we fallback to mmap_read_lock-protected logic, but if
not, we just validate that vma->vm_lock_seq didn't change in between
all this.
Note that we don't need to grab inode refcount, we already keep the
reference to inodes in uprobes_tree, so if we find a match (using
find_uprobe() call), then we are good. As long as that
vma->vm_file->f_inode chain didn't change, of course.
Is vma->vm_lock_seq updated on any change to a VMA? This seems to work
fine in practice, but would be good to know for sure.
Author: Andrii Nakryiko <andrii@...nel.org>
Date: Fri Aug 2 22:16:40 2024 -0700
uprobes: add speculative lockless VMA to inode resolution
Signed-off-by: Andrii Nakryiko <andrii@...nel.org>
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8be9e34e786a..e21b68a39f13 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2251,6 +2251,52 @@ static struct uprobe
*find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
struct uprobe *uprobe = NULL;
struct vm_area_struct *vma;
+#ifdef CONFIG_PER_VMA_LOCK
+ vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE, vm_flags;
+ struct file *vm_file;
+ struct inode *vm_inode;
+ unsigned long vm_pgoff, vm_start, vm_end;
+ int vm_lock_seq;
+ loff_t offset;
+
+ rcu_read_lock();
+
+ vma = vma_lookup(mm, bp_vaddr);
+ if (!vma)
+ goto retry_with_lock;
+
+ vm_lock_seq = READ_ONCE(vma->vm_lock_seq);
+
+ vm_file = READ_ONCE(vma->vm_file);
+ vm_flags = READ_ONCE(vma->vm_flags);
+ if (!vm_file || (vm_flags & flags) != VM_MAYEXEC)
+ goto retry_with_lock;
+
+ vm_inode = READ_ONCE(vm_file->f_inode);
+ vm_pgoff = READ_ONCE(vma->vm_pgoff);
+ vm_start = READ_ONCE(vma->vm_start);
+ vm_end = READ_ONCE(vma->vm_end);
+ if (bp_vaddr < vm_start || bp_vaddr >= vm_end)
+ goto retry_with_lock;
+
+ offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
+ uprobe = find_uprobe_rcu(vm_inode, offset);
+ if (!uprobe)
+ goto retry_with_lock;
+
+ /* now double check that nothing about VMA changed */
+ if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq))
+ goto retry_with_lock;
+
+ /* happy case, we speculated successfully */
+ rcu_read_unlock();
+ return uprobe;
+
+retry_with_lock:
+ rcu_read_unlock();
+ uprobe = NULL;
+#endif
+
mmap_read_lock(mm);
vma = vma_lookup(mm, bp_vaddr);
if (vma) {
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..211a84ee92b4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
NULL);
files_cachep = kmem_cache_create("files_cache",
sizeof(struct files_struct), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
NULL);
fs_cachep = kmem_cache_create("fs_cache",
sizeof(struct fs_struct), 0,
Powered by blists - more mailing lists