Message-ID: <aFwS2EENyOFh7IbY@krava>
Date: Wed, 25 Jun 2025 17:16:40 +0200
From: Jiri Olsa <olsajiri@...il.com>
To: Masami Hiramatsu <mhiramat@...nel.org>
Cc: Oleg Nesterov <oleg@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Andrii Nakryiko <andrii@...nel.org>, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
x86@...nel.org, Song Liu <songliubraving@...com>,
Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>,
Hao Luo <haoluo@...gle.com>, Steven Rostedt <rostedt@...dmis.org>,
Alan Maguire <alan.maguire@...cle.com>,
David Laight <David.Laight@...lab.com>,
Thomas Weißschuh <thomas@...ch.de>,
Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCHv3 perf/core 08/22] uprobes/x86: Add mapping for optimized
uprobe trampolines
On Wed, Jun 25, 2025 at 05:21:22PM +0900, Masami Hiramatsu wrote:
> On Thu, 5 Jun 2025 15:23:35 +0200
> Jiri Olsa <jolsa@...nel.org> wrote:
>
> > Adding support to add special mapping for user space trampoline with
> > following functions:
> >
> > uprobe_trampoline_get - find or add uprobe_trampoline
> > uprobe_trampoline_put - remove or destroy uprobe_trampoline
> >
> > The user space trampoline is exported as arch specific user space special
> > mapping through tramp_mapping, which is initialized in following changes
> > with new uprobe syscall.
> >
> > The uprobe trampoline needs to be callable/reachable from the probed address,
> > so while searching for available address we use is_reachable_by_call function
> > to decide if the uprobe trampoline is callable from the probe address.
> >
> > All uprobe_trampoline objects are stored in uprobes_state object and are
> > cleaned up when the process mm_struct goes down. Adding new arch hooks
> > for that, because this change is x86_64 specific.
> >
> > Locking is provided by callers in following changes.
> >
> > Acked-by: Oleg Nesterov <oleg@...hat.com>
> > Signed-off-by: Jiri Olsa <jolsa@...nel.org>
> > ---
> > arch/x86/kernel/uprobes.c | 115 ++++++++++++++++++++++++++++++++++++++
> > include/linux/uprobes.h | 6 ++
> > kernel/events/uprobes.c | 10 ++++
> > kernel/fork.c | 1 +
> > 4 files changed, 132 insertions(+)
> >
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index 77050e5a4680..0295cfb625c0 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -608,6 +608,121 @@ static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
> > *sr = utask->autask.saved_scratch_register;
> > }
> > }
> > +
> > +static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
> > +{
> > + return -EPERM;
> > +}
> > +
> > +static struct page *tramp_mapping_pages[2] __ro_after_init;
> > +
> > +static struct vm_special_mapping tramp_mapping = {
> > + .name = "[uprobes-trampoline]",
> > + .mremap = tramp_mremap,
> > + .pages = tramp_mapping_pages,
> > +};
> > +
> > +struct uprobe_trampoline {
> > + struct hlist_node node;
> > + unsigned long vaddr;
> > +};
> > +
> > +static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
> > +{
> > + long delta = (long)(vaddr + 5 - vtramp);
> > +
> > + return delta >= INT_MIN && delta <= INT_MAX;
> > +}
> > +
> > +static unsigned long find_nearest_page(unsigned long vaddr)
>
> nit: this does not return the nearest one, but the highest one?
...
>
> If you really need the nearest one, we need to call
> vm_unmapped_area() twice.
>
> [low_limit, call_end] with TOPDOWN flag and
> [call_end, high_limit] without TOPDOWN.
>
> and choose the nearest one. But I don't think we need it.
ugh, you're right, let's rename it to find_reachable_page ?
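fwiw the reachability condition itself is simple, a quick userspace sketch
(not the kernel code, just the same arithmetic): a 5-byte "call rel32" at
vaddr can reach the trampoline iff the displacement fits in a signed 32-bit
immediate, measured from the end of the call instruction:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Userspace sketch of is_reachable_by_call(): the rel32 displacement
 * of a 5-byte near call at vaddr targeting vtramp is
 * vtramp - (vaddr + 5); here we check its negation against the
 * signed 32-bit range, same as the patch does. */
static bool reachable_by_call(unsigned long vtramp, unsigned long vaddr)
{
	long delta = (long)(vaddr + 5 - vtramp);

	return delta >= INT_MIN && delta <= INT_MAX;
}
```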
>
> > +{
> > + struct vm_unmapped_area_info info = {
> > + .length = PAGE_SIZE,
> > + .align_mask = ~PAGE_MASK,
> > + .flags = VM_UNMAPPED_AREA_TOPDOWN,
> > + .low_limit = PAGE_SIZE,
> > + .high_limit = ULONG_MAX,
>
> Maybe "TASK_SIZE" is better than ULONG_MAX?
ok
>
> > + };
> > + unsigned long limit, call_end = vaddr + 5;
> > +
> > + if (!check_add_overflow(call_end, INT_MIN, &limit))
> > + info.low_limit = limit;
> > + if (!check_add_overflow(call_end, INT_MAX, &limit))
> > + info.high_limit = limit;
> > + return vm_unmapped_area(&info);
> > +}
> > +
> > +static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> > +{
> > + struct pt_regs *regs = task_pt_regs(current);
> > + struct mm_struct *mm = current->mm;
> > + struct uprobe_trampoline *tramp;
> > + struct vm_area_struct *vma;
> > +
> > + if (!user_64bit_mode(regs))
> > + return NULL;
> > +
> > + vaddr = find_nearest_page(vaddr);
> > + if (IS_ERR_VALUE(vaddr))
> > + return NULL;
> > +
> > + tramp = kzalloc(sizeof(*tramp), GFP_KERNEL);
> > + if (unlikely(!tramp))
> > + return NULL;
> > +
> > + tramp->vaddr = vaddr;
> > + vma = _install_special_mapping(mm, tramp->vaddr, PAGE_SIZE,
>
> Just make sure, this special mapped page is mapped 1 page for each
> uprobe? (I think uprobe syscall trampoline size is far smaller
> than the page size.)
so the trampoline is created for the first uprobe within a 4GB region of
the probed address and will be reused by other uprobes in that region
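fwiw the window that find_nearest_page() searches can be sketched like
this (userspace sketch, using __builtin_add_overflow, which is what the
kernel's check_add_overflow() wraps; default limits as in the patch,
ULONG_MAX to become TASK_SIZE per your comment above):

```c
#include <limits.h>

/* Sketch of the search window: the trampoline page must land in
 * [call_end + INT_MIN, call_end + INT_MAX] to be reachable by a
 * rel32 call at vaddr; on overflow the default limit is kept. */
static void reachable_window(unsigned long vaddr,
			     unsigned long *low, unsigned long *high)
{
	unsigned long call_end = vaddr + 5, limit;

	*low = 4096;		/* PAGE_SIZE default */
	*high = ULONG_MAX;	/* TASK_SIZE in the next revision */

	if (!__builtin_add_overflow(call_end, (long)INT_MIN, &limit))
		*low = limit;
	if (!__builtin_add_overflow(call_end, (long)INT_MAX, &limit))
		*high = limit;
}
```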
>
> > + VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> > + &tramp_mapping);
> > + if (IS_ERR(vma))
> > + goto free_area;
>
> nit: To simplify the code, instead of goto,
>
> if (IS_ERR(vma)) {
> kfree(tramp);
> return NULL;
> }
ok
>
> > + return tramp;
> > +
> > +free_area:
> > + kfree(tramp);
> > + return NULL;
> > +}
> > +
> > +__maybe_unused
> > +static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
> > +{
> > + struct uprobes_state *state = &current->mm->uprobes_state;
> > + struct uprobe_trampoline *tramp = NULL;
> > +
> > + hlist_for_each_entry(tramp, &state->head_tramps, node) {
> > + if (is_reachable_by_call(tramp->vaddr, vaddr))
>
> This should set '*new = false;' here.
right, will fix, thanks
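i.e. the find-or-create path should set the flag on both branches,
something like this (userspace sketch with a plain singly linked list
standing in for the hlist, hypothetical names):

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

struct tramp {
	struct tramp *next;
	unsigned long vaddr;
};

static struct tramp *head;

static bool reachable(unsigned long vtramp, unsigned long vaddr)
{
	long delta = (long)(vaddr + 5 - vtramp);

	return delta >= INT_MIN && delta <= INT_MAX;
}

/* Fixed find-or-create: *new is set on BOTH paths. */
static struct tramp *get_tramp(unsigned long vaddr, bool *new)
{
	struct tramp *t;

	for (t = head; t; t = t->next) {
		if (reachable(t->vaddr, vaddr)) {
			*new = false;	/* the fix: reused trampoline */
			return t;
		}
	}

	t = calloc(1, sizeof(*t));
	if (!t)
		return NULL;
	t->vaddr = vaddr;	/* stands in for create_uprobe_trampoline() */
	*new = true;
	t->next = head;
	head = t;
	return t;
}
```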
>
> > + return tramp;
> > + }
> > +
> > + tramp = create_uprobe_trampoline(vaddr);
> > + if (!tramp)
> > + return NULL;
> > +
> > + *new = true;
> > + hlist_add_head(&tramp->node, &state->head_tramps);
> > + return tramp;
> > +}
> > +
> > +static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > +{
> > + hlist_del(&tramp->node);
> > + kfree(tramp);
>
> Don't we need to unmap the tramp->vaddr?
that's tricky, because we have no way to make sure the application is
no longer executing the trampoline; it's described in the changelog
of the following patch:
uprobes/x86: Add support to optimize uprobes
...
We do not unmap and release uprobe trampoline when it's no longer needed,
because there's no easy way to make sure none of the threads is still
inside the trampoline. But we do not waste memory, because there's just
single page for all the uprobe trampoline mappings.
We do waste a page frame for every 4GB region by keeping the uprobe
trampoline page mapped, but that seems ok.
...
>
> > +}
> > +
> > +void arch_uprobe_init_state(struct mm_struct *mm)
> > +{
> > + INIT_HLIST_HEAD(&mm->uprobes_state.head_tramps);
> > +}
> > +
> > +void arch_uprobe_clear_state(struct mm_struct *mm)
> > +{
> > + struct uprobes_state *state = &mm->uprobes_state;
> > + struct uprobe_trampoline *tramp;
> > + struct hlist_node *n;
> > +
> > + hlist_for_each_entry_safe(tramp, n, &state->head_tramps, node)
> > + destroy_uprobe_trampoline(tramp);
> > +}
> > #else /* 32-bit: */
> > /*
> > * No RIP-relative addressing on 32-bit
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 5080619560d4..b40d33aae016 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -17,6 +17,7 @@
> > #include <linux/wait.h>
> > #include <linux/timer.h>
> > #include <linux/seqlock.h>
> > +#include <linux/mutex.h>
> >
> > struct uprobe;
> > struct vm_area_struct;
> > @@ -185,6 +186,9 @@ struct xol_area;
> >
> > struct uprobes_state {
> > struct xol_area *xol_area;
> > +#ifdef CONFIG_X86_64
>
> Maybe we can introduce struct arch_uprobe_state{} here?
ok, on top of that Andrii also asked for [1]:
- alloc 'struct uprobes_state' for mm_struct only when needed
could this be part of that follow-up? I'd rather not complicate this
patchset any further
[1] https://lore.kernel.org/bpf/CAEf4BzY2zKPM9JHgn_wa8yCr8q5KntE5w8g=AoT2MnrD2Dx6gA@mail.gmail.com/
>
> > + struct hlist_head head_tramps;
> > +#endif
> > };
> >
> > typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr,
> > @@ -233,6 +237,8 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
> > extern void *arch_uretprobe_trampoline(unsigned long *psize);
> > extern unsigned long uprobe_get_trampoline_vaddr(void);
> > extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> > +extern void arch_uprobe_clear_state(struct mm_struct *mm);
> > +extern void arch_uprobe_init_state(struct mm_struct *mm);
> > #else /* !CONFIG_UPROBES */
> > struct uprobes_state {
> > };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 6795b8d82b9c..acec91a676b7 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -1802,6 +1802,14 @@ static struct xol_area *get_xol_area(void)
> > return area;
> > }
> >
> > +void __weak arch_uprobe_clear_state(struct mm_struct *mm)
> > +{
> > +}
> > +
> > +void __weak arch_uprobe_init_state(struct mm_struct *mm)
> > +{
> > +}
> > +
> > /*
> > * uprobe_clear_state - Free the area allocated for slots.
> > */
> > @@ -1813,6 +1821,8 @@ void uprobe_clear_state(struct mm_struct *mm)
> > delayed_uprobe_remove(NULL, mm);
> > mutex_unlock(&delayed_uprobe_lock);
> >
> > + arch_uprobe_clear_state(mm);
> > +
> > if (!area)
> > return;
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 1ee8eb11f38b..7108ca558518 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1010,6 +1010,7 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
> > {
> > #ifdef CONFIG_UPROBES
> > mm->uprobes_state.xol_area = NULL;
> > + arch_uprobe_init_state(mm);
> > #endif
>
> Can't we make this uprobe_init_state(mm)?
hum, there are other mm_init_* functions around, I guess we should keep
the same pattern?
unless you mean s/arch_uprobe_init_state/uprobe_init_state/, but that's
arch code.. so I'm not sure what you mean ;-)
thanks for the review,
jirka