[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aSa6d8bm7Jm6Ojn7@krava>
Date: Wed, 26 Nov 2025 09:29:43 +0100
From: Jiri Olsa <olsajiri@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Andrii Nakryiko <andrii@...nel.org>, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org, Song Liu <songliubraving@...com>,
Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>
Subject: Re: [PATCH bpf-next 1/4] selftests/bpf: Emit nop,nop5 instructions
for x86_64 usdt probe
On Mon, Nov 24, 2025 at 09:29:01AM -0800, Andrii Nakryiko wrote:
> On Mon, Nov 17, 2025 at 12:36 AM Jiri Olsa <jolsa@...nel.org> wrote:
> >
> > We can currently optimize uprobes on top of nop5 instructions,
> > so application can define USDT_NOP to nop5 and use USDT macro
> > to define optimized usdt probes.
>
> Thanks for working on this and sorry for the delay, I've been
> travelling last week.
>
> >
> > This works fine on new kernels, but could have performance penalty
> > on older kernels, that do not have the support to optimize and to
> > emulate nop5 instruction.
> >
> > execution of the usdt probe on top of nop:
> > - nop -> trigger usdt -> emulate nop -> continue
> >
> > execution of the usdt probe on top of nop5:
> > - nop5 -> trigger usdt -> single step nop5 -> continue
> >
> > Note the 'single step nop5' as the source of performance regression.
>
> nit: I get what you are saying, but I don't think the above
> explanation is actually as clear as it could be. Try to simplify the
> reasoning maybe by saying that until Linux vX.Y kerne's uprobe
> implementation treated nop5 as an instruction that needs to be
> single-stepped. Newer kernels, on the other hand, can handle nop5
> very-very fast (when uprobe is installed on top of them). Which
> creates a dilemma where we want nop5 on new kernels, nop1 on old ones,
> but we can't know upfront which kernel we'll run on. And thus the
> whole patch set that's trying to have the cake and eat it too ;)
ok, will try ;-)
>
> >
> > To workaround that we change the USDT macro to emit nop,nop5 for
> > the probe (instead of default nop) and make record of that in
> > USDT record (more on that below).
> >
> > This can be detected by application (libbpf) and it can place the
> > uprobe either on nop or nop5 based on the optimization support in
> > the kernel.
> >
> > We make record of using the nop,nop5 instructions in the USDT ELF
> > note data.
> >
> > Current elf note format is as follows:
> >
> > namesz (4B) | descsz (4B) | type (4B) | name | desc
> >
> > And current usdt record (with "stapsdt" name) placed in the note's
> > desc data look like:
> >
> > loc_addr | 8 bytes
> > base_addr | 8 bytes
> > sema_addr | 8 bytes
> > provider | zero terminated string
> > name | zero terminated string
> > args | zero terminated string
> >
> > None of the tested parsers (bpftrace-bcc, libbpf) checked that the args
> > zero terminated byte is the actual end of the 'desc' data. As Andrii
> > suggested we could use this and place extra zero byte right there as an
> > indication for the parser we use the nop,nop5 instructions.
> >
> > It's bit tricky, but the other way would be to introduce new elf note type
> > or note name and change all existing parsers to recognize it. With the change
> > above the existing parsers would still recognize such usdt probes.
>
> ... and use safer (performance-wise) nop1 as uprobe target.
>
> We can treat this extra zero as a backwards-compatible extension of
> provider+name+args section. If we ever need to have some extra flags
> or extra information (e.g., argument names or whatnot), we can treat
> this as either a fourth string or think about this as a single-byte
> length prefix for a fixed binary extra information that should follow
> (right now it's zero, so forward-compatible). For now just extra zero
> is the least amount of work but good enough to solve the problem,
> while being extendable for the future.
ok, makes sense
>
> >
> > Note we do not emit this extra byte if app defined its own nop through
> > USDT_NOP macro.
> >
> > Suggested-by: Andrii Nakryiko <andrii@...nel.org>
> > Signed-off-by: Jiri Olsa <jolsa@...nel.org>
> > ---
> > tools/testing/selftests/bpf/usdt.h | 17 +++++++++++++++++
> > 1 file changed, 17 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/usdt.h b/tools/testing/selftests/bpf/usdt.h
> > index 549d1f774810..57fa2902136c 100644
> > --- a/tools/testing/selftests/bpf/usdt.h
> > +++ b/tools/testing/selftests/bpf/usdt.h
> > @@ -312,9 +312,16 @@ struct usdt_sema { volatile unsigned short active; };
> > #ifndef USDT_NOP
> > #if defined(__ia64__) || defined(__s390__) || defined(__s390x__)
> > #define USDT_NOP nop 0
> > +#elif defined(__x86_64__)
> > +#define USDT_NOP .byte 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x0 /* nop, nop5 */
> > #else
> > #define USDT_NOP nop
> > #endif
> > +#else
> > +/*
> > + * User define its own nop instruction, do not emit extra note data.
> > + */
> > +#define __usdt_asm_extra
>
> I'd guard this with ifndef, just in case user do want custom USDT_NOP
> while emitting that extra zero (e.g., if they have nop1 + nop5 + some
> extra they need for logging or whatever).
ok
>
> > #endif /* USDT_NOP */
> >
> > /*
> > @@ -403,6 +410,15 @@ struct usdt_sema { volatile unsigned short active; };
> > __asm__ __volatile__ ("" :: "m" (sema));
> > #endif
> >
> > +#ifndef __usdt_asm_extra
> > +#ifdef __x86_64__
> > +#define __usdt_asm_extra \
> > + __usdt_asm1( .ascii "\0")
>
> nit: keep it single line
>
>
> btw, the source of truth for usdt.h is at Github, please send a proper
> PR with these change there, and then we can just sync upstream version
> into selftests?
yes, will do that
thanks,
jirka
Powered by blists - more mailing lists