linux-kernel - Re: [PATCH bpf-next 1/4] selftests/bpf: Emit nop,nop5 instructions for x86

Message-ID: <CAEf4BzaETfgoAOuVgA8r37Aso2yxQRVe8=KxGV7+B9LqPzduXw@mail.gmail.com>
Date: Mon, 24 Nov 2025 09:29:01 -0800
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Jiri Olsa <jolsa@...nel.org>
Cc: Andrii Nakryiko <andrii@...nel.org>, bpf@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>, 
	John Fastabend <john.fastabend@...il.com>
Subject: Re: [PATCH bpf-next 1/4] selftests/bpf: Emit nop,nop5 instructions
 for x86_64 usdt probe

On Mon, Nov 17, 2025 at 12:36 AM Jiri Olsa <jolsa@...nel.org> wrote:
>
> We can currently optimize uprobes on top of nop5 instructions,
> so application can define USDT_NOP to nop5 and use USDT macro
> to define optimized usdt probes.

Thanks for working on this, and sorry for the delay; I was
travelling last week.

>
> This works fine on new kernels, but could have performance penalty
> on older kernels, that do not have the support to optimize and to
> emulate nop5 instruction.
>
>   execution of the usdt probe on top of nop:
>   - nop -> trigger usdt -> emulate nop -> continue
>
>   execution of the usdt probe on top of nop5:
>   - nop5 -> trigger usdt -> single step nop5 -> continue
>
> Note the 'single step nop5' as the source of performance regression.

nit: I get what you are saying, but I don't think the above
explanation is as clear as it could be. Maybe simplify the reasoning
by saying that until Linux vX.Y the kernel's uprobe implementation
treated nop5 as an instruction that needs to be single-stepped. Newer
kernels, on the other hand, can handle nop5 very fast (when a uprobe
is installed on top of it). That creates a dilemma: we want nop5 on
new kernels and nop1 on old ones, but we can't know upfront which
kernel we'll run on. Hence this whole patch set, which is trying to
have its cake and eat it too ;)
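
To make the dilemma concrete, here is a minimal sketch of the choice a
tool would make at attach time, assuming the probe site is the 6-byte
nop,nop5 sequence from this patch. The function name and the
kernel_optimizes_nop5 flag are hypothetical; how kernel support is
actually probed is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* nop (0x90) followed by the 5-byte nop, as emitted by the patch on x86_64 */
static const uint8_t usdt_nop1_nop5[] = { 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * Hypothetical helper (not libbpf API): pick the address to put the
 * uprobe on. loc_addr is the probe location from the USDT note and
 * insn points at the instruction bytes found at that location.
 */
static uint64_t usdt_pick_attach_addr(uint64_t loc_addr, const uint8_t *insn,
				      size_t insn_len, bool kernel_optimizes_nop5)
{
	bool is_nop1_nop5 = insn_len >= sizeof(usdt_nop1_nop5) &&
			    memcmp(insn, usdt_nop1_nop5, sizeof(usdt_nop1_nop5)) == 0;

	/* New kernel: skip the 1-byte nop and attach to the nop5 after it. */
	if (is_nop1_nop5 && kernel_optimizes_nop5)
		return loc_addr + 1;

	/* Old kernel or plain site: stay on the first (1-byte) nop. */
	return loc_addr;
}
```

On an old kernel this leaves the uprobe on nop1, which is emulated, so
nothing regresses; only kernels that can optimize nop5 ever see a
uprobe on it.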

>
> To workaround that we change the USDT macro to emit nop,nop5 for
> the probe (instead of default nop) and make record of that in
> USDT record (more on that below).
>
> This can be detected by application (libbpf) and it can place the
> uprobe either on nop or nop5 based on the optimization support in
> the kernel.
>
> We make record of using the nop,nop5 instructions in the USDT ELF
> note data.
>
> Current elf note format is as follows:
>
>   namesz (4B) | descsz (4B) | type (4B) | name | desc
>
> And current usdt record (with "stapsdt" name) placed in the note's
> desc data look like:
>
>   loc_addr  | 8 bytes
>   base_addr | 8 bytes
>   sema_addr | 8 bytes
>   provider  | zero terminated string
>   name      | zero terminated string
>   args      | zero terminated string
>
> None of the tested parsers (bpftrace-bcc, libbpf) checked that the args
> zero terminated byte is the actual end of the 'desc' data. As Andrii
> suggested we could use this and place extra zero byte right there as an
> indication for the parser we use the nop,nop5 instructions.
>
> It's bit tricky, but the other way would be to introduce new elf note type
> or note name and change all existing parsers to recognize it. With the change
> above the existing parsers would still recognize such usdt probes.

... and use the safer (performance-wise) nop1 as the uprobe target.

We can treat this extra zero as a backwards-compatible extension of
the provider+name+args section. If we ever need extra flags or extra
information (e.g., argument names or whatnot), we can treat it either
as a fourth string or as a single-byte length prefix for fixed-size
binary extra information that follows (right now it's zero, so
forward-compatible). For now, a single extra zero is the least amount
of work, good enough to solve the problem, and extendable in the
future.
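
As a rough illustration of the parser side, the sketch below walks the
desc layout quoted above and reports whether one more zero byte follows
the args terminator inside descsz. The struct and function names are
hypothetical (this is not the libbpf parser), and it assumes a 64-bit
note read on a host with the same endianness.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type, not a libbpf structure. */
struct usdt_desc {
	uint64_t loc_addr;
	uint64_t base_addr;
	uint64_t sema_addr;
	const char *provider;
	const char *name;
	const char *args;
	bool has_extra_zero;	/* extra '\0' after args: nop,nop5 was emitted */
};

/* Return a pointer just past the NUL terminating the string at p, or NULL. */
static const char *skip_strz(const char *p, const char *end)
{
	const char *nul = memchr(p, '\0', (size_t)(end - p));

	return nul ? nul + 1 : NULL;
}

/* Parse the desc part of a "stapsdt" note; returns 0 on success, -1 if malformed. */
static int parse_usdt_desc(const void *desc, size_t descsz, struct usdt_desc *out)
{
	const char *p = desc, *end = p + descsz;

	if (descsz < 3 * sizeof(uint64_t))
		return -1;
	memcpy(&out->loc_addr, p, sizeof(uint64_t));
	memcpy(&out->base_addr, p + 8, sizeof(uint64_t));
	memcpy(&out->sema_addr, p + 16, sizeof(uint64_t));
	p += 3 * sizeof(uint64_t);

	out->provider = p;
	if (!(p = skip_strz(p, end)))
		return -1;
	out->name = p;
	if (!(p = skip_strz(p, end)))
		return -1;
	out->args = p;
	if (!(p = skip_strz(p, end)))
		return -1;

	/* Anything left inside descsz is the extension; today it is one zero byte. */
	out->has_extra_zero = p < end && *p == '\0';
	return 0;
}
```

A parser that stops after args never looks at the trailing byte, which
is why existing tools keep working unchanged.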

>
> Note we do not emit this extra byte if app defined its own nop through
> USDT_NOP macro.
>
> Suggested-by: Andrii Nakryiko <andrii@...nel.org>
> Signed-off-by: Jiri Olsa <jolsa@...nel.org>
> ---
>  tools/testing/selftests/bpf/usdt.h | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/usdt.h b/tools/testing/selftests/bpf/usdt.h
> index 549d1f774810..57fa2902136c 100644
> --- a/tools/testing/selftests/bpf/usdt.h
> +++ b/tools/testing/selftests/bpf/usdt.h
> @@ -312,9 +312,16 @@ struct usdt_sema { volatile unsigned short active; };
>  #ifndef USDT_NOP
>  #if defined(__ia64__) || defined(__s390__) || defined(__s390x__)
>  #define USDT_NOP                       nop 0
> +#elif defined(__x86_64__)
> +#define USDT_NOP                       .byte 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x0 /* nop, nop5 */
>  #else
>  #define USDT_NOP                       nop
>  #endif
> +#else
> +/*
> + * User define its own nop instruction, do not emit extra note data.
> + */
> +#define __usdt_asm_extra

I'd guard this with #ifndef, just in case the user does want a custom
USDT_NOP while still emitting that extra zero (e.g., if they have
nop1 + nop5 + some extra they need for logging or whatever).
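
Something along these lines, as a sketch of the #else branch only. The
application-side definitions at the top are hypothetical and simply
mirror what the patch emits for x86_64; the per-architecture defaults
are elided.

```c
/*
 * Application side (hypothetical): custom nop sequence, but the extra
 * zero byte in the note is still wanted. Defined before including usdt.h.
 */
#define USDT_NOP		.byte 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00, 0x90
#define __usdt_asm_extra	__usdt_asm1(.ascii "\0")

/* usdt.h side */
#ifndef USDT_NOP
/* ... per-architecture USDT_NOP defaults ... */
#else
/* User defined its own nop; emit no extra note data unless asked to. */
#ifndef __usdt_asm_extra
#define __usdt_asm_extra
#endif
#endif /* USDT_NOP */
```

With the guard in place, a user-provided __usdt_asm_extra survives, and
users who only override USDT_NOP still get the empty default.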

>  #endif /* USDT_NOP */
>
>  /*
> @@ -403,6 +410,15 @@ struct usdt_sema { volatile unsigned short active; };
>         __asm__ __volatile__ ("" :: "m" (sema));
>  #endif
>
> +#ifndef __usdt_asm_extra
> +#ifdef __x86_64__
> +#define __usdt_asm_extra                                                                       \
> +       __usdt_asm1(            .ascii "\0")

nit: keep it on a single line


btw, the source of truth for usdt.h is on GitHub; please send a proper
PR with these changes there, and then we can just sync the upstream
version into selftests.

pw-bot: cr




> +#else
> +#define __usdt_asm_extra
> +#endif
> +#endif
> +
>  /* main USDT definition (nop and .note.stapsdt metadata) */
>  #define __usdt_probe(group, name, sema_def, sema, ...) do {                                    \
>         sema_def(sema)                                                                          \
> @@ -420,6 +436,7 @@ struct usdt_sema { volatile unsigned short active; };
>         __usdt_asm_strz(name)                                                                   \
>         __usdt_asm_args(__VA_ARGS__)                                                            \
>         __usdt_asm1(            .ascii "\0")                                                    \
> +       __usdt_asm_extra                                                                        \
>         __usdt_asm1(994:        .balign 4)                                                      \
>         __usdt_asm1(            .popsection)                                                    \
>         __usdt_asm1(.ifndef _.stapsdt.base)                                                     \
> --
> 2.51.1
>