Message-ID: <CAADnVQKnJkJpWkuxC32UPc4cvTnT2+YEnm8TktrEnDNO7ZbCdA@mail.gmail.com>
Date: Fri, 1 Nov 2024 11:19:32 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Xu Kuohai <xukuohai@...weicloud.com>
Cc: bpf <bpf@...r.kernel.org>, Network Development <netdev@...r.kernel.org>, 
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, 
	Eduard Zingerman <eddyz87@...il.com>, Yonghong Song <yonghong.song@...ux.dev>, 
	Kui-Feng Lee <thinker.li@...il.com>
Subject: Re: [PATCH bpf-next v2] bpf: Add kernel symbol for struct_ops trampoline

On Fri, Nov 1, 2024 at 4:08 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@...wei.com>
>
> Without kernel symbols for struct_ops trampolines, the unwinder may
> produce unexpected stacktraces.
>
> For example, the x86 ORC and FP unwinders check whether an IP is in
> kernel text by verifying the presence of the IP's kernel symbol. When a
> struct_ops trampoline address is encountered, the unwinder stops due
> to the absence of a symbol, resulting in an incomplete stacktrace that
> consists only of the direct and indirect child functions called from
> the trampoline.
>
> The arm64 unwinder is another example. While the arm64 unwinder can
> proceed across a struct_ops trampoline address, the corresponding
> symbol name is displayed as "unknown", which is confusing.
>
> Thus, add a kernel symbol for each struct_ops trampoline. The name is
> bpf_trampoline_<PROG_NAME>, where PROG_NAME is the name of the
> struct_ops prog linked to the trampoline.
>
> Below is a comparison of stacktraces captured on x86 by perf record,
> before and after this patch.
>
> Before:
>
> ... FP chain: nr:4
> .....  0: ffffffffffffff80 # PERF_CONTEXT_KERNEL mark
> .....  1: ffffffff8116545d
> .....  2: ffffffff81167fcc
> .....  3: ffffffff813088f4
>  ... thread: iperf:595
>  ...... dso: /proc/kcore
> iperf     595 [002]  9015.616291:     824245  cycles:
>         ffffffff8116545d __lock_acquire+0xad ([kernel.kallsyms])
>         ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
>         ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms])
>
> After:
>
> ... FP chain: nr:44
> .....  0: ffffffffffffff80 # PERF_CONTEXT_KERNEL mark
> .....  1: ffffffff81165930
> .....  2: ffffffff81167fcc
> .....  3: ffffffff813088f4
> .....  4: ffffffffc000da5e
> .....  5: ffffffff81f243df
> .....  6: ffffffff81f27326
> .....  7: ffffffff81f3a3c3
> .....  8: ffffffff81f3c99b
> .....  9: ffffffff81ef9870
> ..... 10: ffffffff81ef9b13
> ..... 11: ffffffff81ef9c69
> ..... 12: ffffffff81ef9f47
> ..... 13: ffffffff81efa15d
> ..... 14: ffffffff81efa9c0
> ..... 15: ffffffff81d979eb
> ..... 16: ffffffff81d987e8
> ..... 17: ffffffff81ddce16
> ..... 18: ffffffff81bc7b90
> ..... 19: ffffffff81bcf677
> ..... 20: ffffffff81bd1b4f
> ..... 21: ffffffff81d99693
> ..... 22: ffffffff81d99a52
> ..... 23: ffffffff810c9eb2
> ..... 24: ffffffff810ca631
> ..... 25: ffffffff822367db
> ..... 26: ffffffff824015ef
> ..... 27: ffffffff811678e6
> ..... 28: ffffffff814f7d85
> ..... 29: ffffffff814f8119
> ..... 30: ffffffff81492fb9
> ..... 31: ffffffff81355c53
> ..... 32: ffffffff813d79d7
> ..... 33: ffffffff813d88fc
> ..... 34: ffffffff8139a52e
> ..... 35: ffffffff8139a661
> ..... 36: ffffffff8152c193
> ..... 37: ffffffff8152cbc5
> ..... 38: ffffffff814a5908
> ..... 39: ffffffff814a72d3
> ..... 40: ffffffff814a758b
> ..... 41: ffffffff81008869
> ..... 42: ffffffff822323e8
> ..... 43: ffffffff8240012f

The above is visual noise.
Pls remove such addr dumps from the commit log.
The below part is enough.

>  ... thread: sleep:493
>  ...... dso: /proc/kcore
> sleep     493 [000]    55.483168:     410428  cycles:
>         ffffffff81165930 __lock_acquire+0x580 ([kernel.kallsyms])
>         ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
>         ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms])
>         ffffffffc000da5e bpf_trampoline_bpf_prog_075f577900bac1d2_bpf_cubic_acked+0x3a ([kernel.kallsyms])
>         ffffffff81f243df tcp_ack+0xd4f ([kernel.kallsyms])
>         ffffffff81f27326 tcp_rcv_established+0x3b6 ([kernel.kallsyms])
>         ffffffff81f3a3c3 tcp_v4_do_rcv+0x193 ([kernel.kallsyms])
>         ffffffff81f3c99b tcp_v4_rcv+0x11fb ([kernel.kallsyms])
>         ffffffff81ef9870 ip_protocol_deliver_rcu+0x50 ([kernel.kallsyms])
>         ffffffff81ef9b13 ip_local_deliver_finish+0xb3 ([kernel.kallsyms])
>         ffffffff81ef9c69 ip_local_deliver+0x79 ([kernel.kallsyms])
>         ffffffff81ef9f47 ip_sublist_rcv_finish+0xb7 ([kernel.kallsyms])
>         ffffffff81efa15d ip_sublist_rcv+0x18d ([kernel.kallsyms])
>         ffffffff81efa9c0 ip_list_rcv+0x110 ([kernel.kallsyms])
>         ffffffff81d979eb __netif_receive_skb_list_core+0x21b ([kernel.kallsyms])
>         ffffffff81d987e8 netif_receive_skb_list_internal+0x208 ([kernel.kallsyms])
>         ffffffff81ddce16 napi_gro_receive+0xf6 ([kernel.kallsyms])
>         ffffffff81bc7b90 virtnet_receive_done+0x340 ([kernel.kallsyms])
>         ffffffff81bcf677 receive_buf+0xd7 ([kernel.kallsyms])
>         ffffffff81bd1b4f virtnet_poll+0xcbf ([kernel.kallsyms])
>         ffffffff81d99693 __napi_poll.constprop.0+0x33 ([kernel.kallsyms])
>         ffffffff81d99a52 net_rx_action+0x1c2 ([kernel.kallsyms])
>         ffffffff810c9eb2 handle_softirqs+0xe2 ([kernel.kallsyms])
>         ffffffff810ca631 irq_exit_rcu+0x91 ([kernel.kallsyms])
>         ffffffff822367db sysvec_apic_timer_interrupt+0x9b ([kernel.kallsyms])
>         ffffffff824015ef asm_sysvec_apic_timer_interrupt+0x1f ([kernel.kallsyms])
>         ffffffff811678e6 lock_release+0x186 ([kernel.kallsyms])
>         ffffffff814f7d85 prepend_path+0x395 ([kernel.kallsyms])
>         ffffffff814f8119 d_path+0x159 ([kernel.kallsyms])
>         ffffffff81492fb9 file_path+0x19 ([kernel.kallsyms])
>         ffffffff81355c53 perf_event_mmap+0x1e3 ([kernel.kallsyms])
>         ffffffff813d79d7 mmap_region+0x2e7 ([kernel.kallsyms])
>         ffffffff813d88fc do_mmap+0x4ec ([kernel.kallsyms])
>         ffffffff8139a52e vm_mmap_pgoff+0xde ([kernel.kallsyms])
>         ffffffff8139a661 vm_mmap+0x31 ([kernel.kallsyms])
>         ffffffff8152c193 elf_load+0xa3 ([kernel.kallsyms])
>         ffffffff8152cbc5 load_elf_binary+0x655 ([kernel.kallsyms])
>         ffffffff814a5908 bprm_execve+0x2a8 ([kernel.kallsyms])
>         ffffffff814a72d3 do_execveat_common.isra.0+0x193 ([kernel.kallsyms])
>         ffffffff814a758b __x64_sys_execve+0x3b ([kernel.kallsyms])
>         ffffffff81008869 x64_sys_call+0x1399 ([kernel.kallsyms])
>         ffffffff822323e8 do_syscall_64+0x68 ([kernel.kallsyms])
>         ffffffff8240012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
>
> Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
> Signed-off-by: Xu Kuohai <xukuohai@...wei.com>
> Acked-by: Yonghong Song <yonghong.song@...ux.dev>
> ---
> v2:
> Refine the commit message for clarity and fix a test bot warning
>
> v1:
> https://lore.kernel.org/bpf/20241030111533.907289-1-xukuohai@huaweicloud.com/
> ---
>  include/linux/bpf.h         |  3 +-
>  kernel/bpf/bpf_struct_ops.c | 67 +++++++++++++++++++++++++++++++++++++
>  kernel/bpf/dispatcher.c     |  3 +-
>  kernel/bpf/trampoline.c     |  9 +++--
>  4 files changed, 78 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c3ba4d475174..46f8d6c1a55c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1402,7 +1402,8 @@ int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_func
>  void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from,
>                                 struct bpf_prog *to);
>  /* Called only from JIT-enabled code, so there's no need for stubs. */
> -void bpf_image_ksym_add(void *data, unsigned int size, struct bpf_ksym *ksym);
> +void bpf_image_ksym_init(void *data, unsigned int size, struct bpf_ksym *ksym);
> +void bpf_image_ksym_add(struct bpf_ksym *ksym);
>  void bpf_image_ksym_del(struct bpf_ksym *ksym);
>  void bpf_ksym_add(struct bpf_ksym *ksym);
>  void bpf_ksym_del(struct bpf_ksym *ksym);
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index fda3dd2ee984..172a081ed1c3 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -38,6 +38,9 @@ struct bpf_struct_ops_map {
>          * that stores the func args before calling the bpf_prog.
>          */
>         void *image_pages[MAX_TRAMP_IMAGE_PAGES];
> +       u32 ksyms_cnt;
> +       /* ksyms for bpf trampolines */
> +       struct bpf_ksym *ksyms;
>         /* The owner moduler's btf. */
>         struct btf *btf;
>         /* uvalue->data stores the kernel struct
> @@ -586,6 +589,35 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks,
>         return 0;
>  }
>
> +static void bpf_struct_ops_ksym_init(struct bpf_prog *prog, void *image,
> +                                    unsigned int size, struct bpf_ksym *ksym)
> +{
> +       int n;
> +
> +       n = strscpy(ksym->name, "bpf_trampoline_", KSYM_NAME_LEN);
> +       strncat(ksym->name + n, prog->aux->ksym.name, KSYM_NAME_LEN - 1 - n);
> +       INIT_LIST_HEAD_RCU(&ksym->lnode);
> +       bpf_image_ksym_init(image, size, ksym);
> +}
> +
> +static void bpf_struct_ops_map_ksyms_add(struct bpf_struct_ops_map *st_map)
> +{
> +       struct bpf_ksym *ksym = st_map->ksyms;
> +       struct bpf_ksym *end = ksym + st_map->ksyms_cnt;
> +
> +       while (ksym != end && ksym->start)
> +               bpf_image_ksym_add(ksym++);
> +}
> +
> +static void bpf_struct_ops_map_ksyms_del(struct bpf_struct_ops_map *st_map)
> +{
> +       struct bpf_ksym *ksym = st_map->ksyms;
> +       struct bpf_ksym *end = ksym + st_map->ksyms_cnt;
> +
> +       while (ksym != end && ksym->start)
> +               bpf_image_ksym_del(ksym++);
> +}
> +
>  static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>                                            void *value, u64 flags)
>  {
> @@ -601,6 +633,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>         int prog_fd, err;
>         u32 i, trampoline_start, image_off = 0;
>         void *cur_image = NULL, *image = NULL;
> +       struct bpf_ksym *ksym;
>
>         if (flags)
>                 return -EINVAL;
> @@ -640,6 +673,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>         kdata = &kvalue->data;
>
>         module_type = btf_type_by_id(btf_vmlinux, st_ops_ids[IDX_MODULE_ID]);
> +       ksym = st_map->ksyms;
>         for_each_member(i, t, member) {
>                 const struct btf_type *mtype, *ptype;
>                 struct bpf_prog *prog;
> @@ -735,6 +769,11 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>
>                 /* put prog_id to udata */
>                 *(unsigned long *)(udata + moff) = prog->aux->id;
> +
> +               /* init ksym for this trampoline */
> +               bpf_struct_ops_ksym_init(prog, image + trampoline_start,
> +                                        image_off - trampoline_start,
> +                                        ksym++);

Thanks for the patch.
I think it's overkill to add a ksym for each callsite within a single
trampoline.
1. The prog name will be next in the stack. No need to duplicate it.
2. ksym-ing callsites this way is quite unusual.
3. Consider an irq hitting other insns within the trampoline.
   The unwinder won't find anything in such a case.

So I suggest adding only one ksym that covers the whole trampoline.
The name could be "bpf_trampoline_structopsname",
which is probably st_ops_desc->type.
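
For illustration, a minimal sketch of that single-ksym direction, reusing
bpf_image_ksym_init() from this patch. The st_map->ksym field, the helper
name, and the lookup via st_ops_desc->type->name_off are assumptions for
the sketch, not the final code:

static void bpf_struct_ops_map_ksym_init(struct bpf_struct_ops_map *st_map,
                                         void *image, unsigned int size)
{
        struct bpf_ksym *ksym = &st_map->ksym;
        const char *tname;

        /* struct_ops type name, e.g. "tcp_congestion_ops" */
        tname = btf_name_by_offset(st_map->btf,
                                   st_map->st_ops_desc->type->name_off);
        /* one symbol covering the whole trampoline image */
        snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%s", tname);
        INIT_LIST_HEAD_RCU(&ksym->lnode);
        bpf_image_ksym_init(image, size, ksym);
}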

>         }
>
>         if (st_ops->validate) {
> @@ -790,6 +829,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>  unlock:
>         kfree(tlinks);
>         mutex_unlock(&st_map->lock);
> +       if (!err)
> +               bpf_struct_ops_map_ksyms_add(st_map);
>         return err;
>  }
>
> @@ -883,6 +924,10 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>          */
>         synchronize_rcu_mult(call_rcu, call_rcu_tasks);
>
> +       /* no trampoline in the map is running anymore, delete symbols */
> +       bpf_struct_ops_map_ksyms_del(st_map);
> +       synchronize_rcu();
> +

This is substantial overhead. Why is it needed?
synchronize_rcu_mult() is right above.
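
For concreteness, a rough sketch of the ordering that comment points at:
delete the ksyms before the existing combined grace period so that no
extra synchronize_rcu() is needed. The elided teardown and the
__bpf_struct_ops_map_free() call are placeholders based on the quoted
hunk, not the actual file:

static void bpf_struct_ops_map_free(struct bpf_map *map)
{
        struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;

        /* ... existing teardown elided ... */

        /* Drop the symbols before the combined wait below, so a single
         * grace period covers both the trampolines and any unwinder or
         * kallsyms reader still looking at these ksyms under RCU.
         */
        bpf_struct_ops_map_ksyms_del(st_map);

        synchronize_rcu_mult(call_rcu, call_rcu_tasks);

        __bpf_struct_ops_map_free(map);
}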

pw-bot: cr
