Message-ID: <CAADnVQLvpwLp=t1oz3ic-EKnaio2DhOCanmuBQ+8nSf-jzBePw@mail.gmail.com>
Date: Mon, 4 Nov 2024 09:53:20 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Xu Kuohai <xukuohai@...weicloud.com>, Martin KaFai Lau <martin.lau@...nel.org>
Cc: bpf <bpf@...r.kernel.org>, Network Development <netdev@...r.kernel.org>,
Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>,
Eduard Zingerman <eddyz87@...il.com>, Yonghong Song <yonghong.song@...ux.dev>,
Kui-Feng Lee <thinker.li@...il.com>
Subject: Re: [PATCH bpf-next v2] bpf: Add kernel symbol for struct_ops trampoline
On Mon, Nov 4, 2024 at 3:55 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>
> >>          *(unsigned long *)(udata + moff) = prog->aux->id;
> >> +
> >> +        /* init ksym for this trampoline */
> >> +        bpf_struct_ops_ksym_init(prog, image + trampoline_start,
> >> +                                 image_off - trampoline_start,
> >> +                                 ksym++);
> >
> > Thanks for the patch.
> > I think it's overkill to add a ksym for each callsite within a single
> > trampoline.
> > 1. The prog name will be next in the stack. No need to duplicate it.
> > 2. ksym-ing callsites this way is quite unusual.
> > 3. consider an irq on other insns within a trampoline.
> > The unwinder won't find anything in such a case.
> >
> > So I suggest adding only one ksym that covers the whole trampoline.
> > The name could be "bpf_trampoline_structopsname",
> > where structopsname is probably st_ops_desc->type.
> >
>
> IIUC, the "whole trampoline" for a struct_ops is actually the page
> array st_map->image_pages[MAX_TRAMP_IMAGE_PAGES], where each page is
> allocated by arch_alloc_bpf_trampoline(PAGE_SIZE).
>
> Since the virtual addresses of these pages are *NOT* guaranteed to
> be contiguous, I don't think we can create a single ksym for them.
>
> And if we add a ksym for each individual page, it seems we will end
> up with an odd name for each ksym.
I see. Good point. Ok, let's add a ksym for each callback.
> Given that each page consists of one or more bpf trampolines, which
> are no different from bpf trampolines for other prog types, such as
> bpf trampolines for fentry, and since each bpf trampoline for other
> prog types already has a ksym, I think it is not unusual to add a
> ksym for each single bpf trampoline in the page.
>
> And since there are no instructions between adjacent bpf trampolines
> within a page, nothing between two trampolines can be interrupted.
>
> For the name, bpf_trampoline_<struct_ops_name>_<member_name>, like
> bpf_trampoline_tcp_congestion_ops_pkts_acked, seems appropriate.
Agree. This naming convention makes sense.
I'd only shorten the prefix to 'bpf_tramp_' or even 'bpf__'
(with double underscore).
It's kinda obvious that it's a trampoline and it's an implementation
detail that doesn't need to be present in the name.
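Something like this, just for illustration (the helper name and buffer
handling are hypothetical, not part of the patch):

static void bpf_struct_ops_ksym_name(char *buf, size_t len,
                                     const char *st_ops_name,
                                     const char *member_name)
{
        /* e.g. "bpf__tcp_congestion_ops_pkts_acked" */
        snprintf(buf, len, "bpf__%s_%s", st_ops_name, member_name);
}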
>
> >>          }
> >>
> >>          if (st_ops->validate) {
> >> @@ -790,6 +829,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> >>  unlock:
> >>          kfree(tlinks);
> >>          mutex_unlock(&st_map->lock);
> >> +        if (!err)
> >> +                bpf_struct_ops_map_ksyms_add(st_map);
> >>          return err;
> >>  }
> >>
> >> @@ -883,6 +924,10 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
> >>           */
> >>          synchronize_rcu_mult(call_rcu, call_rcu_tasks);
> >>
> >> +        /* no trampoline in the map is running anymore, delete symbols */
> >> +        bpf_struct_ops_map_ksyms_del(st_map);
> >> +        synchronize_rcu();
> >> +
> >
> > This is substantial overhead, and why is it needed?
> > synchronize_rcu_mult() is right above.
> >
>
> I think we should ensure no trampoline is running or could run before
> its ksym is deleted from the symbol table. If this order is not ensured,
> a trampoline can be interrupted by a perf irq after its symbol is deleted,
> resulting in a broken stacktrace, since the trampoline symbol could not be
> found by the perf irq handler.
>
> This patch deletes ksyms after synchronize_rcu_mult() to ensure this order.
But the overhead is prohibitive. We have had broken stacks with st_ops
for a long time, so it may still hit the 0.001% of cases where st_ops are
being switched, as the comment in bpf_struct_ops_map_free() explains.
As a separate cleanup I would switch the freeing to call_rcu_tasks.
Synchronous waiting is expensive.
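Roughly, as a sketch only (it assumes an rcu_head field is added to
struct bpf_struct_ops_map; the callback name is made up):

static void bpf_struct_ops_map_free_rcu(struct rcu_head *head)
{
        struct bpf_struct_ops_map *st_map =
                container_of(head, struct bpf_struct_ops_map, rcu);

        /* no trampoline in the map can be running at this point,
         * so dropping the ksyms and freeing the images is safe
         */
        bpf_struct_ops_map_ksyms_del(st_map);
        __bpf_struct_ops_map_free(&st_map->map);
}

and the free path would do
        call_rcu_tasks(&st_map->rcu, bpf_struct_ops_map_free_rcu);
instead of waiting synchronously.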
Martin,
any suggestions?