Message-ID: <51af2860-e448-4ed1-917c-5d195a4693b5@huaweicloud.com>
Date: Tue, 5 Nov 2024 17:29:43 +0800
From: Xu Kuohai <xukuohai@...weicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>,
Martin KaFai Lau <martin.lau@...nel.org>
Cc: bpf <bpf@...r.kernel.org>, Network Development <netdev@...r.kernel.org>,
Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau
<martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>,
Yonghong Song <yonghong.song@...ux.dev>, Kui-Feng Lee <thinker.li@...il.com>
Subject: Re: [PATCH bpf-next v2] bpf: Add kernel symbol for struct_ops trampoline
On 11/5/2024 1:53 AM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 3:55 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>>
>>>> *(unsigned long *)(udata + moff) = prog->aux->id;
>>>> +
>>>> + /* init ksym for this trampoline */
>>>> + bpf_struct_ops_ksym_init(prog, image + trampoline_start,
>>>> + image_off - trampoline_start,
>>>> + ksym++);
>>>
>>> Thanks for the patch.
>>> I think it's overkill to add ksym for each callsite within a single
>>> trampoline.
>>> 1. The prog name will be next in the stack. No need to duplicate it.
>>> 2. ksym-ing callsites this way is quite unusual.
>>> 3. consider irq on other insns within a trampoline.
>>> The unwinder won't find anything in such a case.
>>>
>>> So I suggest to add only one ksym that covers the whole trampoline.
>>> The name could be "bpf_trampoline_structopsname"
>>> that is probably st_ops_desc->type.
>>>
>>
>> IIUC, the "whole trampoline" for a struct_ops is actually the page
>> array st_map->image_pages[MAX_TRAMP_IMAGE_PAGES], where each page is
>> allocated by arch_alloc_bpf_trampoline(PAGE_SIZE).
>>
>> Since the virtual addresses of these pages are *NOT* guaranteed to
>> be contiguous, I don't think we can create a single ksym for them.
>>
>> And if we add a ksym for each individual page, it seems we will end
>> up with an odd name for each ksym.
>
> I see. Good point. Ok. Let's add ksym for each callback.
>
>> Given that each page consists of one or more bpf trampolines, which
>> are not different from bpf trampolines for other prog types, such as
>> bpf trampolines for fentry, and since each bpf trampoline for other
>> prog types already has a ksym, I think it is not unusual to add ksym
>> for each single bpf trampoline in the page.
>>
>> And, there are no instructions between adjacent bpf trampolines within
>> a page, so nothing between two trampolines can be interrupted.
>>
>> For the name, bpf_trampoline_<struct_ops_name>_<member_name>, like
>> bpf_trampoline_tcp_congestion_ops_pkts_acked, seems appropriate.
>
> Agree. This naming convention makes sense.
> I'd only shorten the prefix to 'bpf_tramp_' or even 'bpf__'
> (with double underscore).
> It's kinda obvious that it's a trampoline and it's an implementation
> detail that doesn't need to be present in the name.
>
OK, 'bpf__' looks great.
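
For illustration, the agreed name format bpf__<struct_ops_name>_<member_name>
could be built roughly as below. This is only a userspace sketch, not the code
from the patch: sketch_ksym_name() and SKETCH_KSYM_NAME_LEN are hypothetical
names, and the buffer size is an assumption rather than the kernel's actual
limit.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical buffer size for this sketch; not the kernel's real limit. */
#define SKETCH_KSYM_NAME_LEN 128

/*
 * Build "bpf__<struct_ops_name>_<member_name>" into buf.
 * Returns 0 on success, -1 if the name does not fit (it would be truncated).
 */
static int sketch_ksym_name(char *buf, size_t size,
                            const char *st_ops_name, const char *member_name)
{
    int n = snprintf(buf, size, "bpf__%s_%s", st_ops_name, member_name);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With "tcp_congestion_ops" and "pkts_acked" as inputs this yields
bpf__tcp_congestion_ops_pkts_acked.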
>>
>>>> }
>>>>
>>>> if (st_ops->validate) {
>>>> @@ -790,6 +829,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>>>> unlock:
>>>> kfree(tlinks);
>>>> mutex_unlock(&st_map->lock);
>>>> + if (!err)
>>>> + bpf_struct_ops_map_ksyms_add(st_map);
>>>> return err;
>>>> }
>>>>
>>>> @@ -883,6 +924,10 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>>>> */
>>>> synchronize_rcu_mult(call_rcu, call_rcu_tasks);
>>>>
>>>> + /* no trampoline in the map is running anymore, delete symbols */
>>>> + bpf_struct_ops_map_ksyms_del(st_map);
>>>> + synchronize_rcu();
>>>> +
>>>
>>> This is substantial overhead and why ?
>>> synchronize_rcu_mult() is right above.
>>>
>>
>> I think we should ensure no trampoline is running or could run before
>> its ksym is deleted from the symbol table. If this order is not ensured,
>> a trampoline can be interrupted by a perf irq after its symbol is deleted,
>> resulting in a broken stacktrace since the trampoline symbol could not be
>> found by the perf irq handler.
>>
>> This patch deletes ksyms after synchronize_rcu_mult() to ensure this order.
>
> But the overhead is prohibitive. We had broken stacks with st_ops
> for a long time, so it may still hit 0.001% where st_ops are being switched
> as the comment in bpf_struct_ops_map_free() explains.
>
Got it
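
To make the accepted trade-off concrete: once a symbol is deleted, an address
inside a trampoline that is still running no longer resolves to a name. The
sketch below models that with a plain linear table. It is an assumption-laden
userspace stand-in (struct sketch_ksym and sketch_ksym_lookup() are
hypothetical), not how the kernel's ksym lookup is implemented.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for one kernel symbol entry. */
struct sketch_ksym {
    unsigned long start;   /* first address covered */
    unsigned long end;     /* one past the last address covered */
    const char *name;
    int live;              /* 0 once the symbol has been deleted */
};

/* Return the name of the live symbol covering addr, or NULL if none does. */
static const char *sketch_ksym_lookup(const struct sketch_ksym *tbl, size_t n,
                                      unsigned long addr)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].live && addr >= tbl[i].start && addr < tbl[i].end)
            return tbl[i].name;
    }
    return NULL;
}
```

If an entry is marked not live while code in [start, end) can still be
interrupted, the lookup returns NULL for that address, which is the rare
broken-stacktrace case discussed above.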
> As a separate clean up I would switch the freeing to call_rcu_tasks.
> Synchronous waiting is expensive.
>
> Martin,
>
> any suggestions?