netdev - Re: [PATCH bpf-next v2] bpf: Add kernel symbol for struct

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <51af2860-e448-4ed1-917c-5d195a4693b5@huaweicloud.com>
Date: Tue, 5 Nov 2024 17:29:43 +0800
From: Xu Kuohai <xukuohai@...weicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>,
 Martin KaFai Lau <martin.lau@...nel.org>
Cc: bpf <bpf@...r.kernel.org>, Network Development <netdev@...r.kernel.org>,
 Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
 Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau
 <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>,
 Yonghong Song <yonghong.song@...ux.dev>, Kui-Feng Lee <thinker.li@...il.com>
Subject: Re: [PATCH bpf-next v2] bpf: Add kernel symbol for struct_ops
 trampoline

On 11/5/2024 1:53 AM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 3:55 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>>
>>>>                   *(unsigned long *)(udata + moff) = prog->aux->id;
>>>> +
>>>> +               /* init ksym for this trampoline */
>>>> +               bpf_struct_ops_ksym_init(prog, image + trampoline_start,
>>>> +                                        image_off - trampoline_start,
>>>> +                                        ksym++);
>>>
>>> Thanks for the patch.
>>> I think it's overkill to add ksym for each callsite within a single
>>> trampoline.
>>> 1. The prog name will be next in the stack. No need to duplicate it.
>>> 2. ksym-ing callsites this way is quite unusual.
>>> 3. consider irq on other insns within a trampline.
>>>      The unwinder won't find anything in such a case.
>>>
>>> So I suggest to add only one ksym that covers the whole trampoline.
>>> The name could be "bpf_trampoline_structopsname"
>>> that is probably st_ops_desc->type.
>>>
>>
>> IIUC, the "whole trampoline" for a struct_ops is actually the page
>> array st_map->image_pages[MAX_TRAMP_IMAGE_PAGES], where each page is
>> allocated by arch_alloc_bpf_trampoline(PAGE_SIZE).
>>
>> Since the virtual addresses of these pages are *NOT* guaranteed to
>> be contiguous, I dont think we can create a single ksym for them.
>>
>> And if we add a ksym for each individual page, it seems we will end
>> up with an odd name for each ksym.
> 
> I see. Good point. Ok. Let's add ksym for each callback.
> 
>> Given that each page consists of one or more bpf trampolines, which
>> are not different from bpf trampolines for other prog types, such as
>> bpf trampolines for fentry, and since each bpf trampoline for other
>> prog types already has a ksym, I think it is not unusual to add ksym
>> for each single bpf trampoline in the page.
>>
>> And, there are no instructions between adjacent bpf trampolines within
>> a page, nothing between two trampolines can be interrupted.
>>
>> For the name, bpf_trampoline_<struct_ops_name>_<member_name>, like
>> bpf_trampoline_tcp_congestion_ops_pkts_acked, seems appropriate.
> 
> Agree. This naming convention makes sense.
> I'd only shorten the prefix to 'bpf_tramp_' or even 'bpf__'
> (with double underscore).
> It's kinda obvious that it's a trampoline and it's an implementation
> detail that doesn't need to be present in the name.
>

OK, 'bpf__' looks great.

>>
>>>>           }
>>>>
>>>>           if (st_ops->validate) {
>>>> @@ -790,6 +829,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>>>>    unlock:
>>>>           kfree(tlinks);
>>>>           mutex_unlock(&st_map->lock);
>>>> +       if (!err)
>>>> +               bpf_struct_ops_map_ksyms_add(st_map);
>>>>           return err;
>>>>    }
>>>>
>>>> @@ -883,6 +924,10 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>>>>            */
>>>>           synchronize_rcu_mult(call_rcu, call_rcu_tasks);
>>>>
>>>> +       /* no trampoline in the map is running anymore, delete symbols */
>>>> +       bpf_struct_ops_map_ksyms_del(st_map);
>>>> +       synchronize_rcu();
>>>> +
>>>
>>> This is substantial overhead and why ?
>>> synchronize_rcu_mult() is right above.
>>>
>>
>> I think we should ensure no trampoline is running or could run before
>> its ksym is deleted from the symbol table. If this order is not ensured,
>> a trampoline can be interrupted by a perf irq after its symbol is deleted,
>> resulting a broken stacktrace since the trampoline symbol cound not be
>> found by the perf irq handler.
>>
>> This patch deletes ksyms after synchronize_rcu_mult() to ensure this order.
> 
> But the overhead is prohibitive. We had broken stacks with st_ops
> for long time, so it may still hit 0.001% where st_ops are being switched
> as the comment in bpf_struct_ops_map_free() explains.
>

Got it

> As a separate clean up I would switch the freeing to call_rcu_tasks.
> Synchronous waiting is expensive.
> 
> Martin,
> 
> any suggestions?