linux-kernel - Re: [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for BPF_MAP_STACK


Message-ID: <f2fd90a9-bc7d-43b8-ac5e-9d233219dcfb@linux.dev>
Date: Fri, 19 Sep 2025 10:08:12 +0800
From: Tao Chen <chen.dylane@...ux.dev>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Andrii Nakryiko <andrii.nakryiko@...il.com>,
 Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
 John Fastabend <john.fastabend@...il.com>,
 Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau
 <martin.lau@...ux.dev>, Eduard <eddyz87@...il.com>,
 Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>,
 KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>,
 Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
 bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for
 BPF_MAP_STACK_TRACE

On 2025/9/19 10:01, Alexei Starovoitov wrote:
> On Thu, Sep 18, 2025 at 6:35 AM Tao Chen <chen.dylane@...ux.dev> wrote:
>>
>> On 2025/9/18 09:35, Alexei Starovoitov wrote:
>>> On Wed, Sep 17, 2025 at 3:16 PM Andrii Nakryiko
>>> <andrii.nakryiko@...il.com> wrote:
>>>>
>>>>
>>>> P.S. It seems like a good idea to switch STACKMAP to open addressing
>>>> instead of the current kind-of-bucket-chain-but-not-really
>>>> implementation. It's fixed size and pre-allocated already, so open
>>>> addressing seems like a great approach here, IMO.
>>>
>>> That makes sense. It won't have backward compat issues.
>>> Just more reliable stack_id.
>>>
>>> Fixed value_size is another footgun there.
>>> Especially for collecting user stack traces.
>>> We can switch the whole stackmap to bpf_mem_alloc()
>>> or wait for kmalloc_nolock().
>>> But it's probably a diminishing return.
>>>
>>> bpf_get_stack() also isn't great with a copy into
>>> perf_callchain_entry, then 2nd copy into on stack/percpu buf/ringbuf,
>>> and 3rd copy of correct size into ringbuf (optional).
>>>
>>> Also, I just realized we have another nasty race there.
>>> In the past bpf progs were run in preempt disabled context,
>>> but we forgot to adjust bpf_get_stack[id]() helpers when everything
>>> switched to migrate disable.
>>>
>>> The return value from get_perf_callchain() may be reused
>>> if another task preempts and requests the stack.
>>> We have a partially incorrect comment in __bpf_get_stack() too:
>>>           if (may_fault)
>>>                   rcu_read_lock(); /* need RCU for perf's callchain below */
>>>
>>> RCU can be preemptible, so rcu_read_lock() keeps
>>> trace = get_perf_callchain(...)
>>> accessible, but that per-cpu trace buffer can still be overwritten.
>>> It's not an issue for CONFIG_PREEMPT_NONE=y, but that doesn't
>>> give much comfort.
>>
>> Hi Alexei,
>>
>> Can we fix it like this?
>>
>> -       if (may_fault)
>> -               rcu_read_lock(); /* need RCU for perf's callchain below */
>> +       preempt_disable();
>>
>>           if (trace_in)
>>                   trace = trace_in;
>> @@ -455,8 +454,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
>>                                              crosstask, false);
>>
>>           if (unlikely(!trace) || trace->nr < skip) {
>> -               if (may_fault)
>> -                       rcu_read_unlock();
>> +               preempt_enable();
>>                   goto err_fault;
>>           }
>>
>> @@ -475,9 +473,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
>>                   memcpy(buf, ips, copy_len);
>>           }
>>
>> -       /* trace/ips should not be dereferenced after this point */
>> -       if (may_fault)
>> -               rcu_read_unlock();
>> +       preempt_enable();
> 
> That should do it. Don't see an issue at first glance.

OK, I will send a patch later, thanks.
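
For anyone following the thread, here is a minimal userspace sketch of the
race discussed above. It is only a single-CPU model, not kernel code: the
struct and every function name in it (callchain_entry, get_callchain,
try_preempt, model_preempt_disable/enable, capture_stack) are hypothetical
stand-ins, and "preemption" is simulated by a plain function call placed
between fetching the shared per-CPU entry and copying it out.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 4

/* Hypothetical stand-in for perf's per-CPU callchain entry. */
struct callchain_entry {
	int nr;
	unsigned long ip[MAX_DEPTH];
};

/* One shared buffer per "CPU"; this model has exactly one CPU. */
static struct callchain_entry percpu_entry;

static int preempt_count;          /* > 0 means preemption is disabled */
static unsigned long pending_task; /* task deferred while disabled, 0 = none */

/* Models get_perf_callchain(): fills the shared buffer, returns a pointer. */
static struct callchain_entry *get_callchain(unsigned long task_id)
{
	percpu_entry.nr = MAX_DEPTH;
	for (int i = 0; i < MAX_DEPTH; i++)
		percpu_entry.ip[i] = task_id * 0x1000 + i;
	return &percpu_entry;
}

/* Another task tries to run on this CPU and capture its own stack. */
static void try_preempt(unsigned long other_task)
{
	if (preempt_count > 0) {
		pending_task = other_task; /* deferred until preemption is on */
		return;
	}
	get_callchain(other_task);         /* overwrites the shared buffer */
}

static void model_preempt_disable(void)
{
	preempt_count++;
}

static void model_preempt_enable(void)
{
	if (--preempt_count == 0 && pending_task) {
		unsigned long t = pending_task;

		pending_task = 0;
		get_callchain(t);          /* the deferred task runs now */
	}
}

/*
 * Copies task_id's stack into out (MAX_DEPTH entries).
 * Returns 1 if the copy belongs to task_id, 0 if it was clobbered.
 */
static int capture_stack(unsigned long task_id, int disable_preempt,
			 unsigned long *out)
{
	struct callchain_entry *trace;

	if (disable_preempt)
		model_preempt_disable();

	trace = get_callchain(task_id);
	try_preempt(task_id + 1);          /* lands between fetch and copy */
	memcpy(out, trace->ip, sizeof(trace->ip));

	if (disable_preempt)
		model_preempt_enable();

	return out[0] == task_id * 0x1000;
}
```

Calling capture_stack(1, 0, buf) models the current behaviour, where the
other task overwrites the shared entry before the copy; capture_stack(1, 1,
buf) models the proposed preempt_disable()/preempt_enable() pairing, where
the other task only runs after the copy is done.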

-- 
Best Regards
Tao Chen