Message-ID: <CAADnVQLwV=fUkgLF3uTmevA97WX2FH4vG-7=97Px0H_WJOJieQ@mail.gmail.com>
Date: Thu, 18 Sep 2025 19:01:37 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Tao Chen <chen.dylane@...ux.dev>
Cc: Andrii Nakryiko <andrii.nakryiko@...il.com>, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>,
Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, Eduard <eddyz87@...il.com>,
Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
On Thu, Sep 18, 2025 at 6:35 AM Tao Chen <chen.dylane@...ux.dev> wrote:
>
> On 2025/9/18 09:35, Alexei Starovoitov wrote:
> > On Wed, Sep 17, 2025 at 3:16 PM Andrii Nakryiko
> > <andrii.nakryiko@...il.com> wrote:
> >>
> >>
> >> P.S. It seems like a good idea to switch STACKMAP to open addressing
> >> instead of the current kind-of-bucket-chain-but-not-really
> >> implementation. It's fixed size and pre-allocated already, so open
> >> addressing seems like a great approach here, IMO.
> >
> > That makes sense. It won't have backward compat issues.
> > Just more reliable stack_id.
> >
> > Fixed value_size is another footgun there.
> > Especially for collecting user stack traces.
> > We can switch the whole stackmap to bpf_mem_alloc()
> > or wait for kmalloc_nolock().
> > But it's probably a diminishing return.
> >
> > bpf_get_stack() also isn't great with a copy into
> > perf_callchain_entry, then 2nd copy into on stack/percpu buf/ringbuf,
> > and 3rd copy of correct size into ringbuf (optional).
> >
> > Also, I just realized we have another nasty race there.
> > In the past bpf progs were run in preempt disabled context,
> > but we forgot to adjust bpf_get_stack[id]() helpers when everything
> > switched to migrate disable.
> >
> > The return value from get_perf_callchain() may be reused
> > if another task preempts and requests the stack.
> > We have a partially incorrect comment in __bpf_get_stack() too:
> > if (may_fault)
> > rcu_read_lock(); /* need RCU for perf's callchain below */
> >
> > RCU can be preemptible, so rcu_read_lock() makes
> > trace = get_perf_callchain(...)
> > accessible, but that per-cpu trace buffer can be overwritten.
> > It's not an issue for CONFIG_PREEMPT_NONE=y, but that doesn't
> > give much comfort.
>
> Hi Alexei,
>
> Can we fix it like this?
>
> - if (may_fault)
> - rcu_read_lock(); /* need RCU for perf's callchain below */
> + preempt_disable();
>
> if (trace_in)
> trace = trace_in;
> @@ -455,8 +454,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> crosstask, false);
>
> if (unlikely(!trace) || trace->nr < skip) {
> - if (may_fault)
> - rcu_read_unlock();
> + preempt_enable();
> goto err_fault;
> }
>
> @@ -475,9 +473,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> memcpy(buf, ips, copy_len);
> }
>
> - /* trace/ips should not be dereferenced after this point */
> - if (may_fault)
> - rcu_read_unlock();
> + preempt_enable();
That should do it. Don't see an issue at first glance.
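
For readers following the thread, the race and the proposed fix can be
illustrated with a small user-space sketch. It is not the kernel code:
struct callchain_entry, cpu_entry, get_callchain(), other_task_runs() and
get_stack() are hypothetical stand-ins, a single "CPU" is modeled, and
preemption is simulated by a flag that decides whether another task may
run between obtaining the per-CPU pointer and copying out of it.

```c
#include <string.h>

#define MAX_DEPTH 4

/* Hypothetical stand-in for perf's per-CPU callchain entry. */
struct callchain_entry {
    unsigned int nr;
    unsigned long ip[MAX_DEPTH];
};

/* One scratch buffer per "CPU"; this sketch models a single CPU. */
static struct callchain_entry cpu_entry;

/* Models get_perf_callchain(): fill the shared buffer, return a pointer to it. */
static struct callchain_entry *get_callchain(unsigned long base)
{
    cpu_entry.nr = MAX_DEPTH;
    for (unsigned int i = 0; i < MAX_DEPTH; i++)
        cpu_entry.ip[i] = base + i;
    return &cpu_entry;
}

/* Models another task scheduled on the same CPU that requests its own stack. */
static void other_task_runs(void)
{
    get_callchain(0xbad0);
}

/*
 * Copy "task A"'s stack (frames 0x1000..0x1003) into buf.
 * With preempt_disabled == 0, another task may run inside the window
 * between get_callchain() and memcpy(), as with preemptible RCU.
 * With preempt_disabled == 1, the other task runs only after the copy.
 */
static void get_stack(unsigned long *buf, int preempt_disabled)
{
    struct callchain_entry *trace = get_callchain(0x1000);

    if (!preempt_disabled)
        other_task_runs();   /* simulated preemption inside the window */

    memcpy(buf, trace->ip, trace->nr * sizeof(buf[0]));

    if (preempt_disabled)
        other_task_runs();   /* simulated preemption after the copy */
}
```

With an unsigned long buf[MAX_DEPTH], get_stack(buf, 0) leaves
buf[0] == 0xbad0 (the other task's frames), while get_stack(buf, 1)
leaves buf[0] == 0x1000, which is the behavior the preempt_disable() /
preempt_enable() pair in the diff above is meant to guarantee.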