netdev - Re: [PATCH v2] net/cls_cgroup: Fix task_get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CALOAHbBq0+zNGx_yYCFVsacOZREs=8OBhGdiOBCK75k0YoPKOQ@mail.gmail.com>
Date: Sun, 31 Aug 2025 11:20:42 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: Paolo Abeni <pabeni@...hat.com>, davem@...emloft.net, edumazet@...gle.com, 
	kuba@...nel.org, horms@...nel.org, bigeasy@...utronix.de, tgraf@...g.ch, 
	paulmck@...nel.org, netdev@...r.kernel.org, bpf@...r.kernel.org, 
	Martin KaFai Lau <martin.lau@...ux.dev>
Subject: Re: [PATCH v2] net/cls_cgroup: Fix task_get_classid() during qdisc run

On Fri, Aug 29, 2025 at 4:14 PM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> On 8/29/25 5:23 AM, Yafang Shao wrote:
> > On Thu, Aug 28, 2025 at 3:55 PM Paolo Abeni <pabeni@...hat.com> wrote:
> >> On 8/22/25 8:42 AM, Yafang Shao wrote:
> >>> During recent testing with the netem qdisc to inject delays into TCP
> >>> traffic, we observed that our CLS BPF program failed to function correctly
> >>> due to incorrect classid retrieval from task_get_classid(). The issue
> >>> manifests in the following call stack:
> >>>
> >>>          bpf_get_cgroup_classid+5
> >>>          cls_bpf_classify+507
> >>>          __tcf_classify+90
> >>>          tcf_classify+217
> >>>          __dev_queue_xmit+798
> >>>          bond_dev_queue_xmit+43
> >>>          __bond_start_xmit+211
> >>>          bond_start_xmit+70
> >>>          dev_hard_start_xmit+142
> >>>          sch_direct_xmit+161
> >>>          __qdisc_run+102             <<<<< Issue location
> >>>          __dev_xmit_skb+1015
> >>>          __dev_queue_xmit+637
> >>>          neigh_hh_output+159
> >>>          ip_finish_output2+461
> >>>          __ip_finish_output+183
> >>>          ip_finish_output+41
> >>>          ip_output+120
> >>>          ip_local_out+94
> >>>          __ip_queue_xmit+394
> >>>          ip_queue_xmit+21
> >>>          __tcp_transmit_skb+2169
> >>>          tcp_write_xmit+959
> >>>          __tcp_push_pending_frames+55
> >>>          tcp_push+264
> >>>          tcp_sendmsg_locked+661
> >>>          tcp_sendmsg+45
> >>>          inet_sendmsg+67
> >>>          sock_sendmsg+98
> >>>          sock_write_iter+147
> >>>          vfs_write+786
> >>>          ksys_write+181
> >>>          __x64_sys_write+25
> >>>          do_syscall_64+56
> >>>          entry_SYSCALL_64_after_hwframe+100
> >>>
> >>> The problem occurs when multiple tasks share a single qdisc. In such cases,
> >>> __qdisc_run() may transmit skbs created by different tasks. Consequently,
> >>> task_get_classid() retrieves an incorrect classid since it references the
> >>> current task's context rather than the skb's originating task.
> >>>
> >>> Given that dev_queue_xmit() always executes with bh disabled, we can safely
> >>> use in_softirq() instead of in_serving_softirq() to properly identify the
> >>> softirq context and obtain the correct classid.
> >>>
> >>> The simple steps to reproduce this issue:
> >>> 1. Add network delay to the network interface:
> >>>    such as: tc qdisc add dev bond0 root netem delay 1.5ms
> >>> 2. Create two distinct net_cls cgroups, each running a network-intensive task
> >>> 3. Initiate parallel TCP streams from both tasks to external servers.
> >>>
> >>> Under this specific condition, the issue reliably occurs. The kernel
> >>> eventually dequeues an SKB that originated from Task-A while executing in
> >>> the context of Task-B.
> >>>
> >>> Signed-off-by: Yafang Shao <laoar.shao@...il.com>
> >>> Cc: Daniel Borkmann <daniel@...earbox.net>
> >>> Cc: Thomas Graf <tgraf@...g.ch>
> >>> Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> >>>
> >>> v1->v2: use softirq_count() instead of in_softirq()
> >>> ---
> >>>   include/net/cls_cgroup.h | 2 +-
> >>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
> >>> index 7e78e7d6f015..668aeee9b3f6 100644
> >>> --- a/include/net/cls_cgroup.h
> >>> +++ b/include/net/cls_cgroup.h
> >>> @@ -63,7 +63,7 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
> >>>         * calls by looking at the number of nested bh disable calls because
> >>>         * softirqs always disables bh.
> >>>         */
> >>> -     if (in_serving_softirq()) {
> >>> +     if (softirq_count()) {
> >>>                struct sock *sk = skb_to_full_sk(skb);
> >>>
> >>>                /* If there is an sock_cgroup_classid we'll use that. */
> >>
> >> AFAICS the above changes the established behavior for a slightly
> >> different scenario:
> >
> > right.
> >
> >> <sock S is created by task A>
> >> <class ID for task A is changed>
> >> <skb is created by sock S xmit and classified>
> >>
> >> prior to this patch the skb will be classified with the 'new' task A
> >> classid, now with the old/original one.
> >>
> >> I'm unsure if such behavior change is acceptable;
> >
> > The classid of a skb is only meaningful within its original network
> > context, not from a random task.
>
> Do you mean by original network context original netns? We also have
> bpf_skb_cgroup_classid() as well as bpf_get_cgroup_classid_curr(), both
> exposed to tcx, which kind of detangles what task_get_classid() is doing.
> I guess if you have apps in its own netns and the skb->sk is retained all
> the way to phys dev in hostns then bpf_skb_cgroup_classid() might be a
> better choice (assuming classid stays constant from container orchestrator
> PoV).

Right. We have replaced bpf_get_cgroup_classid() with
bpf_skb_cgroup_classid() to handle this case. Nonetheless, I believe
we still need to fix bpf_get_cgroup_classid(), since this function can
easily mislead users.

-- 
Regards
Yafang