netdev - Re: [PATCH v2] net/cls_cgroup: Fix task_get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <96158e58-da9a-4661-a47b-e7b85856ac90@iogearbox.net>
Date: Fri, 29 Aug 2025 10:14:13 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: Yafang Shao <laoar.shao@...il.com>, Paolo Abeni <pabeni@...hat.com>
Cc: davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
 horms@...nel.org, bigeasy@...utronix.de, tgraf@...g.ch, paulmck@...nel.org,
 netdev@...r.kernel.org, bpf@...r.kernel.org,
 Martin KaFai Lau <martin.lau@...ux.dev>
Subject: Re: [PATCH v2] net/cls_cgroup: Fix task_get_classid() during qdisc
 run

On 8/29/25 5:23 AM, Yafang Shao wrote:
> On Thu, Aug 28, 2025 at 3:55 PM Paolo Abeni <pabeni@...hat.com> wrote:
>> On 8/22/25 8:42 AM, Yafang Shao wrote:
>>> During recent testing with the netem qdisc to inject delays into TCP
>>> traffic, we observed that our CLS BPF program failed to function correctly
>>> due to incorrect classid retrieval from task_get_classid(). The issue
>>> manifests in the following call stack:
>>>
>>>          bpf_get_cgroup_classid+5
>>>          cls_bpf_classify+507
>>>          __tcf_classify+90
>>>          tcf_classify+217
>>>          __dev_queue_xmit+798
>>>          bond_dev_queue_xmit+43
>>>          __bond_start_xmit+211
>>>          bond_start_xmit+70
>>>          dev_hard_start_xmit+142
>>>          sch_direct_xmit+161
>>>          __qdisc_run+102             <<<<< Issue location
>>>          __dev_xmit_skb+1015
>>>          __dev_queue_xmit+637
>>>          neigh_hh_output+159
>>>          ip_finish_output2+461
>>>          __ip_finish_output+183
>>>          ip_finish_output+41
>>>          ip_output+120
>>>          ip_local_out+94
>>>          __ip_queue_xmit+394
>>>          ip_queue_xmit+21
>>>          __tcp_transmit_skb+2169
>>>          tcp_write_xmit+959
>>>          __tcp_push_pending_frames+55
>>>          tcp_push+264
>>>          tcp_sendmsg_locked+661
>>>          tcp_sendmsg+45
>>>          inet_sendmsg+67
>>>          sock_sendmsg+98
>>>          sock_write_iter+147
>>>          vfs_write+786
>>>          ksys_write+181
>>>          __x64_sys_write+25
>>>          do_syscall_64+56
>>>          entry_SYSCALL_64_after_hwframe+100
>>>
>>> The problem occurs when multiple tasks share a single qdisc. In such cases,
>>> __qdisc_run() may transmit skbs created by different tasks. Consequently,
>>> task_get_classid() retrieves an incorrect classid since it references the
>>> current task's context rather than the skb's originating task.
>>>
>>> Given that dev_queue_xmit() always executes with bh disabled, we can safely
>>> use in_softirq() instead of in_serving_softirq() to properly identify the
>>> softirq context and obtain the correct classid.
>>>
>>> The simple steps to reproduce this issue:
>>> 1. Add network delay to the network interface:
>>>    such as: tc qdisc add dev bond0 root netem delay 1.5ms
>>> 2. Create two distinct net_cls cgroups, each running a network-intensive task
>>> 3. Initiate parallel TCP streams from both tasks to external servers.
>>>
>>> Under this specific condition, the issue reliably occurs. The kernel
>>> eventually dequeues an SKB that originated from Task-A while executing in
>>> the context of Task-B.
>>>
>>> Signed-off-by: Yafang Shao <laoar.shao@...il.com>
>>> Cc: Daniel Borkmann <daniel@...earbox.net>
>>> Cc: Thomas Graf <tgraf@...g.ch>
>>> Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
>>>
>>> v1->v2: use softirq_count() instead of in_softirq()
>>> ---
>>>   include/net/cls_cgroup.h | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
>>> index 7e78e7d6f015..668aeee9b3f6 100644
>>> --- a/include/net/cls_cgroup.h
>>> +++ b/include/net/cls_cgroup.h
>>> @@ -63,7 +63,7 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
>>>         * calls by looking at the number of nested bh disable calls because
>>>         * softirqs always disables bh.
>>>         */
>>> -     if (in_serving_softirq()) {
>>> +     if (softirq_count()) {
>>>                struct sock *sk = skb_to_full_sk(skb);
>>>
>>>                /* If there is an sock_cgroup_classid we'll use that. */
>>
>> AFAICS the above changes the established behavior for a slightly
>> different scenario:
> 
> right.
> 
>> <sock S is created by task A>
>> <class ID for task A is changed>
>> <skb is created by sock S xmit and classified>
>>
>> prior to this patch the skb will be classified with the 'new' task A
>> classid, now with the old/original one.
>>
>> I'm unsure if such behavior change is acceptable;
> 
> The classid of a skb is only meaningful within its original network
> context, not from a random task.

Do you mean by original network context original netns? We also have
bpf_skb_cgroup_classid() as well as bpf_get_cgroup_classid_curr(), both
exposed to tcx, which kind of detangles what task_get_classid() is doing.
I guess if you have apps in its own netns and the skb->sk is retained all
the way to phys dev in hostns then bpf_skb_cgroup_classid() might be a
better choice (assuming classid stays constant from container orchestrator
PoV).

>> I think at very least
>> it should be mentioned in the changelog and likely this change should
>> target net-next.
> 
> Will add this to the commit log and tag it for net-next in the next version.