[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <96158e58-da9a-4661-a47b-e7b85856ac90@iogearbox.net>
Date: Fri, 29 Aug 2025 10:14:13 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: Yafang Shao <laoar.shao@...il.com>, Paolo Abeni <pabeni@...hat.com>
Cc: davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
horms@...nel.org, bigeasy@...utronix.de, tgraf@...g.ch, paulmck@...nel.org,
netdev@...r.kernel.org, bpf@...r.kernel.org,
Martin KaFai Lau <martin.lau@...ux.dev>
Subject: Re: [PATCH v2] net/cls_cgroup: Fix task_get_classid() during qdisc
run
On 8/29/25 5:23 AM, Yafang Shao wrote:
> On Thu, Aug 28, 2025 at 3:55 PM Paolo Abeni <pabeni@...hat.com> wrote:
>> On 8/22/25 8:42 AM, Yafang Shao wrote:
>>> During recent testing with the netem qdisc to inject delays into TCP
>>> traffic, we observed that our CLS BPF program failed to function correctly
>>> due to incorrect classid retrieval from task_get_classid(). The issue
>>> manifests in the following call stack:
>>>
>>> bpf_get_cgroup_classid+5
>>> cls_bpf_classify+507
>>> __tcf_classify+90
>>> tcf_classify+217
>>> __dev_queue_xmit+798
>>> bond_dev_queue_xmit+43
>>> __bond_start_xmit+211
>>> bond_start_xmit+70
>>> dev_hard_start_xmit+142
>>> sch_direct_xmit+161
>>> __qdisc_run+102 <<<<< Issue location
>>> __dev_xmit_skb+1015
>>> __dev_queue_xmit+637
>>> neigh_hh_output+159
>>> ip_finish_output2+461
>>> __ip_finish_output+183
>>> ip_finish_output+41
>>> ip_output+120
>>> ip_local_out+94
>>> __ip_queue_xmit+394
>>> ip_queue_xmit+21
>>> __tcp_transmit_skb+2169
>>> tcp_write_xmit+959
>>> __tcp_push_pending_frames+55
>>> tcp_push+264
>>> tcp_sendmsg_locked+661
>>> tcp_sendmsg+45
>>> inet_sendmsg+67
>>> sock_sendmsg+98
>>> sock_write_iter+147
>>> vfs_write+786
>>> ksys_write+181
>>> __x64_sys_write+25
>>> do_syscall_64+56
>>> entry_SYSCALL_64_after_hwframe+100
>>>
>>> The problem occurs when multiple tasks share a single qdisc. In such cases,
>>> __qdisc_run() may transmit skbs created by different tasks. Consequently,
>>> task_get_classid() retrieves an incorrect classid since it references the
>>> current task's context rather than the skb's originating task.
>>>
>>> Given that dev_queue_xmit() always executes with bh disabled, we can safely
>>> use in_softirq() instead of in_serving_softirq() to properly identify the
>>> softirq context and obtain the correct classid.
>>>
>>> The simple steps to reproduce this issue:
>>> 1. Add network delay to the network interface:
>>> such as: tc qdisc add dev bond0 root netem delay 1.5ms
>>> 2. Create two distinct net_cls cgroups, each running a network-intensive task
>>> 3. Initiate parallel TCP streams from both tasks to external servers.
>>>
>>> Under this specific condition, the issue reliably occurs. The kernel
>>> eventually dequeues an SKB that originated from Task-A while executing in
>>> the context of Task-B.
>>>
>>> Signed-off-by: Yafang Shao <laoar.shao@...il.com>
>>> Cc: Daniel Borkmann <daniel@...earbox.net>
>>> Cc: Thomas Graf <tgraf@...g.ch>
>>> Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
>>>
>>> v1->v2: use softirq_count() instead of in_softirq()
>>> ---
>>> include/net/cls_cgroup.h | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
>>> index 7e78e7d6f015..668aeee9b3f6 100644
>>> --- a/include/net/cls_cgroup.h
>>> +++ b/include/net/cls_cgroup.h
>>> @@ -63,7 +63,7 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
>>> * calls by looking at the number of nested bh disable calls because
>>> * softirqs always disables bh.
>>> */
>>> - if (in_serving_softirq()) {
>>> + if (softirq_count()) {
>>> struct sock *sk = skb_to_full_sk(skb);
>>>
>>> /* If there is an sock_cgroup_classid we'll use that. */
>>
>> AFAICS the above changes the established behavior for a slightly
>> different scenario:
>
> right.
>
>> <sock S is created by task A>
>> <class ID for task A is changed>
>> <skb is created by sock S xmit and classified>
>>
>> prior to this patch the skb will be classified with the 'new' task A
>> classid, now with the old/original one.
>>
>> I'm unsure if such behavior change is acceptable;
>
> The classid of a skb is only meaningful within its original network
> context, not from a random task.
Do you mean by original network context original netns? We also have
bpf_skb_cgroup_classid() as well as bpf_get_cgroup_classid_curr(), both
exposed to tcx, which kind of detangles what task_get_classid() is doing.
I guess if you have apps in its own netns and the skb->sk is retained all
the way to phys dev in hostns then bpf_skb_cgroup_classid() might be a
better choice (assuming classid stays constant from container orchestrator
PoV).
>> I think at very least
>> it should be mentioned in the changelog and likely this change should
>> target net-next.
>
> Will add this to the commit log and tag it for net-next in the next version.
Powered by blists - more mailing lists