linux-kernel - Re: [PATCH v4 2/3] af_packet: fix soft lockup issue caused by tpacket

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <515fc9c6-a4a2-4fdf-8d91-396e42c95767@163.com>
Date: Fri, 11 Jul 2025 15:20:20 +0800
From: luyun <luyun_611@....com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>, davem@...emloft.net,
 edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com, horms@...nel.org
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 2/3] af_packet: fix soft lockup issue caused by
 tpacket_snd()


在 2025/7/10 21:49, Willem de Bruijn 写道:
> Yun Lu wrote:
>> From: Yun Lu <luyun@...inos.cn>
>>
>> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
>> pending_refcnt to decrement to zero before returning. The pending_refcnt
>> is decremented by 1 when the skb->destructor function is called,
>> indicating that the skb has been successfully sent and needs to be
>> destroyed.
>>
>> If an error occurs during this process, the tpacket_snd() function will
>> exit and return error, but pending_refcnt may not yet have decremented to
>> zero. Assuming the next send operation is executed immediately, but there
>> are no available frames to be sent in tx_ring (i.e., packet_current_frame
>> returns NULL), and skb is also NULL, the function will not execute
>> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
>> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
>> if the previous skb has completed transmission, the skb->destructor
>> function can only be invoked in the ksoftirqd thread (assuming NAPI
>> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
>> operation happen to run on the same CPU, and the CPU trapped in the
>> do-while loop without yielding, the ksoftirqd thread will not get
>> scheduled to run. As a result, pending_refcnt will never be reduced to
>> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
>> lockup issue.
>>
>> In fact, skb is true for all but the first iterations of that loop, and
>> as long as pending_refcnt is not zero, even if incremented by a previous
>> call, wait_for_completion_interruptible_timeout() should be executed to
>> yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore,
>> the execution condition of this function should be modified to check if
>> pending_refcnt is not zero, instead of check skb.
>>
>> As a result, packet_read_pending() may be called twice in the loop. This
>> will be optimized in the following patch.
>>
>> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
>> Cc: stable@...nel.org
>> Suggested-by: LongJun Tang <tanglongjun@...inos.cn>
>> Signed-off-by: Yun Lu <luyun@...inos.cn>
>>
>> ---
>> Changes in v4:
>> - Split to the fix alone. Thanks: Willem de Bruijn.
>> - Link to v3: https://lore.kernel.org/all/20250709095653.62469-3-luyun_611@163.com/
>>
>> Changes in v3:
>> - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet.
>> - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/
>>
>> Changes in v2:
>> - Add a Fixes tag.
>> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
>> ---
>> ---
>>   net/packet/af_packet.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 7089b8c2a655..581a96ec8e1a 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -2846,7 +2846,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>>   		ph = packet_current_frame(po, &po->tx_ring,
>>   					  TP_STATUS_SEND_REQUEST);
>>   		if (unlikely(ph == NULL)) {
>> -			if (need_wait && skb) {
>> +			if (need_wait && packet_read_pending(&po->tx_ring)) {
> Unfortunately I did not immediately fully appreciate Eric's
> suggestion.
>
> My comments was
>
>      If [..] the extra packet_read_pending() is already present, not
>      newly introduced with the fix
>
> But of course that expensive call is newly introduced, so my
> suggestion was invalid.
>
> It's btw also not possible to mix net and net-next patches in a single
> series like this (see Documentation/process/maintainer-netdev.rst).

Sorry, I misunderstood your comments. In the next version, I will 
combine the second and third patches together.

>
> But, instead of going back entirely to v2, perhaps we can make the
> logic a bit more obvious by just having a while (1) at the end to show
> that the only way to exit the loop (except errors) is in the ph == NULL
> branch. And break in that loop directly.
>
> There are two other ways to reach that while statement. A continue
> on PACKET_SOCK_TP_LOSS, or by regular control flow. In both cases, ph
> is non-zero, so the condition is true anyway.

Following your suggestion, I tried modifying the code (as shown below),  
now the loop condition is still the same as origin, but the logic is now 
clearer and more obvious.

Thanks.

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7089b8c2a655..be608f07441f 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2846,15 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, 
struct msghdr *msg)
                 ph = packet_current_frame(po, &po->tx_ring,
                                           TP_STATUS_SEND_REQUEST);
                 if (unlikely(ph == NULL)) {
-                       if (need_wait && skb) {
+                       /* Note: packet_read_pending() might be slow if we
+                        * have to call it as it's per_cpu variable, but in
+                        * fast-path we don't have to call it, only when ph
+                        * is NULL, we need to check the pending_refcnt.
+                        */
+                       if (need_wait && 
packet_read_pending(&po->tx_ring)) {
                                 timeo = 
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
                                 if (timeo <= 0) {
                                         err = !timeo ? -ETIMEDOUT : 
-ERESTARTSYS;
                                         goto out_put;
                                 }
-                       }
-                       /* check for additional frames */
-                       continue;
+                               /* check for additional frames */
+                               continue;
+                       } else
+                               break;
                 }

                 skb = NULL;
@@ -2943,14 +2949,7 @@ static int tpacket_snd(struct packet_sock *po, 
struct msghdr *msg)
                 }
                 packet_increment_head(&po->tx_ring);
                 len_sum += tp_len;
-       } while (likely((ph != NULL) ||
-               /* Note: packet_read_pending() might be slow if we have
-                * to call it as it's per_cpu variable, but in fast-path
-                * we already short-circuit the loop with the first
-                * condition, and luckily don't have to go that path
-                * anyway.
-                */
-                (need_wait && packet_read_pending(&po->tx_ring))));
+       } while (1);

         err = len_sum;
         goto out_put;



>
>>   				timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>>   				if (timeo <= 0) {
>>   					err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
>> -- 
>> 2.43.0
>>