Message-ID: <f9c1e53b-7956-44d2-8d8c-20dfd1671242@163.com>
Date: Thu, 10 Jul 2025 10:36:46 +0800
From: luyun <luyun_611@....com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>, davem@...emloft.net,
edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 2/2] af_packet: fix soft lockup issue caused by
tpacket_snd()
On 2025/7/10 05:14, Willem de Bruijn wrote:
> Yun Lu wrote:
>> From: Yun Lu <luyun@...inos.cn>
>>
>> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
>> pending_refcnt to decrement to zero before returning. The pending_refcnt
>> is decremented by 1 when the skb->destructor function is called,
>> indicating that the skb has been successfully sent and needs to be
>> destroyed.
>>
>> If an error occurs during this process, the tpacket_snd() function will
>> exit and return error, but pending_refcnt may not yet have decremented to
>> zero. Assuming the next send operation is executed immediately, but there
>> are no available frames to be sent in tx_ring (i.e., packet_current_frame
>> returns NULL), and skb is also NULL
> This is a very specific edge case. And arguably the goal is to wait
> for any pending skbs still, even if from a previous call.
>
> skb is true for all but the first iterations of that loop. So your
> earlier patch
>
> - if (need_wait && skb) {
> + if (need_wait && packet_read_pending(&po->tx_ring)) {
>
> Is more concise and more obviously correct.
>
>> , the function will not execute
>> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
>> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
>> if the previous skb has completed transmission, the skb->destructor
>> function can only be invoked in the ksoftirqd thread (assuming NAPI
>> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
>> operation happen to run on the same CPU, and the CPU is trapped in the
>> do-while loop without yielding, the ksoftirqd thread will not get
>> scheduled to run.
> Interestingly, this is quite similar to the issue that caused adding
> the completion in the first place. Commit 89ed5b519004 ("af_packet:
> Block execution of tasks waiting for transmit to complete in
> AF_PACKET") added the completion because a SCHED_FIFO task could delay
> ksoftirqd indefinitely.
>
>> As a result, pending_refcnt will never be reduced to
>> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
>> lockup issue.
>>
>> In fact, as long as pending_refcnt is not zero, even if skb is NULL,
>> wait_for_completion_interruptible_timeout() should be executed to yield
>> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, move
>> the pending_refcnt check to the start of the do-while loop, and reuse ph
>> to continue for the next iteration.
>>
>> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
>> Cc: stable@...nel.org
>> Suggested-by: LongJun Tang <tanglongjun@...inos.cn>
>> Signed-off-by: Yun Lu <luyun@...inos.cn>
>>
>> ---
>> Changes in v3:
>> - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet.
>> - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/
> If the fix alone is more obvious without this optimization, and
> the extra packet_read_pending() is already present, not newly
> introduced with the fix, then I would prefer to split the fix (to net,
> and stable) from the optimization (to net-next).
Alright, following your suggestion, I will split this patch into two in
the next version: one that fixes the issue (as in the first version, for
net and stable) and one that optimizes the code (for net-next).
Thanks for your review and suggestion.
>
>> Changes in v2:
>> - Add a Fixes tag.
>> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
>> ---
>> net/packet/af_packet.c | 21 ++++++++++++---------
>> 1 file changed, 12 insertions(+), 9 deletions(-)
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 7089b8c2a655..89a5d2a3a720 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -2846,11 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>> ph = packet_current_frame(po, &po->tx_ring,
>> TP_STATUS_SEND_REQUEST);
>> if (unlikely(ph == NULL)) {
>> - if (need_wait && skb) {
>> + /* Note: packet_read_pending() might be slow if we
>> + * have to call it as it's per_cpu variable, but in
>> + * fast-path we don't have to call it, only when ph
>> + * is NULL, we need to check pending_refcnt.
>> + */
>> + if (need_wait && packet_read_pending(&po->tx_ring)) {
>> timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>> if (timeo <= 0) {
>> err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
>> goto out_put;
>> + } else {
>> + /* Just reuse ph to continue for the next iteration, and
>> + * ph will be reassigned at the start of the next iteration.
>> + */
>> + ph = (void *)1;
>> }
>> }
>> /* check for additional frames */
>> @@ -2943,14 +2953,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>> }
>> packet_increment_head(&po->tx_ring);
>> len_sum += tp_len;
>> - } while (likely((ph != NULL) ||
>> - /* Note: packet_read_pending() might be slow if we have
>> - * to call it as it's per_cpu variable, but in fast-path
>> - * we already short-circuit the loop with the first
>> - * condition, and luckily don't have to go that path
>> - * anyway.
>> - */
>> - (need_wait && packet_read_pending(&po->tx_ring))));
>> + } while (likely(ph != NULL));
>>
>> err = len_sum;
>> goto out_put;
>> --
>> 2.43.0
>>
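
For reference, the edge case discussed above can be modelled in a few lines
of userspace C. This is only a rough sketch of the two wait conditions, not
kernel code: struct tx_state and every function name below are hypothetical
stand-ins, and the fields merely mirror need_wait, skb and the value returned
by packet_read_pending() in tpacket_snd().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names exist in af_packet.c. */
struct tx_state {
	bool need_wait;       /* models !(msg->msg_flags & MSG_DONTWAIT) */
	const void *skb;      /* skb built by the current call, NULL if none */
	unsigned int pending; /* models packet_read_pending(&po->tx_ring) */
};

/* Condition before the fix: sleep only if this call already built an skb. */
static bool should_block_old(const struct tx_state *s)
{
	return s->need_wait && s->skb != NULL;
}

/* Condition after the fix: sleep whenever frames are still pending. */
static bool should_block_new(const struct tx_state *s)
{
	return s->need_wait && s->pending != 0;
}

/*
 * With ph == NULL, the old loop keeps iterating while
 * (need_wait && pending != 0). If it does not sleep in that state,
 * it spins without yielding the CPU.
 */
static bool spins_without_yield_old(const struct tx_state *s)
{
	return !should_block_old(s) && s->need_wait && s->pending != 0;
}

/* Usage example: a previous call failed and left one frame pending. */
static void check_edge_case(void)
{
	struct tx_state s = { .need_wait = true, .skb = NULL, .pending = 1 };

	assert(spins_without_yield_old(&s)); /* old condition never sleeps */
	assert(should_block_new(&s));        /* new condition sleeps */
}
```

In this model, the state with need_wait set, skb == NULL and a non-zero
pending count is the only one where the two conditions disagree in a way
that matters, which matches the scenario described in the commit message.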