[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ca2f65f5-0ec3-4bd2-af8b-530b9feff2ca@163.com>
Date: Wed, 9 Jul 2025 14:31:16 +0800
From: luyun <luyun_611@....com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: willemdebruijn.kernel@...il.com, davem@...emloft.net, kuba@...nel.org,
pabeni@...hat.com, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] af_packet: fix soft lockup issue caused by
tpacket_snd()
在 2025/7/8 15:12, Eric Dumazet 写道:
> On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@....com> wrote:
>> From: Yun Lu <luyun@...inos.cn>
>>
>> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
>> pending_refcnt to decrement to zero before returning. The pending_refcnt
>> is decremented by 1 when the skb->destructor function is called,
>> indicating that the skb has been successfully sent and needs to be
>> destroyed.
>>
>> If an error occurs during this process, the tpacket_snd() function will
>> exit and return error, but pending_refcnt may not yet have decremented to
>> zero. Assuming the next send operation is executed immediately, but there
>> are no available frames to be sent in tx_ring (i.e., packet_current_frame
>> returns NULL), and skb is also NULL, the function will not execute
>> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
>> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
>> if the previous skb has completed transmission, the skb->destructor
>> function can only be invoked in the ksoftirqd thread (assuming NAPI
>> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
>> operation happen to run on the same CPU, and the CPU trapped in the
>> do-while loop without yielding, the ksoftirqd thread will not get
>> scheduled to run. As a result, pending_refcnt will never be reduced to
>> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
>> lockup issue.
>>
>> In fact, as long as pending_refcnt is not zero, even if skb is NULL,
>> wait_for_completion_interruptible_timeout() should be executed to yield
>> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the
>> execution condition of this function should be modified to check if
>> pending_refcnt is not zero.
>>
>> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
>> Suggested-by: LongJun Tang <tanglongjun@...inos.cn>
>> Signed-off-by: Yun Lu <luyun@...inos.cn>
>>
>> ---
>> Changes in v2:
>> - Add a Fixes tag.
>> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
>> ---
>> net/packet/af_packet.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 3d43f3eae759..7df96311adb8 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>> ph = packet_current_frame(po, &po->tx_ring,
>> TP_STATUS_SEND_REQUEST);
>> if (unlikely(ph == NULL)) {
>> - if (need_wait && skb) {
>> + if (need_wait && packet_read_pending(&po->tx_ring)) {
>> timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
>> timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>> if (timeo <= 0) {
> packet_read_pending() is super expensive on hosts with 256 cpus (or more)
>
> We are going to call it a second time at the end of the block:
>
> do { ...
> } while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)...
>
> Perhaps we can remove the second one ?
As mentioned in the commit message, the soft lockup issue only occurs
when tpacket_snd() is called to send, with the pending_refcnt is
non-zero, and there are no available packets in the tx_ring.
Therefore, at the first start of the loop, packet_read_pending() only
needs to be called once. If the return result is already 0, the loop can
exit directly. Otherwise, wait_for_completion_interruptible_timeout()
needs to be executed for waiting. Later, this function should only be
called at the end to check whether the loop can exit.
>
> Also I think there is another problem with the code.
>
> We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO
> constraint could be way off.
Yes, due to the changes in commit 581073f626e3 ("af_packet: do not call
packet_read_pending() from tpacket_destruct_skb()"), every time
tpacket_destruct_skb is executed, the skb_completion is marked as
completed. When wait_for_completion_interruptible_timeout returns
completed, the pending_refcnt has not yet been reduced to 0. Therefore,
when ph is NULL, the wait function may need to be called multiple times
untill packet_read_pending finally returns 0.
I have revised the code (as shown below), and it seems to be more
reasonable. I also look forward to any better suggestions you may have.
Thank you!
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7df96311adb8..401ae8f6481b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2785,7 +2785,9 @@ static int tpacket_snd(struct packet_sock *po,
struct msghdr *msg)
int len_sum = 0;
int status = TP_STATUS_AVAILABLE;
int hlen, tlen, copylen = 0;
- long timeo = 0;
+ long timeo;
+ unsigned int pending;
+ bool first = true;
mutex_lock(&po->pg_vec_lock);
@@ -2839,18 +2841,27 @@ static int tpacket_snd(struct packet_sock *po,
struct msghdr *msg)
if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz)
size_max = dev->mtu + reserve + VLAN_HLEN;
+ timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
reinit_completion(&po->skb_completion);
do {
ph = packet_current_frame(po, &po->tx_ring,
TP_STATUS_SEND_REQUEST);
if (unlikely(ph == NULL)) {
- if (need_wait &&
packet_read_pending(&po->tx_ring)) {
- timeo = sock_sndtimeo(&po->sk,
msg->msg_flags & MSG_DONTWAIT);
- timeo =
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
- if (timeo <= 0) {
- err = !timeo ? -ETIMEDOUT :
-ERESTARTSYS;
- goto out_put;
+ if (need_wait) {
+ if (skb == NULL && fisrt) {
+ pending =
packet_read_pending(&po->tx_ring);
+ if (!pending)
+ goto out_put;
+ else
+ first = false;
+ }
+ if (skb || pending) {
+ timeo =
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
+ if (timeo <= 0) {
+ err = !timeo ?
-ETIMEDOUT : -ERESTARTSYS;
+ goto out_put;
+ }
}
}
/* check for additional frames */
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5
> 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po,
> struct msghdr *msg)
> int len_sum = 0;
> int status = TP_STATUS_AVAILABLE;
> int hlen, tlen, copylen = 0;
> - long timeo = 0;
> + long timeo;
>
> + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
> mutex_lock(&po->pg_vec_lock);
>
> /* packet_sendmsg() check on tx_ring.pg_vec was lockless,
> @@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po,
> struct msghdr *msg)
> TP_STATUS_SEND_REQUEST);
> if (unlikely(ph == NULL)) {
> if (need_wait && skb) {
> - timeo = sock_sndtimeo(&po->sk,
> msg->msg_flags & MSG_DONTWAIT);
> timeo =
> wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
> if (timeo <= 0) {
> err = !timeo ? -ETIMEDOUT :
> -ERESTARTSYS;
Powered by blists - more mailing lists