Message-ID: <CANn89i+6g+VwByu-xeJ-PVuaw8X_yQdC2buB7q=YO5S3MzMTUw@mail.gmail.com>
Date: Tue, 8 Jul 2025 00:12:39 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Yun Lu <luyun_611@....com>
Cc: willemdebruijn.kernel@...il.com, davem@...emloft.net, kuba@...nel.org, 
	pabeni@...hat.com, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] af_packet: fix soft lockup issue caused by tpacket_snd()

On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@....com> wrote:
>
> From: Yun Lu <luyun@...inos.cn>
>
> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
> pending_refcnt to decrement to zero before returning. The pending_refcnt
> is decremented by 1 when the skb->destructor function is called,
> indicating that the skb has been successfully sent and needs to be
> destroyed.
>
> If an error occurs during this process, the tpacket_snd() function will
> exit and return an error, but pending_refcnt may not yet have decremented to
> zero. Assuming the next send operation is executed immediately, but there
> are no available frames to be sent in tx_ring (i.e., packet_current_frame
> returns NULL), and skb is also NULL, the function will not execute
> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
> if the previous skb has completed transmission, the skb->destructor
> function can only be invoked in the ksoftirqd thread (assuming NAPI
> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
> operation happen to run on the same CPU, and the CPU is trapped in the
> do-while loop without yielding, the ksoftirqd thread will not get
> scheduled to run. As a result, pending_refcnt will never be reduced to
> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
> lockup issue.
>
> In fact, as long as pending_refcnt is not zero, even if skb is NULL,
> wait_for_completion_interruptible_timeout() should be executed to yield
> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the
> execution condition of this function should be modified to check if
> pending_refcnt is not zero.
>
> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
> Suggested-by: LongJun Tang <tanglongjun@...inos.cn>
> Signed-off-by: Yun Lu <luyun@...inos.cn>
>
> ---
> Changes in v2:
> - Add a Fixes tag.
> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
> ---
>  net/packet/af_packet.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 3d43f3eae759..7df96311adb8 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>                 ph = packet_current_frame(po, &po->tx_ring,
>                                           TP_STATUS_SEND_REQUEST);
>                 if (unlikely(ph == NULL)) {
> -                       if (need_wait && skb) {
> +                       if (need_wait && packet_read_pending(&po->tx_ring)) {
>                                 timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
>                                 timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>                                 if (timeo <= 0) {

packet_read_pending() is super expensive on hosts with 256 CPUs (or more).

We are going to call it a second time at the end of the block:

do { ...
} while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)...

Perhaps we can remove the second one?
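
To illustrate the cost being described, here is a minimal userspace
sketch. It assumes the pending count is kept as one counter per CPU and
summed on every read; MODEL_NR_CPUS, struct model_pending and the
model_*() helpers are hypothetical names for this model, not the
kernel's API.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 256 /* hypothetical CPU count for this model */

/* One slot per CPU; a stand-in for a real per-CPU allocation. */
struct model_pending {
	int slot[MODEL_NR_CPUS];
};

/* Writer side: touches only the given CPU's slot, so it is O(1). */
static void model_inc_pending(struct model_pending *p, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	p->slot[cpu]++;
}

/* A decrement may land on a different CPU than the matching increment,
 * so a single slot can go negative; only the sum is meaningful.
 */
static void model_dec_pending(struct model_pending *p, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	p->slot[cpu]--;
}

/* Reader side: has to visit every slot. Returns the sum and, if
 * visited is non-NULL, stores how many slots were read.
 */
static int model_read_pending(const struct model_pending *p, size_t *visited)
{
	int sum = 0;
	size_t n = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		sum += p->slot[cpu];
		n++;
	}
	if (visited)
		*visited = n;
	return sum;
}
```

In this model every read walks all MODEL_NR_CPUS slots while an
increment or decrement touches one, so two reads per loop iteration
double the dominant cost.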

Also I think there is another problem with the code.

We should call sock_sndtimeo() only once. Reloading it before every wait
restarts the full timeout each time, so the SO_SNDTIMEO constraint could
be way off.
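
A small userspace sketch of the timeout arithmetic may help here. It
assumes each wait returns the ticks left in its budget (0 on timeout),
which is how wait_for_completion_interruptible_timeout() reports a
timed wait; model_wait(), total_wait_reload() and total_wait_once() are
hypothetical helpers for this model only.

```c
#include <assert.h>

/* Hypothetical stand-in for a timed wait: the completion fires after
 * `elapsed` ticks. Returns the ticks left in `budget`, or 0 if the
 * budget ran out first.
 */
static long model_wait(long budget, long elapsed)
{
	return elapsed < budget ? budget - elapsed : 0;
}

/* Timeout reloaded before every wait: each round gets the full budget.
 * Returns the total ticks spent waiting over up to `rounds` waits.
 */
static long total_wait_reload(long sndtimeo, long elapsed, int rounds)
{
	long total = 0;

	for (int i = 0; i < rounds; i++) {
		long timeo = sndtimeo; /* reset on every iteration */
		long left = model_wait(timeo, elapsed);

		total += timeo - left;
		if (left == 0)
			break; /* timed out */
	}
	return total;
}

/* Timeout loaded once: every wait consumes the same shrinking budget,
 * so the total can never exceed sndtimeo.
 */
static long total_wait_once(long sndtimeo, long elapsed, int rounds)
{
	long total = 0;
	long timeo = sndtimeo; /* loaded a single time */

	for (int i = 0; i < rounds; i++) {
		long left = model_wait(timeo, elapsed);

		total += timeo - left;
		timeo = left;
		if (timeo == 0)
			break; /* timed out */
	}
	return total;
}
```

With a budget of 100 ticks and five waits of 60 ticks each, the
reloading variant waits 300 ticks in total, while the load-once variant
stops at 100.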

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
        int len_sum = 0;
        int status = TP_STATUS_AVAILABLE;
        int hlen, tlen, copylen = 0;
-       long timeo = 0;
+       long timeo;

+       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
        mutex_lock(&po->pg_vec_lock);

        /* packet_sendmsg() check on tx_ring.pg_vec was lockless,
@@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
                                          TP_STATUS_SEND_REQUEST);
                if (unlikely(ph == NULL)) {
                        if (need_wait && skb) {
-                               timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
                                timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
                                if (timeo <= 0) {
                                        err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;