Message-ID: <CADVnQykpHsN1rPJobKVfFGwtAJ9qwPrwG21HiunHqfykxyPD1g@mail.gmail.com>
Date: Mon, 24 Feb 2025 16:13:05 -0500
From: Neal Cardwell <ncardwell@...gle.com>
To: Youngmin Nam <youngmin.nam@...sung.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, davem@...emloft.net, 
	dsahern@...nel.org, pabeni@...hat.com, horms@...nel.org, 
	guo88.liu@...sung.com, yiwang.cai@...sung.com, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org, joonki.min@...sung.com, hajun.sung@...sung.com, 
	d7271.choe@...sung.com, sw.ju@...sung.com, 
	"Dujeong.lee" <dujeong.lee@...sung.com>, Yuchung Cheng <ycheng@...gle.com>, Kevin Yang <yyd@...gle.com>
Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON

On Mon, Feb 3, 2025 at 12:17 AM Youngmin Nam <youngmin.nam@...sung.com> wrote:
>
> > Hi Neal,
> > Thank you for looking into this issue.
> > When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
> > We can provide any information you would like to inspect.

Thanks again for raising this issue, and providing all that data!

I've come up with a reproducer for this issue, an explanation for why
it has only been seen on Android so far, a theory about a related
socket leak issue, and a proposed fix for both the WARN and the
socket leak.

Here is the scenario:

+ user process A has a socket in TCP_ESTABLISHED

+ user process A calls close(fd)

+ socket calls __tcp_close() and tcp_close_state() decides to enter
TCP_FIN_WAIT1 and send a FIN

+ FIN is lost and retransmitted, making the state:
---
 tp->packets_out = 1
 tp->sacked_out = 0
 tp->lost_out = 1
 tp->retrans_out = 1
---

+ someone invokes "ss" to --kill the socket, using the functionality
added in commit c1e64e298b8c ("net: diag: Support destroying TCP
sockets"):

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1e64e298b8cad309091b95d8436a0255c84f54a

 (note: this feature was added for Android, so it would not be
surprising for an inet_diag --kill to be run on Android)

+ the ss --kill causes a call to tcp_abort()

+ tcp_abort() calls tcp_write_queue_purge()

+ tcp_write_queue_purge() sets packets_out=0 but leaves lost_out=1,
retrans_out=1

+ tcp_sock still exists in TCP_FIN_WAIT1 but now with an inconsistent state

+ ACK arrives and causes a WARN_ON from tcp_verify_left_out():

#define tcp_verify_left_out(tp) WARN_ON(tcp_left_out(tp) > tp->packets_out)
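
For reference, tcp_left_out() is this trivial helper in
include/net/tcp.h:

---
static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
{
	return tp->sacked_out + tp->lost_out;
}
---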

The WARN fires because the state has:

---
 tcp_left_out(tp) = sacked_out + lost_out = 1
 tp->packets_out = 0
---

because the state is:

---
 tp->packets_out = 0
 tp->sacked_out = 0
 tp->lost_out = 1
 tp->retrans_out = 1
---

I guess one possible fix would be to just have
tcp_write_queue_purge() zero out those other fields:

---
 tp->sacked_out = 0
 tp->lost_out = 0
 tp->retrans_out = 0
---
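
Concretely, that could look something like this at the tail of
tcp_write_queue_purge() in net/ipv4/tcp.c (an untested sketch, with
the unrelated parts of the function elided):

---
void tcp_write_queue_purge(struct sock *sk)
{
	...
	tcp_sk(sk)->packets_out = 0;
	/* Also clear the counters that tcp_left_out() sums, so the
	 * scoreboard stays consistent with packets_out == 0:
	 */
	tcp_sk(sk)->sacked_out = 0;
	tcp_sk(sk)->lost_out = 0;
	tcp_sk(sk)->retrans_out = 0;
	inet_csk(sk)->icsk_backoff = 0;
}
---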

However, there is a related and worse problem. Because this killed
socket now has tp->packets_out == 0, the next time the RTO timer
fires, tcp_retransmit_timer() sees that !tp->packets_out is true, so
it short-circuits and returns without setting another RTO timer or
checking whether the socket should be deleted. The tcp_sock is then
sitting in memory with no timer set to delete it, so we could leak a
socket this way.
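
For reference, the short-circuit in question is the check near the
top of tcp_retransmit_timer() in net/ipv4/tcp_timer.c:

---
	if (!tp->packets_out)
		return;
---

so once packets_out has been zeroed by the purge, the timer path
never runs for this socket again.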
So AFAICT, to fix this socket leak problem, perhaps we want a patch
like the following (not tested yet), so that we immediately delete
all killed sockets, whether or not they are SOCK_DEAD (i.e., orphans
for which the user has already called close()):

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 28cf19317b6c2..a266078b8ec8c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -5563,15 +5563,12 @@ int tcp_abort(struct sock *sk, int err)
        local_bh_disable();
        bh_lock_sock(sk);

-       if (!sock_flag(sk, SOCK_DEAD)) {
-               if (tcp_need_reset(sk->sk_state))
-                       tcp_send_active_reset(sk, GFP_ATOMIC);
-               tcp_done_with_error(sk, err);
-       }
+       if (tcp_need_reset(sk->sk_state))
+               tcp_send_active_reset(sk, GFP_ATOMIC);
+       tcp_done_with_error(sk, err);

        bh_unlock_sock(sk);
        local_bh_enable();
-       tcp_write_queue_purge(sk);
        release_sock(sk);
        return 0;
 }
---

Here is a packetdrill script that reproduces a scenario similar to that:

---  gtests/net/tcp/inet_diag/inet-diag-fin-wait-1-retrans-kill.pkt
// Test SOCK_DESTROY on TCP_FIN_WAIT1 sockets
// We use the "ss" socket statistics tool, which uses inet_diag sockets.

// ss -K can be slow
--tolerance_usecs=15000

// Set up config.
`../common/defaults.sh`

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 2>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < . 1:1(0) ack 1 win 32890

   +0 accept(3, ..., ...) = 4

// Send 4 data segments.
   +0 write(4, ..., 4000) = 4000
   +0 > P. 1:4001(4000) ack 1

   +0 close(4) = 0
   +0 > F. 4001:4001(0) ack 1

// In TCP_FIN_WAIT1 now...

// Send FIN as a TLP probe at 2*srtt:
+.200 > F. 4001:4001(0) ack 1

// Retransmit head.
+.300 > . 1:1001(1000) ack 1
   +0 `ss -tinmo src :8080`

// Test what happens when we ss --kill a socket in TCP_FIN_WAIT1.
// ss --kill is scary! Don't mess with the filter or risk killing many flows!
   +0 `ss -t --kill -n src :8080 `

   +0 `echo check what is left...; ss -tinmo src :8080`

// An ACK arrives that carries a SACK block and
// makes us call tcp_verify_left_out(tp) to WARN about the inconsistency:
+.010 < . 1:1(0) ack 1 win 32890 <sack 1001:2001,nop,nop>

   +0 `echo after SACK; ss -tinmo src :8080`
---
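
For anyone who wants to try it: assuming a typical packetdrill setup
(the defaults.sh include above is from the google/packetdrill gtests
layout), the script can be run as root with something like:

---
cd gtests/net/tcp/inet_diag
sudo packetdrill ./inet-diag-fin-wait-1-retrans-kill.pkt
---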

That script triggers one of the warning cases mentioned in the netdev
email thread (see below).

Note that when I extend the packetdrill script above to expect another
RTO retransmission, that retransmission never happens, which AFAICT
supports my theory about the socket leak issue.

best regards,
neal

---
ps: here is the warning triggered by the packetdrill script above

[412967.317794] ------------[ cut here ]------------
[412967.317801] WARNING: CPU: 109 PID: 865840 at
net/ipv4/tcp_input.c:2141 tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317805] Modules linked in: dummy act_mirred sch_netem ifb
bridge stp llc vfat fat i2c_mux_pca954x i2c_mux gq sha3_generic
spi_lewisburg_pch spidev cdc_acm google_bmc_lpc google_bmc_mailbox
xhci_pci xhci_hcd i2c_iimc
[412967.317818] CPU: 109 PID: 865840 Comm: packetdrill Kdump: loaded
Tainted: G S               N 5.10.0-smp-1300.91.890.1 #6
[412967.317820] Hardware name: Google LLC Indus/Indus_QC_00, BIOS
30.60.4 02/23/2023
[412967.317821] RIP: 0010:tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317824] Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc 0f
0b 41 83 be b0 06 00 00 00 79 a0 0f 0b eb 9c 0f 0b 41 8b 86 d0 06 00
00 eb 9c <0f> 0b eb b3 0f 0b e9 ac fe ff ff b8 ff ff ff ff e9 39 ff ff
ff e8
[412967.317826] RSP: 0018:ffff999ff84b9770 EFLAGS: 00010286
[412967.317827] RAX: 0000000000000001 RBX: 0000000000000001 RCX:
0000000000000005
[412967.317828] RDX: 00000000fffffffc RSI: ffff99a04bd7d200 RDI:
ffff99a07956e1c8
[412967.317830] RBP: ffff999ff84b9830 R08: ffff99a04bd7d200 R09:
00000000f82c8a14
[412967.317831] R10: ffff999f8b67dc54 R11: ffff999f8b67dc00 R12:
0000000000000054
[412967.317832] R13: 0000000000000001 R14: ffff99a07956e040 R15:
0000000000000014
[412967.317833] FS:  00007f4cf48dc740(0000) GS:ffff99cebfc80000(0000)
knlGS:0000000000000000
[412967.317834] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[412967.317835] CR2: 00000000085e9480 CR3: 000000309df36006 CR4:
00000000007726f0
[412967.317836] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[412967.317837] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[412967.317838] PKRU: 55555554
[412967.317839] Call Trace:
[412967.317845]  ? tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317848]  ? __warn+0x195/0x2a0
[412967.317853]  ? tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317869]  ? report_bug+0xe6/0x150
[412967.317872]  ? handle_bug+0x4c/0x90
[412967.317878]  ? exc_invalid_op+0x3a/0x110
[412967.317881]  ? asm_exc_invalid_op+0x12/0x20
[412967.317885]  ? tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317889]  tcp_ack+0x5c8/0x1a90
[412967.317893]  ? prep_new_page+0x81/0xe0
[412967.317897]  ? prep_new_page+0x41/0xe0
[412967.317900]  ? get_page_from_freelist+0x1556/0x15b0
[412967.317904]  tcp_rcv_state_process+0x2cd/0xe00
[412967.317908]  tcp_v4_do_rcv+0x2ac/0x370
[412967.317913]  tcp_v4_rcv+0x9b8/0xab0
[412967.317916]  ip_protocol_deliver_rcu+0x71/0x110
[412967.317921]  ip_local_deliver+0xa8/0x130
[412967.317925]  ? inet_rtm_getroute+0x191/0x910
[412967.317928]  ? ip_local_deliver+0x130/0x130
[412967.317931]  ip_rcv+0x41/0xd0
[412967.317934]  ? ip_rcv_core+0x300/0x300
[412967.317937]  __netif_receive_skb+0x9e/0x160
[412967.317942]  netif_receive_skb+0x2c/0x130
[412967.317945]  tun_rx_batched+0x17b/0x1e0
[412967.317950]  tun_get_user+0xe39/0xff0
[412967.317953]  ? __switch_to_asm+0x3a/0x60
[412967.317958]  tun_chr_write_iter+0x57/0x80
[412967.317961]  do_iter_readv_writev+0x143/0x180
[412967.317967]  do_iter_write+0x8b/0x1d0
[412967.317970]  vfs_writev+0x96/0x130
[412967.317974]  do_writev+0x6b/0x100
[412967.317978]  do_syscall_64+0x6d/0xa0
[412967.317981]  entry_SYSCALL_64_after_hwframe+0x67/0xd1
[412967.317985] RIP: 0033:0x7f4cf4a10885
