netdev - Re: [PATCH] tcp: check socket state before calling WARN

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CADVnQynUspJL4e3UnZTKps9WmgnE-0ngQnQmn=8gjSmyg4fQ5A@mail.gmail.com>
Date: Tue, 3 Dec 2024 10:34:46 -0500
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Youngmin Nam <youngmin.nam@...sung.com>, davem@...emloft.net, dsahern@...nel.org, 
	kuba@...nel.org, pabeni@...hat.com, horms@...nel.org, dujeong.lee@...sung.com, 
	guo88.liu@...sung.com, yiwang.cai@...sung.com, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org, joonki.min@...sung.com, hajun.sung@...sung.com, 
	d7271.choe@...sung.com, sw.ju@...sung.com
Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON

On Tue, Dec 3, 2024 at 6:07 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Tue, Dec 3, 2024 at 9:10 AM Youngmin Nam <youngmin.nam@...sung.com> wrote:
> >
> > We encountered the following WARNINGs
> > in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
> > which triggered a kernel panic due to panic_on_warn.
> >
> > case 1.
> > ------------[ cut here ]------------
> > WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
> > Call trace:
> >  tcp_sacktag_write_queue+0xae8/0xb60
> >  tcp_ack+0x4ec/0x12b8
> >  tcp_rcv_state_process+0x22c/0xd38
> >  tcp_v4_do_rcv+0x220/0x300
> >  tcp_v4_rcv+0xa5c/0xbb4
> >  ip_protocol_deliver_rcu+0x198/0x34c
> >  ip_local_deliver_finish+0x94/0xc4
> >  ip_local_deliver+0x74/0x10c
> >  ip_rcv+0xa0/0x13c
> > Kernel panic - not syncing: kernel: panic_on_warn set ...
> >
> > case 2.
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
> > Call trace:
> >  tcp_fastretrans_alert+0x8ac/0xa74
> >  tcp_ack+0x904/0x12b8
> >  tcp_rcv_state_process+0x22c/0xd38
> >  tcp_v4_do_rcv+0x220/0x300
> >  tcp_v4_rcv+0xa5c/0xbb4
> >  ip_protocol_deliver_rcu+0x198/0x34c
> >  ip_local_deliver_finish+0x94/0xc4
> >  ip_local_deliver+0x74/0x10c
> >  ip_rcv+0xa0/0x13c
> > Kernel panic - not syncing: kernel: panic_on_warn set ...
> >
>
> I have not seen these warnings firing. Neal, have you seen this in the past ?

I can't recall seeing these warnings over the past 5 years or so, and
(from checking our monitoring) they don't seem to be firing in our
fleet recently.

> In any case this test on sk_state is too specific.

I agree with Eric. IMHO TCP_FIN_WAIT1 deserves all the same warnings
as ESTABLISHED, since in this state the connection may still have a
big queue of data it is trying to reliably send to the other side,
with full loss recovery and congestion control logic.

I would suggest that instead of running with panic_on_warn it would
make more sense to not panic on warning, and instead add more detail
to these warning messages in your kernels during your testing, to help
debug what is going wrong. I would suggest adding to the warning
message:

tp->packets_out
tp->sacked_out
tp->lost_out
tp->retrans_out
tcp_is_sack(tp)
tp->mss_cache
inet_csk(sk)->icsk_ca_state
inet_csk(sk)->icsk_pmtu_cookie

A hunch would be that this is either firing for (a) non-SACK
connections, or (b) after an MTU reduction.

In particular, you might try `echo 0 >
/proc/sys/net/ipv4/tcp_mtu_probing` and see if that makes the warnings
go away.

cheers,
neal