netdev - Re: [PATCH net-next] tcp: avoid RST in 3-way shakehands due to failure in tcp_timewait_state

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL+tcoBVbL+tRsHZpLQro1gEoAUn11ZWG3=AL-5XxFVCAxZ5fg@mail.gmail.com>
Date: Tue, 5 Nov 2024 19:48:12 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>, Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Kuniyuki Iwashima <kuniyu@...zon.com>, davem@...emloft.net, dsahern@...nel.org, 
	horms@...nel.org, kernelxing@...cent.com, kuba@...nel.org, 
	netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [PATCH net-next] tcp: avoid RST in 3-way shakehands due to
 failure in tcp_timewait_state_process

Hello Eric, Martin

On Tue, Nov 5, 2024 at 5:50 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Tue, Nov 5, 2024 at 10:42 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Tue, Nov 5, 2024 at 5:35 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Tue, Nov 5, 2024 at 10:08 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > On Tue, Nov 5, 2024 at 3:49 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> > > > >
> > > > > From: Jason Xing <kerneljasonxing@...il.com>
> > > > > Date: Tue,  5 Nov 2024 10:55:11 +0800
> > > > > > From: Jason Xing <kernelxing@...cent.com>
> > > > > >
> > > > > > We found there are rare chances that some RST packets appear during
> > > > > > the shakehands because the timewait socket cannot accept the SYN and
> > > > >
> > > > > s/shakehands/handshake/
> > > > >
> > > > > same in the subject.
> > > > >
> > > > > > doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > > >
> > > > > > Here is how things happen in production:
> > > > > > Time        Client(A)        Server(B)
> > > > > > 0s          SYN-->
> > > > > > ...
> > > > > > 132s                         <-- FIN
> > > > > > ...
> > > > > > 169s        FIN-->
> > > > > > 169s                         <-- ACK
> > > > > > 169s        SYN-->
> > > > > > 169s                         <-- ACK
> > > > > > 169s        RST-->
> > > > > > As above picture shows, the two flows have a start time difference
> > > > > > of 169 seconds. B starts to send FIN so it will finally enter into
> > > > > > TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > > > that soon is reset by itself due to receiving a ACK.
> > > > > >
> > > > > > There are two key checks in tcp_timewait_state_process() when timewait
> > > > > > socket in B receives the SYN packet:
> > > > > > 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > > > 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > > >
> > > > > > Regarding the first rule, it fails as expected because in the first
> > > > > > connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > > > passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > > >
> > > > > > Then how about the second rule?
> > > > > > It fails again!
> > > > > > Let's take a look at how the tsval comes out:
> > > > > > __tcp_transmit_skb()
> > > > > >     -> tcp_syn_options()
> > > > > >         -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > > > The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > > > other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > > > to care about it. If both operations (sending FIN and then starting
> > > > > > sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > > > It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > > > with millisecond precision.
> > > > > >
> > > > > > Based on the above analysis, I decided to make a small change to
> > > > > > the check in tcp_timewait_state_process() so that the second flow
> > > > > > would not fail.
> > > > > >
> > > > > > Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > > > > > ---
> > > > > >  net/ipv4/tcp_minisocks.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > > > index bb1fe1ba867a..2b29d1bf5ca0 100644
> > > > > > --- a/net/ipv4/tcp_minisocks.c
> > > > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > > > @@ -234,7 +234,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
> > > > > >       if (th->syn && !th->rst && !th->ack && !paws_reject &&
> > > > > >           (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
> > > > > >            (tmp_opt.saw_tstamp &&
> > > > > > -           (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
> > > > >
> > > > > I think this follows RFC 6191 and such a change requires a formal
> > > > > discussion at IETF.
> > > > >
> > > > > https://datatracker.ietf.org/doc/html/rfc6191#section-2
> > > > >
> > > > > ---8<---
> > > > >       *  If TCP Timestamps would be enabled for the new incarnation of
> > > > >          the connection, and the timestamp contained in the incoming SYN
> > > > >          segment is greater than the last timestamp seen on the previous
> > > >
> > > > The true thing is that the timestamp of the SYN packet is greater than
> > > > that of the last packet, but the kernel implementation uses ms
> > > > precision (please see tcp_skb_timestamp_ts()). That function makes
> > > > those two timestamps the same.
> > > >
> > > > This case happens as expected, so the second connection should be
> > > > established. My confusion just popped out of my mind: what rules
> > > > should we follow to stop the second flow?
> > >
> > > Note that linux TCP stack can use usec resolution for TCP TS values.
> > >
> > > You might adopt it, and no longer care about this ms granularity.
> >
> > Right, I noticed this feature. I wonder if we can change the check in
> > tcp_timewait_state_process() like this patch if it has no side
> > effects? I'm worried that some programs don't use this feature. It's
> > the reason why I try to propose this patch to you.
>
> Breaking RFC ? I do not think so.
>
> Instead, use a usec clock and voila, the problem is solved.

I wonder if it is necessary to use BPF to extend the usec clock. Since
I started investigating the BPF part, I found it can help us extend
many useful features in TCP to the most extent.

Using it with the BPF program can make the feature take effect and
widely used in future.

I'm asking because I somehow have a feeling that someday the majority
of traditional sockopts could be replaced by BPF. After all, requiring
application modification for those kernel features is a bit heavy. I'm
not sure if I'm right about it :S

Thanks,
Jason