netdev - Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89i+WnwgpGy4v=aXsjThPBA2FQzWx9Y=ycXWWGLDdtDHBig@mail.gmail.com>
Date: Fri, 28 Jul 2023 10:50:40 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Menglong Dong <menglong8.dong@...il.com>
Cc: Neal Cardwell <ncardwell@...gle.com>, davem@...emloft.net, kuba@...nel.org, 
	pabeni@...hat.com, dsahern@...nel.org, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Menglong Dong <imagedong@...cent.com>, 
	Yuchung Cheng <ycheng@...gle.com>, Soheil Hassas Yeganeh <soheil@...gle.com>
Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by
 icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <menglong8.dong@...il.com> wrote:
>
> On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> >
> > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@...il.com> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@...il.com> wrote:
> > > > >
> > > > > From: Menglong Dong <imagedong@...cent.com>
> > > > >
> > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > right all the time.
> > > > >
> > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > TCP_RTO_MAX sooner or later.
> > > > >
> > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > >
> > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > once the RTO come up to TCP_RTO_MAX.
> > > > >
> > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > which is exact the timestamp of the timeout.
> > > > >
> > > > > Signed-off-by: Menglong Dong <imagedong@...cent.com>
> > > > > ---
> > > > >  net/ipv4/tcp_timer.c | 6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > --- a/net/ipv4/tcp_timer.c
> > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > >                                             tp->snd_una, tp->snd_nxt);
> > > > >                 }
> > > > >  #endif
> > > > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > +               /* It's a little rough here, we regard any valid packet that
> > > > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > +                * packet.
> > > > > +                */
> > > > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > >                         tcp_write_err(sk);
> > > > >                         goto out;
> > > > >                 }
> > > >
> > > >
> > > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > > other patches ?
> > >
> > > Yeah, this patch can be standalone. However, considering the
> > > purpose of this series, it is necessary. Without this patch, the
> > > OOM probe will always timeout after a few minutes.
> > >
> > > I'm not sure if I express the problem clearly in the commit log.
> > > Let's explain it more.
> > >
> > > Let's mark the timestamp of the 10th timeout of the rtx timer
> > > as TS1. Now, the retransmission happens and the ACK of
> > > the retransmitted packet will update the tp->rcv_tstamp to
> > > TS1+rtt.
> > >
> > > The RTO now is TCP_RTO_MAX. So let's see what will
> > > happen in the 11th timeout. As we timeout after 122877ms,
> > > so tcp_jiffies32 now is "TS1+122877ms", and
> > > "tcp_jiffies32 - tp->rcv_tstamp" is
> > > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> > >
> > > >
> > > > Neal, what do you think ?
> >
> > Sorry, I am probably missing something here, but: what would ever make
> > this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > > TCP_RTO_MAX) true? :-)
> >
>
> If the snd_wnd is 0, we need to keep probing until the window
> is available. Meanwhile, any retransmission that don't have
> a corresponding ACK (see what we do in the 1st patch), which
> can be caused by the lost of the ACK or the lost of the retransmitted
> packet, can make the condition true, as the tp->rcv_tstamp can't be
> updated in time.
>
> This is a little strict here. In the tcp_probe_timer(), we are allowed to
> retransmit the probe0 packet for sysctl_tcp_retries2 times. But
> we don't allow packets to be lost here.
>
> > In your nicely explained scenario, your new expression,
> > icsk->icsk_timeout - tp->rcv_tstamp, will be:
> >
> >   icsk->icsk_timeout - tp->rcv_tstamp
> > = TS1 + 120 sec      - (TS1+rtt)
> > = 120 sec - RTT
> >
> > AFAICT there is no way for that expression to be bigger than
> > TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
> >
> > So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> > TCP_RTO_MAX) will always be false, so rather than this patch we may as
> > well remove the if check and the body of the if block?
> >
>
> Hmm......as I explained above, the condition will be true
> if the real packet loss happens. And I think it is the origin
> design.
>
> > To me such a change does not seem like a safe and clear bug fix for
> > the "net" branch but rather a riskier design change (appropriate for
> > "net-next" branch) that has connections retry forever when the
> > receiver retracts the window to zero, under the estimation that this
> > is preferable to having the connections die in such a case.
> >
> > There might be apps that depend on the old behavior of having
> > connections die in such cases, so we might want to have this new
> > fail-faster behavior guarded by a sysctl in case some sites need to
> > revert to the older behavior? Not sure...
>
> Yeah, the behavior here will be different for the users. I'm not
> sure if there are any users that rely on such behavior.
>
> What do you think, Eric? Do we need a sysctl here?
>

I honestly do not know what problem you want to solve.

As Neal pointed out, the new condition would not trigger,
so one can question about the whole piece of code,
what is its purpose exactly ?

When receiving WIN 0 acks, we should enter the so called probe0 state.
Maybe the real issue is that the 'receiver' will falsely send WIN X>0 ACKS,
because win probe acks do not really ensure memory is available to
receive new packets ?

Maybe provide a packetdrill test to show what kind of issue you are facing...

In Google kernels, we have TCP_MAX_RTO reduced to 30 seconds, and the
following test runs fine.

// Test how sender reacts to unexpected arrival rwin of 0.

`../common/defaults.sh`

// Create a socket.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

// Establish a connection.
  +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < . 1:1(0) ack 1 win 457
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 20000) = 20000
   +0 > P. 1:10001(10000) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

// Send zwp since we received rwin of 0 and have data to send.
  +.3 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

  +.6 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

 +1.2 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

 +2.4 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win  0

 +4.8 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win  0

 +9.6 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win  0

+19.2 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

+30   > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

+30   > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 193

// Received non-zero window update. Send rest of the data.
   +0 > P. 10001:20001(10000) ack 1
  +.1 < . 1:1(0) ack 20001 win 457