netdev - Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADxym3Zqb2CCpJojGiT7gVL98GDdOmjxqLY6ApLeP2zZU1Kn3Q@mail.gmail.com>
Date: Fri, 28 Jul 2023 14:25:28 +0800
From: Menglong Dong <menglong8.dong@...il.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: Eric Dumazet <edumazet@...gle.com>, davem@...emloft.net, kuba@...nel.org, 
	pabeni@...hat.com, dsahern@...nel.org, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Menglong Dong <imagedong@...cent.com>, 
	Yuchung Cheng <ycheng@...gle.com>, Soheil Hassas Yeganeh <soheil@...gle.com>
Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by
 icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@...il.com> wrote:
> >
> > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@...il.com> wrote:
> > > >
> > > > From: Menglong Dong <imagedong@...cent.com>
> > > >
> > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > right all the time.
> > > >
> > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > TCP_RTO_MAX sooner or later.
> > > >
> > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > >
> > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > once the RTO come up to TCP_RTO_MAX.
> > > >
> > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > which is exact the timestamp of the timeout.
> > > >
> > > > Signed-off-by: Menglong Dong <imagedong@...cent.com>
> > > > ---
> > > >  net/ipv4/tcp_timer.c | 6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > index 470f581eedd4..3a20db15a186 100644
> > > > --- a/net/ipv4/tcp_timer.c
> > > > +++ b/net/ipv4/tcp_timer.c
> > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > >                                             tp->snd_una, tp->snd_nxt);
> > > >                 }
> > > >  #endif
> > > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > +               /* It's a little rough here, we regard any valid packet that
> > > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > > +                * packet.
> > > > +                */
> > > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > >                         tcp_write_err(sk);
> > > >                         goto out;
> > > >                 }
> > >
> > >
> > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > other patches ?
> >
> > Yeah, this patch can be standalone. However, considering the
> > purpose of this series, it is necessary. Without this patch, the
> > OOM probe will always timeout after a few minutes.
> >
> > I'm not sure if I express the problem clearly in the commit log.
> > Let's explain it more.
> >
> > Let's mark the timestamp of the 10th timeout of the rtx timer
> > as TS1. Now, the retransmission happens and the ACK of
> > the retransmitted packet will update the tp->rcv_tstamp to
> > TS1+rtt.
> >
> > The RTO now is TCP_RTO_MAX. So let's see what will
> > happen in the 11th timeout. As we timeout after 122877ms,
> > so tcp_jiffies32 now is "TS1+122877ms", and
> > "tcp_jiffies32 - tp->rcv_tstamp" is
> > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> >
> > >
> > > Neal, what do you think ?
>
> Sorry, I am probably missing something here, but: what would ever make
> this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > TCP_RTO_MAX) true? :-)
>

If the snd_wnd is 0, we need to keep probing until the window
is available. Meanwhile, any retransmission that don't have
a corresponding ACK (see what we do in the 1st patch), which
can be caused by the lost of the ACK or the lost of the retransmitted
packet, can make the condition true, as the tp->rcv_tstamp can't be
updated in time.

This is a little strict here. In the tcp_probe_timer(), we are allowed to
retransmit the probe0 packet for sysctl_tcp_retries2 times. But
we don't allow packets to be lost here.

> In your nicely explained scenario, your new expression,
> icsk->icsk_timeout - tp->rcv_tstamp, will be:
>
>   icsk->icsk_timeout - tp->rcv_tstamp
> = TS1 + 120 sec      - (TS1+rtt)
> = 120 sec - RTT
>
> AFAICT there is no way for that expression to be bigger than
> TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
>
> So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> TCP_RTO_MAX) will always be false, so rather than this patch we may as
> well remove the if check and the body of the if block?
>

Hmm......as I explained above, the condition will be true
if the real packet loss happens. And I think it is the origin
design.

> To me such a change does not seem like a safe and clear bug fix for
> the "net" branch but rather a riskier design change (appropriate for
> "net-next" branch) that has connections retry forever when the
> receiver retracts the window to zero, under the estimation that this
> is preferable to having the connections die in such a case.
>
> There might be apps that depend on the old behavior of having
> connections die in such cases, so we might want to have this new
> fail-faster behavior guarded by a sysctl in case some sites need to
> revert to the older behavior? Not sure...

Yeah, the behavior here will be different for the users. I'm not
sure if there are any users that rely on such behavior.

What do you think, Eric? Do we need a sysctl here?

Thanks!
Menglong Dong

>
> neal