Message-ID: <CAHx7jf807SHbTZhF4LeWsesSPnYxeE6vO37vTGXp+dr-65JP+w@mail.gmail.com>
Date: Wed, 16 Apr 2025 19:30:27 -0300
From: Luiz Carlos Mourão Paes de Carvalho <luizcmpc@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org,
Neal Cardwell <ncardwell@...gle.com>
Subject: Re: [PATCH net] tcp: tcp_acceptable_seq select SND.UNA when SND.WND
is 0
On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho
> <luizcmpc@...il.com> wrote:
> >
> > Hi Paolo,
> >
> > The dropped ack is a response to data sent by the peer.
> >
> > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped
> > by the peer's tcp_sequence function. Connection only advances when we send a RTO.
> >
> > Let me know if the following describes the scenario you expected. I'll add a packetdrill with
> > the expected interaction to the patch if it makes sense.
> >
> > // Tests the invalid SEQs sent by the listener
> > // which are then dropped by the peer.
> >
> > `./common/defaults.sh
> > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0`
> >
> > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > +0 bind(3, ..., ...) = 0
> > +0 listen(3, 1) = 0
> >
> > +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7>
> > +0 > S. 0:0(0) ack 1 <...>
> > +.1 < . 1:1(0) ack 1 win 8
> > +0 accept(3, ..., ...) = 4
> >
> > +0 write(4, ..., 990) = 990
> > +0 > P. 1:991(990) ack 1
> > +0 < . 1:1(0) ack 991 win 8 // win=8 despite buffer being almost full, shrink_window=0
> >
> > +0 write(4, ..., 100) = 100
> > +0 > P. 991:1091(100) ack 1 // SND.NXT=1091
> > +0 < . 1:1(0) ack 991 win 0 // failed to queue rx data, RCV.NXT=991, RCV.WND=0
> >
> > +0.1 < P. 1:1001(1000) ack 901 win 0
>
> This 'ack 901' does not seem right ?
It's indeed incorrect; the bug still occurs if it is 991. Sorry about that.
>
> Also your fix would not work if 'win 0' was 'win 1' , and/or if the
> initial wscale was 6 instead of 7 ?
It indeed does not work if win=1, but that's unlikely to happen unless
you enable shrink_window, and probably
suggests the mentioned loss of precision.
Now, regarding the scale, it does happen with wscale=6 if your second
write sends < 64 bytes.
The same is true with any other scale: with wscale=1 it would happen if
the second write sent a single byte, etc.
It happens as long as SND.NXT - (SND.UNA + SND.WND) < (1 << wscale).
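
To make that condition concrete, here is a rough user-space sketch of the
sequence selection as I understand it. It is not the kernel function: the
names model_acceptable_seq() and seq_before() are made up for illustration,
window scaling is assumed to be negotiated, and wscale is simply the shift
discussed in this thread.

```c
#include <stdint.h>

/* Wraparound-safe "a is before b" on 32-bit sequence numbers. */
static int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Hypothetical model (assumption, not kernel code): pick the sequence
 * number for a pure ACK. SND.NXT is used when it still fits in the
 * peer's window, or when it overshoots the window end by less than one
 * scaling unit; otherwise the window end (SND.UNA + SND.WND) is used.
 */
static uint32_t model_acceptable_seq(uint32_t snd_una, uint32_t snd_wnd,
				     uint32_t snd_nxt, unsigned int wscale)
{
	uint32_t wnd_end = snd_una + snd_wnd;

	if (!seq_before(wnd_end, snd_nxt) ||
	    (snd_nxt - wnd_end) < (1u << wscale))
		return snd_nxt;

	return wnd_end;
}
```

With the numbers from the test (SND.UNA=991, SND.WND=0, SND.NXT=1091), this
model picks 1091 for wscale=7 and 991 for wscale=6. With a 50-byte second
write (SND.NXT=1041) and wscale=6 it picks 1041 again.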
>
> > +0 > . 1091:1091(0) ack 1001 // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991:
> > // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> > // return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
>
> I assume that your patch would change the 1091:1091(0) to 991:991(0) ?
Precisely.
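
For reference, a small sketch of why the peer drops one and accepts the
other. It only models the upper-bound comparison quoted above from
tcp_sequence(), nothing else from that function. The helper names are
made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" on 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Hypothetical model (assumption): true when the segment starts beyond
 * RCV.NXT + RCV.WND, i.e. when the quoted check would return
 * SKB_DROP_REASON_TCP_INVALID_SEQUENCE.
 */
static bool model_seq_beyond_window(uint32_t seq, uint32_t rcv_nxt,
				    uint32_t rcv_wnd)
{
	return seq_after(seq, rcv_nxt + rcv_wnd);
}
```

With RCV.NXT=991 and RCV.WND=0 on the peer, SEQ=1091 is beyond the window
and gets dropped, while SEQ=991 is not.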
>
> It is not clear if there is a bug here... window reneging is outside
> RFC specs unfortunately,
> as hinted in the tcp_acceptable_seq() comments.
Yeah, that got me thinking as well. Although it isn't covered by
the RFC, the behavior did change with
8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
which is a relatively recent patch (Jan 2025).
Currently, the connection could stall indefinitely, which seems
undesirable. I would be happy to look into other
solutions if anything comes to mind, though.
The way I see it, the stack shouldn't be sending ACKs that it
knows to be invalid.
>
> >
> > +0.2 > P. 991:1091(100) ack 1001 // this is a RTO, ack accepted
> > +0 < P. 1001:2001(1000) ack 991 win 0 // peer responds, still no space available, but has more data to send
> > +0 > . 1091:1091(0) ack 2001 // ack dropped
> >
> > +0.3 > P. 991:1091(100) ack 2001 // RTO, ack accepted
> > +0 < . 2001:3001(1000) ack 991 win 0 // still no space available, but another chunk of data
> > +0 > . 1091:1091(0) ack 3001 // ack dropped
> >
> > +0.6 > P. 991:1091(100) ack 3001 // RTO, ack accepted
> > +0 < . 3001:4001(1000) ack 991 win 0 // no space available, but peer has data to send at all times
> > +0 > . 1091:1091(0) ack 4001 // ack dropped
> >
> > +1.2 > P. 991:1091(100) ack 4001 // another probe, accepted
> >
> > // this goes on and on. note that the peer always has data just waiting there to be sent,
> > // server acks it, but the ack is dropped because SEQ is incorrect.
> > // only the RTOs are advancing the connection, but they are backed off every time.
> >
> > // Reset sysctls
> > `/tmp/sysctl_restore_${PPID}.sh`
> >
> > On Tue, Apr 15, 2025 at 8:30 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >>
> >>
> >>
> >> On 4/10/25 7:50 PM, Luiz Carvalho wrote:
> >> > The current tcp_acceptable_seq() returns SND.NXT when the available
> >> > window shrinks to less than one scaling factor. This works fine for most
> >> > cases, and seemed to not be a problem until a slight behavior change to
> >> > how tcp_select_window() handles ZeroWindow cases.
> >> >
> >> > Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
> >> > a zero window would only be announced when data failed to be consumed,
> >> > and following packets would have non-zero windows despite the receiver
> >> > still not having any available space. After the commit, however, the
> >> > zero window is stored in the socket and the advertised window will be
> >> > zero until the receiver frees up space.
> >> >
> >> > For tcp_acceptable_seq(), a zero window case will result in SND.NXT
> >> > being sent, but the problem now arises when the receiver validates the
> >> > sequence number in tcp_sequence():
> >> >
> >> > static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp,
> >> > u32 seq, u32 end_seq)
> >> > {
> >> > // ...
> >> > if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> >> > return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
> >> > // ...
> >> > }
> >> >
> >> > Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail
> >> > the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is
> >> > dropped by the receiver, correctly, as RFC793 mentions:
> >> >
> >> > There are four cases for the acceptability test for an incoming
> >> > segment:
> >> >
> >> > Segment Receive Test
> >> > Length Window
> >> > ------- ------- -------------------------------------------
> >> >
> >> > 0 0 SEG.SEQ = RCV.NXT
> >> >
> >> > The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again,
> >> > and the connection continues. If the receiver announces ZeroWindow
> >> > again, the stall could be very long, as it was in my case. Found this out
> >> > while giving a shot at bug #213827.
> >>
> >> The dropped ack causing the stall is a zero window probe from the sender
> >> right?
> >> Could you please describe the relevant scenario with a pktdrill test?
> >>
> >> Thanks!
> >>
> >> Paolo
> >>