linux-kernel - Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADVnQyk4Wew4iv9WRGAOiJE6fKD+OuvnwF8UQcDf+G5Ou25yUA@mail.gmail.com>
Date:   Tue, 23 May 2023 09:27:17 -0400
From:   Neal Cardwell <ncardwell@...gle.com>
To:     Menglong Dong <menglong8.dong@...il.com>
Cc:     Eric Dumazet <edumazet@...gle.com>, kuba@...nel.org,
        davem@...emloft.net, pabeni@...hat.com, dsahern@...nel.org,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
        Menglong Dong <imagedong@...cent.com>,
        Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Tue, May 23, 2023 at 4:59 AM Menglong Dong <menglong8.dong@...il.com> wrote:
>
> On Mon, May 22, 2023 at 11:04 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> >
> > On Sun, May 21, 2023 at 10:55 PM Menglong Dong <menglong8.dong@...il.com> wrote:
> > >
> > > On Sat, May 20, 2023 at 10:28 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > >
> > > > On Sat, May 20, 2023 at 5:08 AM Menglong Dong <menglong8.dong@...il.com> wrote:
> > > > >
> > > > > On Fri, May 19, 2023 at 12:03 AM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > >
> > > > > > On Thu, May 18, 2023 at 10:12 AM Menglong Dong <menglong8.dong@...il.com> wrote:
> > > > > > >
> > > > > > > On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, May 17, 2023 at 10:35 PM Menglong Dong <menglong8.dong@...il.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, May 17, 2023 at 2:42 PM <menglong8.dong@...il.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > From: Menglong Dong <imagedong@...cent.com>
> > > > > > > > > > >
> > > > > > > > > > > Window shrink is not allowed and also not handled for now, but it's
> > > > > > > > > > > needed in some case.
> > > > > > > > > > >
> > > > > > > > > > > In the origin logic, 0 probe is triggered only when there is no any
> > > > > > > > > > > data in the retrans queue and the receive window can't hold the data
> > > > > > > > > > > of the 1th packet in the send queue.
> > > > > > > > > > >
> > > > > > > > > > > Now, let's change it and trigger the 0 probe in such cases:
> > > > > > > > > > >
> > > > > > > > > > > - if the retrans queue has data and the 1th packet in it is not within
> > > > > > > > > > > the receive window
> > > > > > > > > > > - no data in the retrans queue and the 1th packet in the send queue is
> > > > > > > > > > > out of the end of the receive window
> > > > > > > > > >
> > > > > > > > > > Sorry, I do not understand.
> > > > > > > > > >
> > > > > > > > > > Please provide packetdrill tests for new behavior like that.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes. The problem can be reproduced easily.
> > > > > > > > >
> > > > > > > > > 1. choose a server machine, decrease it's tcp_mem with:
> > > > > > > > >     echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > > > > > > > > 2. call listen() and accept() on a port, such as 8888. We call
> > > > > > > > >     accept() looply and without call recv() to make the data stay
> > > > > > > > >     in the receive queue.
> > > > > > > > > 3. choose a client machine, and create 100 TCP connection
> > > > > > > > >     to the 8888 port of the server. Then, every connection sends
> > > > > > > > >     data about 1M.
> > > > > > > > > 4. we can see that some of the connection enter the 0-probe
> > > > > > > > >     state, but some of them keep retrans again and again. As
> > > > > > > > >     the server is up to the tcp_mem[2] and skb is dropped before
> > > > > > > > >     the recv_buf full and the connection enter 0-probe state.
> > > > > > > > >     Finially, some of these connection will timeout and break.
> > > > > > > > >
> > > > > > > > > With this series, all the 100 connections will enter 0-probe
> > > > > > > > > status and connection break won't happen. And the data
> > > > > > > > > trans will recover if we increase tcp_mem or call 'recv()'
> > > > > > > > > on the sockets in the server.
> > > > > > > > >
> > > > > > > > > > Also, such fundamental change would need IETF discussion first.
> > > > > > > > > > We do not want linux to cause network collapses just because billions
> > > > > > > > > > of devices send more zero probes.
> > > > > > > > >
> > > > > > > > > I think it maybe a good idea to make the connection enter
> > > > > > > > > 0-probe, rather than drop the skb silently. What 0-probe
> > > > > > > > > meaning is to wait for space available when the buffer of the
> > > > > > > > > receive queue is full. And maybe we can also use 0-probe
> > > > > > > > > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > > > > > > > > is full?
> > > > > > > > >
> > > > > > > > > Am I right?
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Menglong Dong
> > > > > > > >
> > > > > > > > Thanks for describing the scenario in more detail. (Some kind of
> > > > > > > > packetdrill script or other program to reproduce this issue would be
> > > > > > > > nice, too, as Eric noted.)
> > > > > > > >
> > > > > > > > You mention in step (4.) above that some of the connections keep
> > > > > > > > retransmitting again and again. Are those connections receiving any
> > > > > > > > ACKs in response to their retransmissions? Perhaps they are receiving
> > > > > > > > dupacks?
> > > > > > >
> > > > > > > Actually, these packets are dropped without any reply, even dupacks.
> > > > > > > skb will be dropped directly when tcp_try_rmem_schedule()
> > > > > > > fails in tcp_data_queue(). That's reasonable, as it's
> > > > > > > useless to reply a ack to the sender, which will cause the sender
> > > > > > > fast retrans the packet, because we are out of memory now, and
> > > > > > > retrans can't solve the problem.
> > > > > >
> > > > > > I'm not sure I see the problem. If retransmits can't solve the
> > > > > > problem, then why are you proposing that data senders keep
> > > > > > retransmitting forever (via 0-window-probes) in this kind of scenario?
> > > > > >
> > > > >
> > > > > Because the connection will break if the count of
> > > > > retransmits up to tcp_retires2, but probe-0 can keep
> > > > > for a long time.
> > > >
> > > > I see. So it sounds like you agree that retransmits can solve the
> > > > problem, as long as the retransmits are using the zero-window probe
> > > > state machine (ICSK_TIME_PROBE0, tcp_probe_timer()), which continues
> > > > as long as the receiver is sending ACKs. And it sounds like when you
> > > > said "retrans can't solve the problem" you didn't literally mean that
> > > > retransmits can't solve the problem, but rather you meant that the RTO
> > > > state machine, specifically (ICSK_TIME_RETRANS,
> > > > tcp_retransmit_timer(), etc) can't solve the problem. I agree with
> > > > that assessment that in this scenario tcp_probe_timer() seems like a
> > > > solution but tcp_retransmit_timer() does not.
> > > >
> > >
> > > Yes, that is indeed what I want to express.
> > >
> > > > > > A single dupack without SACK blocks will not cause the sender to fast
> > > > > > retransmit. (Only 3 dupacks would trigger fast retransmit.)
> > > > > >
> > > > > > Three or more dupacks without SACK blocks will cause the sender to
> > > > > > fast retransmit the segment above SND.UNA once if the sender doesn't
> > > > > > have SACK support. But in this case AFAICT fast-retransmitting once is
> > > > > > a fine strategy, since the sender should keep retrying transmits (with
> > > > > > backoff) until the receiver potentially has memory available to
> > > > > > receive the packet.
> > > > > >
> > > > > > >
> > > > > > > > If so, then perhaps we could solve this problem without
> > > > > > > > depending on a violation of the TCP spec (which says the receive
> > > > > > > > window should not be retracted) in the following way: when a data
> > > > > > > > sender suffers a retransmission timeout, and retransmits the first
> > > > > > > > unacknowledged segment, and receives a dupack for SND.UNA instead of
> > > > > > > > an ACK covering the RTO-retransmitted segment, then the data sender
> > > > > > > > should estimate that the receiver doesn't have enough memory to buffer
> > > > > > > > the retransmitted packet. In that case, the data sender should enter
> > > > > > > > the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> > > > > > > > call tcp_probe_timer().
> > > > > > > >
> > > > > > > > Basically we could try to enhance the sender-side logic to try to
> > > > > > > > distinguish between two kinds of problems:
> > > > > > > >
> > > > > > > > (a) Repeated data packet loss caused by congestion, routing problems,
> > > > > > > > or connectivity problems. In this case, the data sender uses
> > > > > > > > ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> > > > > > > > retries sysctl_tcp_retries2 times before timing out the connection
> > > > > > > >
> > > > > > > > (b) A receiver that is repeatedly sending dupacks but not ACKing
> > > > > > > > retransmitted data because it doesn't have any memory. In this case,
> > > > > > > > the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> > > > > > > > off but keeps retrying as long as the data sender receives ACKs.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not sure if this is an ideal method, as it may be not rigorous
> > > > > > > to conclude that the receiver is oom with dupacks. A packet can
> > > > > > > loss can also cause multi dupacks.
> > > > > >
> > > > > > When a data sender suffers an RTO and retransmits a single data
> > > > > > packet, it would be very rare for the data sender to receive multiple
> > > > > > pure dupacks without SACKs. This would only happen in the rare case
> > > > > > where (a) the connection did not have SACK enabled, and (b) there was
> > > > > > a hole in the received sequence space and there were still packets in
> > > > > > flight when the (spurioius) RTO fired.
> > > > > >
> > > > > > But if we want to be paranoid, then this new response could be written
> > > > > > to only trigger if SACK is enabled (the vast, vast majority of cases).
> > > > > > If SACK is enabled, and an RTO of a data packet starting at sequence
> > > > > > S1 results in the receiver sending only a dupack for S1 without SACK
> > > > > > blocks, then this clearly shows the issue is not packet loss but
> > > > > > suggests a receiver unable to buffer the given data packet, AFAICT.
> > > > > >
> > > > >
> > > > > Yeah, you are right on this point, multi pure dupacks can
> > > > > mean out of memory of the receiver. But we still need to
> > > > > know if the receiver recovers from OOM. Without window
> > > > > shrink, the window in the ack of zero-window probe packet
> > > > > is not zero on OOM.
> > > >
> > > > But do we need a protocol-violating zero-window in this case? Why not
> > > > use my approach suggested above: conveying the OOM condition by
> > > > sending an ACK but not ACKing the retransmitted packet?
> > > >
> > >
> > > I agree with you about the approach you mentioned
> > > about conveying the OOM condition. But that approach
> > > can't convey the recovery from OOM, can it?
> >
> > Yes, my suggested approach can convey the recovery from OOM. The data
> > receiver conveys the recovery from OOM by buffering and ACKing the
> > retransmitted data packet.
>
> Oh, I understand what you mean now. You are saying that
> retransmit that first packet in the retransmit queue instead
> of zero-window probe packet when OOM of the receiver,
> isn't it? In other word, retransmit the unacked data and ignore
> the tcp_retries2 when we find the receiver is in OOM state.

Yes. The idea would be to use a heuristic to estimate the receiver is
currently OOM and use ICSK_TIME_PROBE0 / tcp_probe_timer() /
tcp_write_wakeup() in this case instead of ICSK_TIME_RETRANS /
tcp_retransmit_timer().

> That's an option, and we can make the length of the data we
> send to 1 byte, which means we keep retransmitting the first
> byte that has not be acked in the retransmit queue.

I don't think it would be worth adding special-case code to only send
1 byte.  If the data receiver is not OOM then for CPU and memory
efficiency reasons (as well as simplicity) the data sender should send
it a full MSS. So for those reasons I would suggest that in this
potential approach tcp_write_wakeup() should stay the same.

neal