Message-ID: <CADxym3ZiyYK7Vyz05qLv8jOPmNZXXepCsTbZxdkhSQxRx0cdSA@mail.gmail.com>
Date: Thu, 18 May 2023 22:11:51 +0800
From: Menglong Dong <menglong8.dong@...il.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: Eric Dumazet <edumazet@...gle.com>, kuba@...nel.org, davem@...emloft.net,
	pabeni@...hat.com, dsahern@...nel.org, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, Menglong Dong <imagedong@...cent.com>,
	Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Wed, May 17, 2023 at 10:35 PM Menglong Dong <menglong8.dong@...il.com> wrote:
> >
> > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Wed, May 17, 2023 at 2:42 PM <menglong8.dong@...il.com> wrote:
> > > >
> > > > From: Menglong Dong <imagedong@...cent.com>
> > > >
> > > > Window shrink is not allowed and also not handled for now, but it's
> > > > needed in some cases.
> > > >
> > > > In the original logic, a 0-probe is triggered only when there is no
> > > > data in the retransmit queue and the receive window can't hold the
> > > > first packet in the send queue.
> > > >
> > > > Now, let's change it and trigger the 0-probe in these cases:
> > > >
> > > > - the retransmit queue has data, and its first packet is not within
> > > >   the receive window
> > > > - the retransmit queue has no data, and the first packet in the send
> > > >   queue is beyond the end of the receive window
> > >
> > > Sorry, I do not understand.
> > >
> > > Please provide packetdrill tests for new behavior like that.
> > >
> >
> > Yes. The problem can be reproduced easily:
> >
> > 1. Choose a server machine and decrease its tcp_mem with:
> >    echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > 2. Call listen() and accept() on a port, such as 8888. We call
> >    accept() in a loop and never call recv(), so the data stays
> >    in the receive queue.
> > 3. Choose a client machine and create 100 TCP connections to
> >    port 8888 of the server. Then every connection sends about
> >    1M of data.
> > 4. We can see that some of the connections enter the 0-probe
> >    state, but some of them keep retransmitting again and again,
> >    because the server has reached tcp_mem[2] and skbs are
> >    dropped before the receive buffer fills up and the connection
> >    can enter the 0-probe state. Finally, some of these
> >    connections will time out and break.
> >
> > With this series, all the 100 connections will enter the 0-probe
> > state and connection breaks won't happen. And the data transfer
> > will recover if we increase tcp_mem or call recv() on the
> > sockets in the server.
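For reference, the server side of these steps is essentially the
following (a minimal, untested sketch; only port 8888 and the
accept-without-recv loop come from the steps above, the rest is
ordinary socket boilerplate):

/* Step 1 (as root): echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
 * Then run this server and point the 100 client connections at it.
 * Accepted sockets are never read, so incoming data sits in the
 * kernel receive queues while tcp_mem stays artificially small.
 */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr;
	int lfd = socket(AF_INET, SOCK_STREAM, 0);

	if (lfd < 0) {
		perror("socket");
		return 1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(8888);

	if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(lfd, 128) < 0) {
		perror("bind/listen");
		return 1;
	}

	for (;;) {
		/* Accept but never recv(); the socket is intentionally
		 * left open and unread. */
		int cfd = accept(lfd, NULL, NULL);

		if (cfd < 0) {
			perror("accept");
			break;
		}
	}
	return 0;
}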
> > > Also, such a fundamental change would need IETF discussion first.
> > > We do not want Linux to cause network collapses just because
> > > billions of devices send more zero probes.
> >
> > I think it may be a good idea to make the connection enter
> > 0-probe, rather than drop the skb silently. What 0-probe means
> > is waiting for space to become available when the buffer of the
> > receive queue is full. And maybe we can also use 0-probe when
> > the "buffer" of the "TCP protocol" (which means tcp_mem) is
> > full?
> >
> > Am I right?
> >
> > Thanks!
> > Menglong Dong
>
> Thanks for describing the scenario in more detail. (Some kind of
> packetdrill script or other program to reproduce this issue would be
> nice, too, as Eric noted.)
>
> You mention in step (4.) above that some of the connections keep
> retransmitting again and again. Are those connections receiving any
> ACKs in response to their retransmissions? Perhaps they are receiving
> dupacks?

Actually, these packets are dropped without any reply, not even
dupacks. The skb is dropped directly when tcp_try_rmem_schedule()
fails in tcp_data_queue(). That's reasonable, as it's useless to
reply with an ACK to the sender: that would only make the sender
fast-retransmit the packet, and since we are out of memory,
retransmitting can't solve the problem.
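Roughly, the drop happens on this path (simplified from
tcp_data_queue() in net/ipv4/tcp_input.c; the exact code varies by
kernel version):

	/* In-order data has arrived, but receive-buffer accounting
	 * fails under memory pressure: the skb is dropped and no ACK
	 * of any kind (not even a dupack) goes back to the sender.
	 */
	if (skb_queue_len(&sk->sk_receive_queue) == 0)
		sk_forced_mem_schedule(sk, skb->truesize);
	else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
		reason = SKB_DROP_REASON_PROTO_MEM;
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
		sk->sk_data_ready(sk);
		goto drop;	/* silent drop: sender sees nothing */
	}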
> If so, then perhaps we could solve this problem without
> depending on a violation of the TCP spec (which says the receive
> window should not be retracted) in the following way: when a data
> sender suffers a retransmission timeout, and retransmits the first
> unacknowledged segment, and receives a dupack for SND.UNA instead of
> an ACK covering the RTO-retransmitted segment, then the data sender
> should estimate that the receiver doesn't have enough memory to
> buffer the retransmitted packet. In that case, the data sender should
> enter the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer
> to call tcp_probe_timer().
>
> Basically we could try to enhance the sender-side logic to try to
> distinguish between two kinds of problems:
>
> (a) Repeated data packet loss caused by congestion, routing problems,
> or connectivity problems. In this case, the data sender uses
> ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> retries sysctl_tcp_retries2 times before timing out the connection.
>
> (b) A receiver that is repeatedly sending dupacks but not ACKing
> retransmitted data because it doesn't have any memory. In this case,
> the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and
> backs off but keeps retrying as long as the data sender receives
> ACKs.

I'm not sure this is an ideal method, as it may not be rigorous to
conclude from dupacks that the receiver is out of memory. Packet loss
can also cause multiple dupacks.

Thanks!
Menglong Dong

> AFAICT that would be another way to reach the happy state you
> mention: "all the 100 connections will enter the 0-probe state and
> connection breaks won't happen", and we could reach that state
> without violating the TCP protocol spec and without requiring changes
> on the receiver side (so that this fix could help in scenarios where
> the memory-constrained receiver is an older stack without special new
> behavior).
>
> Eric, Yuchung, Menglong: do you think something like that would work?
>
> neal
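For concreteness, the check Neal describes above might be shaped
roughly like the sketch below. This is purely illustrative, not code
from any kernel: the hook point and the helper
rto_rexmit_still_unacked() are hypothetical, while ICSK_TIME_PROBE0,
tcp_probe0_base() and inet_csk_reset_xmit_timer() are the existing
zero-window-probe primitives.

/* Hypothetical sender-side heuristic: after an RTO retransmission of
 * the first unacked segment, treat a dupack for SND.UNA as a hint
 * that the receiver is out of memory, and arm the 0-probe timer
 * instead of backing off the retransmit timer toward a connection
 * reset.
 */
static void tcp_guess_receiver_oom(struct sock *sk, u32 ack_seq)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (ack_seq == tp->snd_una &&	    /* dupack for SND.UNA */
	    rto_rexmit_still_unacked(sk)) { /* hypothetical helper */
		/* Case (b): receiver alive but likely out of memory.
		 * Keep probing as long as ACKs keep coming, as
		 * tcp_probe_timer() already does.
		 */
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
					  tcp_probe0_base(sk),
					  TCP_RTO_MAX);
	}
	/* Otherwise case (a): ordinary loss; normal ICSK_TIME_RETRANS
	 * handling applies.
	 */
}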