netdev - Re: [net,v2] tcp: correct handling of extreme memory squeeze

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <afb9ff14-a2f1-4c5a-a920-bce0105a7d41@redhat.com>
Date: Mon, 20 Jan 2025 00:03:08 -0500
From: Jon Maloy <jmaloy@...hat.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, kuba@...nel.org,
 passt-dev@...st.top, sbrivio@...hat.com, lvivier@...hat.com,
 dgibson@...hat.com, imagedong@...cent.com, eric.dumazet@...il.com,
 edumazet@...gle.com
Subject: Re: [net,v2] tcp: correct handling of extreme memory squeeze



On 2025-01-18 15:04, Neal Cardwell wrote:
> On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@...hat.com> wrote:
>>
>> From: Jon Maloy <jmaloy@...hat.com>
>>
>> Testing with iperf3 using the "pasta" protocol splicer has revealed
>> a bug in the way tcp handles window advertising in extreme memory
>> squeeze situations.
>>
>> Under memory pressure, a socket endpoint may temporarily advertise
>> a zero-sized window, but this is not stored as part of the socket data.
>> The reasoning behind this is that it is considered a temporary setting
>> which shouldn't influence any further calculations.
>>
>> However, if we happen to stall at an unfortunate value of the current
>> window size, the algorithm selecting a new value will consistently fail
>> to advertise a non-zero window once we have freed up enough memory.
> 
> The "if we happen to stall at an unfortunate value of the current
> window size" phrase is a little vague... :-) Do you have a sense of
> what might count as "unfortunate" here? That might help in crafting a
> packetdrill test to reproduce this and have an automated regression
> test.

Obviously, it happens when the following code snippet in

__tcp_cleanup_rbuf() {
    [....]
    if (copied > 0 && !time_to_ack &&
        !(sk->sk_shutdown & RCV_SHUTDOWN)) {
             __u32 rcv_window_now = tcp_receive_window(tp);

             /* Optimize, __tcp_select_window() is not cheap. */
             if (2*rcv_window_now <= tp->window_clamp) {
                 __u32 new_window = __tcp_select_window(sk);

                 /* Send ACK now, if this read freed lots of space
                  * in our buffer. Certainly, new_window is new window.
                  * We can advertise it now, if it is not less than 

                  * current one.
                  * "Lots" means "at least twice" here.
                  */
                 if (new_window && new_window >= 2 * rcv_window_now)
                         time_to_ack = true;
            }
     }
     [....]
}

yields time_to_ack = false, i.e.  __tcp_select_window(sk) returns
a value new_window  < (2 *  tcp_receive_window(tp)).

In my log I have for brevity used the following names:

win_now: same as rcv_window_now
     (= tcp_receive_window(tp),
      = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt,
      = 265469200 + 262144 -  265600160,
      = 131184)

new_win: same as new_window
      (= __tcp_select_window(sk),
       = 0 first time, later 262144 )

rcv_wnd: same as tp->rcv_wnd,
       (=262144)

We see that although the last test actually is pretty close
(262144 >= 262368 ? => false) it is not close enough.


We also notice that
(tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
(new_window >= 2 * rcv_window_now) will always be false.


Too me it looks like __tcp_select_window(sk) doesn't at all take the 
freed-up memory into account when calculating a new window. I haven't
looked into why that is happening.

> 
>> This means that this side's notion of the current window size is
>> different from the one last advertised to the peer, causing the latter
>> to not send any data to resolve the sitution.
> 
> Since the peer last saw a zero receive window at the time of the
> memory-pressure drop, shouldn't the peer be sending repeated zero
> window probes, and shouldn't the local host respond to a ZWP with an
> ACK with the correct non-zero window?

It should, but at the moment when I found this bug the peer stack was 
not the Linux kernel stack, but one we develop for our own purpose. We 
fixed that later, but it still means that traffic stops for a couple of 
seconds now and then before the timer restarts the flow. This happens
too often for comfort in our usage scenarios.
We can of course blame the the peer stack, but I still feel this is a
bug, and that it could be handled better by the kernel stack.
> 
> Do you happen to have a tcpdump .pcap of one of these cases that you can share?

I had one, although not for this particular run, and I cannot find it 
right now. I will continue looking or make a new one. Is there some 
shared space I can put it?

> 
>> The problem occurs on the iperf3 server side, and the socket in question
>> is a completely regular socket with the default settings for the
>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>>
>> The following excerpt of a logging session, with own comments added,
>> shows more in detail what is happening:
>>
>> //              tcp_v4_rcv(->)
>> //                tcp_rcv_established(->)
>> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
>> [5201<->39222]:     tcp_data_queue(->)
>> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>>                         [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> 
> What is "win_now"? That doesn't seem to correspond to any variable
> name in the Linux source tree. 

See above.

  Can this be renamed to the
> tcp_select_window() variable it is printing, like "cur_win" or
> "effective_win" or "new_win", etc?
> 
> Or perhaps you can attach your debugging patch in some email thread? I
> agree with Eric that these debug dumps are a little hard to parse
> without seeing the patch that allows us to understand what some of
> these fields are...
> 
> I agree with Eric that probably tp->pred_flags should be cleared, and
> a packetdrill test for this would be super-helpful.

I must admit I have never used packetdrill, but I can make an effort.

///jon

> 
> thanks,
> neal
>