Message-ID: <e6a3dfab-ccea-1a0c-6fd8-bfca466aefba@gmail.com>
Date:   Mon, 17 May 2021 16:42:35 +0300
From:   Leonard Crestez <cdleonard@...il.com>
To:     Eric Dumazet <edumazet@...gle.com>,
        Matt Mathis <mattmathis@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>
Cc:     "David S. Miller" <davem@...emloft.net>,
        Willem de Bruijn <willemb@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        David Ahern <dsahern@...nel.org>,
        John Heffner <johnwheffner@...il.com>,
        Leonard Crestez <lcrestez@...venets.com>,
        Soheil Hassas Yeganeh <soheil@...gle.com>,
        Roopa Prabhu <roopa@...ulusnetworks.com>,
        netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC 1/3] tcp: Consider mtu probing for tcp_xmit_size_goal

On 5/11/21 4:04 PM, Eric Dumazet wrote:
> On Tue, May 11, 2021 at 2:04 PM Leonard Crestez <cdleonard@...il.com> wrote:
>>
>> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
>> in order to accumulate enough data" but linux almost never does that.
>>
>> Linux checks for (probe_size + (1 + reorder) * mss_cache) bytes to be
>> available in the send buffer and if that condition is not met it will
>> send anyway using the current MSS. The feature can be made to work by
>> sending very large chunks of data from userspace (for example 128k) but
>> for small writes on fast links tcp mtu probes almost never happen.
> 
> Why should they happen ?
> 
> I am not sure the kernel should perform extra checks just because
> applications are not properly written.

My tests show that applications writing a few KB at a time almost never
trigger MTU probing often enough to reach 9200. The reasons for this are
very difficult for me to understand.

It seems that only writing in very large chunks like 160k makes it
happen, which is much more than the size_needed calculated inside
tcp_mtu_probe (which is about 50k). This seems unreasonable. Ideally
Linux should try to accumulate enough data for a probe (as the RFC
suggests), but at the very least it should send probes that fit inside
a single userspace write.

I dug a little deeper and what seems to happen is this:

  * size_needed is ~60k
  * once the head of the queue reaches size_needed, tcp_push_one is
    called, which sends everything and ignores MTU probing
  * size_needed is reached again and tcp_push_pending_frames is called.
    At this point the cwnd has shrunk below 11 (due to the previous
    burst), so probing is skipped again in favor of just sending in
    MSS-sized chunks.
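
As a rough illustration of the two conditions described above, here is a
toy userspace model (not kernel code). The struct and function names are
made up for this sketch. Only the formula
probe_size + (1 + reorder) * mss_cache and the "cwnd < 11" threshold come
from this thread, and the real function checks more than these two
conditions.

```c
#include <stdint.h>

/* Hypothetical state for the sketch; fields named after terms in the thread. */
struct probe_state {
    uint32_t probe_size;   /* candidate probe size, bytes */
    uint32_t reordering;   /* reordering metric, segments */
    uint32_t mss_cache;    /* current MSS, bytes */
    uint32_t snd_cwnd;     /* congestion window, segments */
    uint32_t queued_bytes; /* unsent bytes in the send buffer */
};

/* probe_size + (1 + reorder) * mss_cache, as in the patch description. */
static uint32_t probe_size_needed(const struct probe_state *s)
{
    return s->probe_size + (1 + s->reordering) * s->mss_cache;
}

/* Returns 1 if this toy model would attempt a probe, 0 if it would
 * fall back to sending at the current MSS. */
static int can_probe(const struct probe_state *s)
{
    if (s->snd_cwnd < 11)
        return 0; /* window too small, e.g. right after a burst */
    if (s->queued_bytes < probe_size_needed(s))
        return 0; /* not enough data queued for a probe */
    return 1;
}
```

In this model, a sender whose cwnd drops below 11 each time the queue
reaches size_needed never gets a probe out, which matches the behaviour
described in the list above.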

This happens repeatedly: a sender-limited app performing periodic 128k
writes will see MSS stuck below MTU.

I don't understand the push_one logic or why it completely skips MTU
probing. It seems like an optimization which doesn't take RFC4821 into
account.

>> This patch tries to take mtu probe into account in tcp_xmit_size_goal, a
>> function which otherwise attempts to accumulate a packet suitable for
>> TSO. No delays are introduced beyond existing autocork heuristics.
> 
> 
> MTU probing should not be attempted for every write().
> This belongs to some kind of slow path, once in a while.

MTU probing is only attempted every 10 minutes, but once a probe is
pending it does have a slight impact on every write. This is already the
case: tcp_write_xmit calls tcp_mtu_probe almost every time.

I had an idea for reducing the overhead in tcp_size_needed but it turns 
out I was indeed mistaken about what this function does. I thought it 
returned ~mss when all GSO is disabled but this is not so.

>>   static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>>                                         int large_allowed)
>>   {
>> +       struct inet_connection_sock *icsk = inet_csk(sk);
>>          struct tcp_sock *tp = tcp_sk(sk);
>>          u32 new_size_goal, size_goal;
>>
>>          if (!large_allowed)
>>                  return mss_now;
>> @@ -932,11 +933,19 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>>                  tp->gso_segs = min_t(u16, new_size_goal / mss_now,
>>                                       sk->sk_gso_max_segs);
>>                  size_goal = tp->gso_segs * mss_now;
>>          }
>>
>> -       return max(size_goal, mss_now);
>> +       size_goal = max(size_goal, mss_now);
>> +
>> +       if (unlikely(icsk->icsk_mtup.wait_data)) {
>> +               int mtu_probe_size_needed = tcp_mtu_probe_size_needed(sk, NULL);
>> +               if (mtu_probe_size_needed > 0)
>> +                       size_goal = max(size_goal, (u32)mtu_probe_size_needed);
>> +       }
> 
> 
> I think you are mistaken.
> 
> This function usually returns 64KB depending on MSS.
>   Have you really tested this part ?

I assumed that with all GSO features disabled this function returns one
MSS, but this is not true. My patch had a positive effect only because I
made tcp_mtu_probe return "0" instead of "-1" if not enough data is
queued.

I don't fully understand the implications of that change, though. If
tcp_mtu_probe returns zero, what guarantee is there that data will
eventually be sent even if no further userspace writes happen?
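
To make the concern concrete, here is a simplified userspace sketch of
how a caller might treat the three results. It assumes the convention
that -1 means "no probe, transmit normally", 0 means "wait to probe" and
1 means "a probe was sent". The enum and function names are hypothetical,
and the real transmit path does considerably more than this.

```c
/* Hypothetical names for the three results discussed in this thread. */
enum probe_result {
    PROBE_SKIP = -1, /* no probe: continue with normal transmit */
    PROBE_WAIT = 0,  /* hold off: do not transmit in this call */
    PROBE_SENT = 1   /* a probe segment was sent */
};

/* Returns how many bytes this toy transmit step would put on the wire. */
static unsigned int xmit_step(enum probe_result r, unsigned int queued,
                              unsigned int probe_size)
{
    switch (r) {
    case PROBE_WAIT:
        /* Nothing is sent now; some later event has to retry. */
        return 0;
    case PROBE_SENT:
        /* Simplification: count only the probe itself. */
        return probe_size < queued ? probe_size : queued;
    case PROBE_SKIP:
    default:
        /* Normal path: drain the queue in MSS-sized chunks. */
        return queued;
    }
}
```

The PROBE_WAIT branch is where the question applies: if nothing else
(another write, an incoming ACK, a timer) calls the transmit path again,
the queued data stays unsent in this model.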

I'd welcome any suggestions.

--
Regards,
Leonard