netdev - Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits

Message-ID: <eb6121ea-b02d-672e-25c9-2ad054d49fc7@gmail.com>
Date:   Thu, 11 Jul 2019 11:19:45 +0200
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Christoph Paasch <christoph.paasch@...il.com>,
        Eric Dumazet <eric.dumazet@...il.com>
Cc:     "Prout, Andrew - LLSC - MITLL" <aprout@...mit.edu>,
        David Miller <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Jonathan Looney <jtl@...flix.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Tyler Hicks <tyhicks@...onical.com>,
        Yuchung Cheng <ycheng@...gle.com>,
        Bruce Curtis <brucec@...flix.com>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Dustin Marquess <dmarquess@...le.com>
Subject: Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory
 limits



On 7/11/19 9:28 AM, Christoph Paasch wrote:
> 
> 
>> On Jul 10, 2019, at 9:26 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>>
>>
>>
>> On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
>>>
>>> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with a synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
>>>
>>> The synthetic test was a pair of simple send/recv test programs under the following conditions:
>>> -The send socket was non-blocking
>>> -SO_SNDBUF set to 128KiB
>>> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
>>> -Load was on both systems: a while(1) program spinning on each CPU core
>>> -The receiver was on an older unaffected kernel
>>>
>>
>> Setting SO_SNDBUF to 128KB does not allow recovery from heavy losses,
>> since skbs need to be allocated for retransmits.
> 
> Would it make sense to always allow the alloc in tcp_fragment when coming from __tcp_retransmit_skb() through the retransmit-timer ?

4.15+ kernels have:

if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
    tcp_queue != TCP_FRAG_IN_WRITE_QUEUE)) {


Meaning that things like TLP will succeed.
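
As a rough userspace illustration of the check quoted above (not kernel code): the split is refused only when both conditions hold, that is, queued memory exceeds twice the send buffer and the skb is not in the write queue. The names frag_queue and frag_refused are hypothetical stand-ins for this sketch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's tcp_queue values. */
enum frag_queue {
	FRAG_IN_WRITE_QUEUE,
	FRAG_IN_RTX_QUEUE,
};

/*
 * Hypothetical model of the 4.15+ condition: returns true when the
 * fragmentation request would be refused.
 *   wmem_queued: bytes currently charged to the socket send queue
 *   sndbuf:      the socket send buffer limit
 */
static bool frag_refused(int wmem_queued, int sndbuf, enum frag_queue q)
{
	return (wmem_queued >> 1) > sndbuf && q != FRAG_IN_WRITE_QUEUE;
}
```

With a 128 KiB sndbuf, frag_refused(300000, 131072, FRAG_IN_RTX_QUEUE) is true, while the same numbers with FRAG_IN_WRITE_QUEUE give false.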

Anything we add to the TCP stack that lets a socket exceed twice its SO_SNDBUF limit _will_ be exploited at scale.

I am not sure we want to continue to support small SO_SNDBUF values, as they make no sense today.

We use a 64 MB autotuning limit, and set /proc/sys/net/ipv4/tcp_notsent_lowat to 1 MB.
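
For an application that cannot change the system-wide sysctl, the per-socket TCP_NOTSENT_LOWAT option is the closest equivalent. A minimal sketch, assuming a Linux system whose headers define TCP_NOTSENT_LOWAT; the helper names are hypothetical:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: set the per-socket threshold of not-yet-sent
 * bytes below which the socket is reported writable.
 * Returns 0 on success, -1 on error (errno set by setsockopt).
 */
static int set_notsent_lowat(int fd, unsigned int bytes)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &bytes, sizeof(bytes));
}

/* Hypothetical helper: read the threshold back; returns 0 on success. */
static int get_notsent_lowat(int fd, unsigned int *bytes)
{
	socklen_t len = sizeof(*bytes);

	return getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, bytes, &len);
}
```

Calling set_notsent_lowat(fd, 1u << 20) on a TCP socket would request the 1 MB value mentioned above for that socket only.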

I would rather work (when net-next reopens) on better collapsing at retransmit time to reduce the overhead.


Something like:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f6a9c95a48edb234e4d4e21bf585744fbaf9a0a7..d5c85986209cd162cf39edb787b1385cb2c8b630 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2860,7 +2860,7 @@ static int __net_init tcp_sk_init(struct net *net)
        net->ipv4.sysctl_tcp_early_retrans = 3;
        net->ipv4.sysctl_tcp_recovery = TCP_RACK_LOSS_DETECTION;
        net->ipv4.sysctl_tcp_slow_start_after_idle = 1; /* By default, RFC2861 behavior.  */
-       net->ipv4.sysctl_tcp_retrans_collapse = 1;
+       net->ipv4.sysctl_tcp_retrans_collapse = 3;
        net->ipv4.sysctl_tcp_max_reordering = 300;
        net->ipv4.sysctl_tcp_dsack = 1;
        net->ipv4.sysctl_tcp_app_win = 31;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d61264cf89ef66b229ecf797c1abfb7fcdab009f..05cd264f98b084f62eaf2ef9d6e14a392670d02c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3015,8 +3015,6 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
        next_skb_size = next_skb->len;
 
-       BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
-
        if (next_skb_size) {
                if (next_skb_size <= skb_availroom(skb))
                        skb_copy_bits(next_skb, 0, skb_put(skb, next_skb_size),
@@ -3054,8 +3052,6 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 /* Check if coalescing SKBs is legal. */
 static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 {
-       if (tcp_skb_pcount(skb) > 1)
-               return false;
        if (skb_cloned(skb))
                return false;
        /* Some heuristics for collapsing over SACK'd could be invented */
@@ -3114,7 +3110,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        unsigned int cur_mss;
-       int diff, len, err;
+       int diff, len, maxlen, err;
 
 
        /* Inconclusive MTU probe */
@@ -3165,12 +3161,13 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
                        return -ENOMEM;
 
                diff = tcp_skb_pcount(skb);
+               maxlen = (sock_net(sk)->ipv4.sysctl_tcp_retrans_collapse & 2) ? len : cur_mss;
+               if (skb->len < maxlen)
+                       tcp_retrans_try_collapse(sk, skb, maxlen);
                tcp_set_skb_tso_segs(skb, cur_mss);
                diff -= tcp_skb_pcount(skb);
                if (diff)
                        tcp_adjust_pcount(sk, skb, diff);
-               if (skb->len < cur_mss)
-                       tcp_retrans_try_collapse(sk, skb, cur_mss);
        }
 
        /* RFC3168, section 6.1.1.1. ECN fallback */
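
The maxlen selection in the last hunk can be modeled in userspace as below. This is only a sketch: collapse_maxlen is a hypothetical name, and it assumes that len in __tcp_retransmit_skb() is the retransmit budget in bytes (cur_mss times the number of segments), which the hunk itself does not show.

```c
/*
 * Hypothetical model of the proposed sysctl_tcp_retrans_collapse bits:
 *   bit 0 (value 1): collapsing enabled (existing behavior)
 *   bit 1 (value 2): allow collapsing up to len instead of one MSS
 * Returns the byte limit that would be passed to the collapse routine.
 */
static int collapse_maxlen(int sysctl_retrans_collapse, int len, int cur_mss)
{
	return (sysctl_retrans_collapse & 2) ? len : cur_mss;
}
```

With the proposed default of 3, collapse_maxlen(3, 4 * 1448, 1448) yields 5792, while the old default of 1 keeps the limit at a single MSS of 1448.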