Date:   Fri, 6 Jul 2018 04:23:15 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org
Cc:     "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Florian Westphal <fw@...len.de>, NeilBrown <neilb@...e.com>
Subject: Re: [RFC PATCH] ip: re-introduce fragments cache worker



On 07/06/2018 03:10 AM, Paolo Abeni wrote:
> Currently, the IP frag cache is fragile under overload. With
> flow control disabled:
> 
> ./super_netperf.sh 10  -H 192.168.101.2 -t UDP_STREAM -l 60
> 9618.08
> ./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60
> 28.66
> 
> Once the overload condition is reached, the system does not
> recover until it is almost completely idle:
> 
> ./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60 &
> sleep 4; I=0;
> for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; done
> 13.72
> 
> This is due to the removal of the fragment cache worker, which
> was responsible for freeing some IP fragment cache memory when the
> high threshold was reached, allowing the system to cope successfully
> with subsequent fragmented packets.
> 
> This commit re-introduces the worker, on a per-netns basis. Thanks
> to rhashtable walkers, we need to block BHs only for the removal of
> an individual entry.
> 
> After this commit (and likewise before the IP frag worker removal):
> 
> ./super_netperf.sh 10  -H 192.168.101.2 -t UDP_STREAM -l 60
> 9618.08
> 
> ./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60
> 8599.77
> 
> ./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60 &
> sleep 4; I=0;
> for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; done
> 9623.12
> 
> Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
> Signed-off-by: Paolo Abeni <pabeni@...hat.com>
> ---
> Note: tweaking the ipfrag sysctls does not completely solve the issue:
> - raising ipfrag_high_thresh increases the number of parallel
>   connections required to degrade the throughput, but once the IP
>   fragment cache capacity is reached the goodput still drops almost
>   to 0; with the worker we get much nicer behaviour.
> - setting ipfrag_time to 2 increases the chance to recover from
>   overload (test #2 above), but across several runs of that test
>   I got an average of 50% of the expected throughput with a very
>   large variance; with the worker we always see the expected/
>   line-rate throughput.
> ---
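
The body of the RFC patch itself is not quoted in this message. As a rough,
minimal sketch of the mechanism the commit message describes (a per-netns
worker that walks the frag rhashtable and evicts entries until memory usage
falls below the low threshold, blocking BHs only around each individual
removal), one could imagine something along the following lines. The
inet_frags_evict_work() name and the frags_work member of struct netns_frags
are assumptions for illustration only, not the actual patch:

/* Illustrative sketch only, not the actual RFC patch: evict frag queues
 * from one netns until memory usage drops below the low threshold.
 * The frags_work member of struct netns_frags is assumed here.
 */
static void inet_frags_evict_work(struct work_struct *work)
{
	struct netns_frags *nf = container_of(work, struct netns_frags,
					      frags_work);
	struct inet_frag_queue *fq;
	struct rhashtable_iter iter;

	rhashtable_walk_enter(&nf->rhashtable, &iter);
	rhashtable_walk_start(&iter);

	while (frag_mem_limit(nf) > nf->low_thresh) {
		fq = rhashtable_walk_next(&iter);
		if (IS_ERR(fq)) {
			if (PTR_ERR(fq) == -EAGAIN)
				continue;	/* table resized, retry */
			break;
		}
		if (!fq)
			break;			/* walked the whole table */

		/* BHs are blocked only around a single entry removal. */
		spin_lock_bh(&fq->lock);
		if (!(fq->flags & INET_FRAG_COMPLETE) &&
		    refcount_inc_not_zero(&fq->refcnt)) {
			inet_frag_kill(fq);
			spin_unlock_bh(&fq->lock);
			inet_frag_put(fq);
			continue;
		}
		spin_unlock_bh(&fq->lock);
	}

	rhashtable_walk_stop(&iter);
	rhashtable_walk_exit(&iter);
}

A real version would also have to schedule this work from the allocation
path once frag_mem_limit() crosses the high threshold, which is the part
this sketch leaves out.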

Ho hum. No please.

I do not think adding back a GC is wise, since my patches were going in the direction
of allowing us to increase limits on current hardware.

Meaning that the number of frags to evict would be quite big under DDoS.
(One inet_frag_queue allocated for every incoming tiny frame :/ )

A GC is a _huge_ problem, burning one CPU (you would have to provision for this CPU)
compared to letting the normal per-frag timer do its job.

My plan was to reduce the per-frag timer under load (the default is 30 seconds), since
this is exactly what your patch is doing indirectly, by aggressively pruning
frags under stress.

That would be a much simpler heuristic. [1]

BTW my own results (before the patch) are:

lpaa5:/export/hda3/google/edumazet# ./super_netperf 10 -H 10.246.7.134 -t UDP_STREAM -l 60
   9602
lpaa5:/export/hda3/google/edumazet# ./super_netperf 200 -H 10.246.7.134 -t UDP_STREAM -l 60
   9557

On the receiver (normal settings here) I had:

lpaa6:/export/hda3/google/edumazet# grep . /proc/sys/net/ipv4/ipfrag_*
/proc/sys/net/ipv4/ipfrag_high_thresh:104857600
/proc/sys/net/ipv4/ipfrag_low_thresh:78643200
/proc/sys/net/ipv4/ipfrag_max_dist:0
/proc/sys/net/ipv4/ipfrag_secret_interval:0
/proc/sys/net/ipv4/ipfrag_time:30

lpaa6:/export/hda3/google/edumazet# grep FRAG /proc/net/sockstat
FRAG: inuse 824 memory 53125312

[1] Something like (for IPv4 only here)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index c9e35b81d0931df8429a33e8d03e719b87da0747..88ed61bcda00f3357724e5c4dbcb97400b4a8b21 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -155,9 +155,15 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
                                               struct inet_frags *f,
                                               void *arg)
 {
+       long high_thresh = READ_ONCE(nf->high_thresh);
        struct inet_frag_queue *q;
+       u64 timeout;
+       long usage;
 
-       if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh)
+       if (!high_thresh)
+               return NULL;
+       usage = frag_mem_limit(nf);
+       if (usage > high_thresh)
                return NULL;
 
        q = kmem_cache_zalloc(f->frags_cachep, GFP_ATOMIC);
@@ -171,6 +177,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
        timer_setup(&q->timer, f->frag_expire, 0);
        spin_lock_init(&q->lock);
        refcount_set(&q->refcnt, 3);
+       timeout = (u64)nf->timeout * (high_thresh - usage);
+       mod_timer(&q->timer, jiffies + div64_long(timeout, high_thresh));
 
        return q;
 }
@@ -186,8 +194,6 @@ static struct inet_frag_queue *inet_frag_create(struct netns_frags *nf,
        if (!q)
                return NULL;
 
-       mod_timer(&q->timer, jiffies + nf->timeout);
-
        err = rhashtable_insert_fast(&nf->rhashtable, &q->node,
                                     f->rhash_params);
        if (err < 0) {
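
The effect of the snippet above is to scale the per-queue timer linearly
with the remaining headroom below ipfrag_high_thresh: a queue allocated
while the cache is nearly empty still gets the full nf->timeout, while one
allocated when the cache is nearly full expires almost immediately. As a
rough worked example (numbers assumed, not taken from this thread): with
ipfrag_time = 30 seconds, ipfrag_high_thresh = 100 MB and 75 MB already in
use, the new queue's timer would be armed at about
30 * (100 - 75) / 100 = 7.5 seconds instead of 30.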
