netdev - Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s

Message-ID: <CANn89i+L2DuD2+EMHzwZ=qYYKo1A9gw=nTTmh20GV_o9ADxe2Q@mail.gmail.com>
Date:   Wed, 28 Apr 2021 14:20:13 +0200
From:   Eric Dumazet <edumazet@...gle.com>
To:     Matt Corallo <netdev-list@...tcorallo.com>
Cc:     "David S. Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Alexey Kuznetsov <kuznet@....inr.ac.ru>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        Willy Tarreau <w@....eu>, Keyu Man <kman001@....edu>
Subject: Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout
 to 1s, from 30s

On Wed, Apr 28, 2021 at 4:29 AM Matt Corallo
<netdev-list@...tcorallo.com> wrote:
>
> The default IP reassembly timeout of 30 seconds predates git
> history (and cursory web searches turn up nothing related to it).
> The only relevant source cited in net/ipv4/ip_fragment.c is RFC
> 791 defining IPv4 in 1981. RFC 791 suggests allowing the timer to
> increase on the receipt of each fragment (which Linux deliberately
> does not do), with a default timeout for each fragment of 15
> seconds. It suggests 15s to cap a 10Kb/s flow to a 150Kb buffer of
> fragments.
>
> When Linux receives a fragment, if the total memory used for the
> fragment reassembly buffer (across all senders) exceeds
> net.ipv4.ipfrag_high_thresh (or the equivalent for IPv6), it
> silently drops all future fragments until the timers on
> the original fragments expire.
>
> All the way in 2021, these numbers feel almost comical. The default
> buffer size for fragmentation reassembly is hard-coded at 4MiB as
> `net->ipv4.fqdir->high_thresh = 4 * 1024 * 1024;` capping a host at
> 1.06Mb/s of lost fragments before all fragments received on the
> host are dropped (with independent limits for IPv6).
>
> At 1.06Mb/s of lost fragments, we move from DoS attack threshold to
> real-world scenarios - at moderate loss rates on consumer networks
> today it's fairly easy to hit this, causing end hosts with their MTU
> (mis-)configured to fragment to see blocks of 100% packet loss
> lasting 10-20 seconds.
>
> Reducing the default fragment timeout to 1sec gives us 32Mb/s of
> fragments before we drop all fragments, which is certainly more in
> line with today's network speeds than 1.06Mb/s, though an optimal
> value may be still lower. Sadly, reducing it further requires a
> change to the sysctl interface, as net.ipv4.ipfrag_time is only
> specified in seconds.
> ---
>   include/net/ip.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 2d6b985d11cc..f1473ac5a27c 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -135,7 +135,7 @@ struct ip_ra_chain {
>   #define IP_MF        0x2000        /* Flag: "More Fragments"    */
>   #define IP_OFFSET    0x1FFF        /* "Fragment Offset" part    */
>
> -#define IP_FRAG_TIME    (30 * HZ)        /* fragment lifetime    */
> +#define IP_FRAG_TIME    (1 * HZ)        /* fragment lifetime    */
>
>   struct msghdr;
>   struct net_device;
> --
> 2.30.2


This is going to break many use cases.

I can certainly say that in many cases we need more than 1 second to
complete reassembly.
Some Internet users share satellite links with 600 ms RTT; not
everybody has fiber links in 2021.

There is a sysctl (net.ipv4.ipfrag_time) exactly for the cases where
admins decide to make the value smaller.

You can laugh all you want; the sad thing about IP frags is that some
applications really do still want to use them.

Also, admins willing to use 400 MB of memory instead of 4MB can just
change a sysctl.
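
For reference, the trade-off between buffer size and timeout discussed above can be worked through numerically. The following is a minimal sketch, not kernel code: the function name is hypothetical, "MB" is assumed to mean MiB, and per-fragment bookkeeping overhead is ignored, so real limits would be somewhat lower. It computes the sustained rate of never-completed fragments that fills the reassembly buffer within one timeout period. The 4 MiB / 30 s default comes out at roughly 1.07 Mib/s (quoted as 1.06Mb/s in the patch description) and 4 MiB / 1 s at 32 Mib/s.

```python
MIB = 1024 * 1024  # bytes per MiB (also bits per Mib)


def max_lost_fragment_rate_mibps(high_thresh_bytes: int, timeout_s: float) -> float:
    """Rate (Mib/s) of orphaned fragments that fills the buffer in one timeout.

    Simplification: counts payload bytes only and ignores any
    per-fragment memory overhead.
    """
    return high_thresh_bytes * 8 / timeout_s / MIB


if __name__ == "__main__":
    scenarios = [
        ("default: 4 MiB, 30 s", 4 * MIB, 30),
        ("patch:   4 MiB, 1 s", 4 * MIB, 1),
        ("100 MiB, 30 s", 100 * MIB, 30),
        ("400 MiB, 30 s", 400 * MIB, 30),
    ]
    for label, thresh, timeout in scenarios:
        rate = max_lost_fragment_rate_mibps(thresh, timeout)
        print(f"{label}: {rate:.2f} Mib/s")
```

Under these assumptions, raising the threshold to 100 MiB with the 30 s timeout gives a comparable order of headroom to cutting the timeout to 1 s at 4 MiB, without shortening the time slow links have to complete reassembly.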

Again, nothing will prevent reassembly units from being DDoS targets.

At Google, we use 100 MB for /proc/sys/net/ipv4/ipfrag_high_thresh and
/proc/sys/net/ipv6/ip6frag_high_thresh;
no kernel patch is needed.
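
As an illustration of inspecting these settings without patching the kernel, here is a small sketch that reads the current values from /proc/sys. The helper name is hypothetical, and the two `*_time` paths are listed on the assumption that they sit alongside the thresholds named above. Files that do not exist on a given host are reported as unavailable rather than treated as errors.

```python
from typing import Optional

# Paths named in this thread, plus the matching timeout files (assumed layout).
FRAG_SYSCTLS = [
    "/proc/sys/net/ipv4/ipfrag_high_thresh",
    "/proc/sys/net/ipv4/ipfrag_time",
    "/proc/sys/net/ipv6/ip6frag_high_thresh",
    "/proc/sys/net/ipv6/ip6frag_time",
]


def read_int_sysctl(path: str) -> Optional[int]:
    """Return the integer stored at `path`, or None if unreadable or non-numeric."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for path in FRAG_SYSCTLS:
        value = read_int_sysctl(path)
        shown = "unavailable" if value is None else str(value)
        print(f"{path}: {shown}")
```

Changing a value requires root, for example by writing the new number to the same file or running `sysctl -w net.ipv4.ipfrag_high_thresh=104857600` (100 MiB, assuming "100 MB" means MiB).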