Message-ID: <CANn89iKPPt3tozuDaSfsop5YbvgRoKha=dgTR2-ReoYEvA-_DA@mail.gmail.com>
Date: Fri, 16 Feb 2024 19:36:24 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Paolo Abeni <pabeni@...hat.com>, Wei Wang <weiwan@...gle.com>, 
	Network Development <netdev@...r.kernel.org>, "David S. Miller" <davem@...emloft.net>
Subject: Re: SO_RESERVE_MEM doesn't quite work, at least on UDP

On Fri, Feb 16, 2024 at 7:00 PM Andy Lutomirski <luto@...capital.net> wrote:
>
> On Fri, Feb 16, 2024 at 12:11 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > On Thu, 2024-02-15 at 13:17 -0800, Andy Lutomirski wrote:
> > > With SO_RESERVE_MEM, I can reserve memory, and it gets credited to
> > > sk_forward_alloc.  But, as far as I can tell, nothing keeps it from
> > > getting "reclaimed" (i.e. un-credited from sk_forward_alloc and
> > > uncharged).  So, for UDP at least, it basically doesn't work.
> >
> > SO_RESERVE_MEM is basically not implemented (yet) for UDP. Patches are
> > welcome - even if I would be curious about the use-case.
>
> I've been chasing UDP packet drops under circumstances where they
> really should not be happening.  I *think* something regressed between
> 6.2 and 6.5 (as I was seeing a lot of drops on a 6.5 machine and not
> on a 6.2 machine, and they're both under light load and subscribed to
> the same multicast group).  But regardless of whether there's an
> actual regression, the logic seems rather ancient and complex.  All I
> want is a reasonable size buffer, and I have plenty of memory.  (It's
> not 1995 any more -- I have many GB of RAM and I need a few tens of kB
> of buffer.)
>
> And, on inspection of the code, so far I've learned, in no particular order:
>
> 1. __sk_mem_raise_allocated() is called very frequently (as the
> sk_forward_alloc mechanism is not particularly effective at reserving
> memory).  And it calls sk_memory_allocated_add(), which looked
> suspiciously like a horrible scalability problem until this patch:
>
> commit 3cd3399dd7a84ada85cb839989cdf7310e302c7d
> Author: Eric Dumazet <edumazet@...gle.com>
> Date:   Wed Jun 8 23:34:09 2022 -0700
>
>     net: implement per-cpu reserves for memory_allocated
>
> and I suspect that the regression I'm chasing is here:
>
> commit 4890b686f4088c90432149bd6de567e621266fa2
> Author: Eric Dumazet <edumazet@...gle.com>
> Date:   Wed Jun 8 23:34:11 2022 -0700
>
>     net: keep sk->sk_forward_alloc as small as possible
>
> (My hypothesis is that, before this change, there would frequently be
> enough sk_forward_alloc left to at least drastically reduce the
> frequency of transient failures due to protocol memory limits.)
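
(For what it's worth, the reclaim side after that commit looks roughly like
the following; this is paraphrased from include/net/sock.h in recent kernels,
not quoted verbatim.  Whatever SO_RESERVE_MEM protects is meant to be excluded
from reclaim via sk_unused_reserved_mem():)

    /* Anything beyond the SO_RESERVE_MEM-protected amount is handed back
     * to the protocol counters as soon as it exceeds one page, which is
     * why sk_forward_alloc now stays tiny between allocations.
     */
    static inline void sk_mem_reclaim(struct sock *sk)
    {
        int reclaimable;

        if (!sk_has_account(sk))
            return;

        reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);

        if (reclaimable >= (int)PAGE_SIZE)
            __sk_mem_reclaim(sk, reclaimable);
    }
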
>
> 2. If a socket wants to use memory in excess of sk_forward_alloc
> (which is now very small) and rcvbuf is sufficient, it goes through
> __sk_mem_raise_allocated, and that has a whole lot of complex rules.
>
> 2a. It checks memcg.  This does page_counter_try_charge, which does an
> unconditional atomic add.  Possibly more than one.  Ouch.
>
> 2b. It checks *global* per-protocol memory limits, and the defaults
> are *small* by modern standards.  One slow program with a UDP socket
> and a big SO_RCVBUF can easily use all the UDP memory, for example.
> Also, why on Earth are we using global limits in a memcg world?
>
> 2c. The per-protocol limits really look buggy.  The code says:
>
>     if (sk_has_memory_pressure(sk)) {
>         u64 alloc;
>
>         if (!sk_under_memory_pressure(sk))
>             return 1;
>         alloc = sk_sockets_allocated_read_positive(sk);
>         if (sk_prot_mem_limits(sk, 2) > alloc *
>             sk_mem_pages(sk->sk_wmem_queued +
>                  atomic_read(&sk->sk_rmem_alloc) +
>                  sk->sk_forward_alloc))
>             return 1;
>     }
>
> <-- Surely there should be a return 1 here?!?
>
> suppress_allocation:
>
> That goes all the way back to:
>
> commit 3ab224be6d69de912ee21302745ea45a99274dbc
> Author: Hideo Aoki <haoki@...hat.com>
> Date:   Mon Dec 31 00:11:19 2007 -0800
>
>     [NET] CORE: Introducing new memory accounting interface.
>
> But it wasn't used for UDP back then, and I don't think the code path
> in question is or was reachable for TCP.
>
> 3. A failure in any of this stuff gives the same drop_reason.  IMO it
> would be really nice if the drop reason were split up into, say,
> RCVBUF_MEMCG, RCVBUF_PROTO_HARDLIMIT, RCVBUF_PROTO_PRESSURE or
> similar.
>
>
> And maybe there should be a way (a memcg option?) to just turn the
> protocol limits off entirely within a memcg.  Or to have per-memcg
> protocol limits.  Or something.  A single UDP socket in a different
> memcg should not be able to starve the whole system such that even
> just two (!) queued UDP datagrams don't fit in a receive queue.
>
>
> Anyway, SO_RESERVE_MEM looks like it ought to make enqueueing on that
> socket faster and more scalable.  And exempt from random failures due
> to protocol memory limits.

Yes, this was the goal, but so far only implemented for TCP.

Have you tried to bump SK_MEMORY_PCPU_RESERVE to a higher value?

SK_MEMORY_PCPU_RESERVE has been set to a very conservative value,
based on TCP workloads.
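
(For context, SK_MEMORY_PCPU_RESERVE is the per-cpu batch size introduced by
commit 3cd3399dd7a8 quoted above.  Paraphrasing the mechanism rather than
quoting it verbatim: each cpu accumulates charges locally and only folds them
into the shared atomic once the batch threshold is crossed, so raising the
constant trades accounting precision for fewer atomic operations:)

    /* Default batch: 1 MB worth of pages per cpu. */
    #define SK_MEMORY_PCPU_RESERVE (1 << (20 - PAGE_SHIFT))

    static inline void sk_memory_allocated_add(struct sock *sk, int amt)
    {
        int local_reserve;

        preempt_disable();
        local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
        if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
            /* Fold the accumulated per-cpu amount into the shared counter. */
            __this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
            atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
        }
        preempt_enable();
    }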
