Date: Fri, 16 Feb 2024 10:00:17 -0800
From: Andy Lutomirski <luto@...capital.net>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Wei Wang <weiwan@...gle.com>, Network Development <netdev@...r.kernel.org>, 
	Eric Dumazet <edumazet@...gle.com>, "David S. Miller" <davem@...emloft.net>
Subject: Re: SO_RESERVE_MEM doesn't quite work, at least on UDP

On Fri, Feb 16, 2024 at 12:11 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On Thu, 2024-02-15 at 13:17 -0800, Andy Lutomirski wrote:
> > With SO_RESERVE_MEM, I can reserve memory, and it gets credited to
> > sk_forward_alloc.  But, as far as I can tell, nothing keeps it from
> > getting "reclaimed" (i.e. un-credited from sk_forward_alloc and
> > uncharged).  So, for UDP at least, it basically doesn't work.
>
> SO_RESERVE_MEM is basically not implemented (yet) for UDP. Patches are
> welcome - even if I would be curious about the use-case.

I've been chasing UDP packet drops under circumstances where they
really should not be happening.  I *think* something regressed between
6.2 and 6.5 (as I was seeing a lot of drops on a 6.5 machine and not
on a 6.2 machine, and they're both under light load and subscribed to
the same multicast group).  But regardless of whether there's an
actual regression, the logic seems rather ancient and complex.  All I
want is a reasonably sized buffer, and I have plenty of memory.  (It's
not 1995 any more -- I have many GB of RAM and I need a few tens of kB
of buffer.)
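
For reference, the option itself is a per-socket setsockopt taking a
byte count, and as far as I can tell it's a no-op unless the socket
sits in a non-root memcg.  A minimal sketch (the 256 kB figure is
made up):

    #include <sys/socket.h>
    #include <netinet/in.h>

    #ifndef SO_RESERVE_MEM
    #define SO_RESERVE_MEM 73   /* asm-generic value; older libcs lack it */
    #endif

    /* Reserve ~256 kB of forward-alloc for one UDP socket. */
    static int reserve_udp_mem(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int reserve = 256 * 1024;   /* bytes */

        if (fd < 0)
            return -1;
        return setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM,
                          &reserve, sizeof(reserve));
    }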

And, on inspection of the code, so far I've learned, in no particular order:

1. __sk_mem_raise_allocated() is called very frequently (as the
sk_forward_alloc mechanism is not particularly effective at reserving
memory).  And it calls sk_memory_allocated_add(), which looked
suspiciously like a horrible scalability problem until this patch:

commit 3cd3399dd7a84ada85cb839989cdf7310e302c7d
Author: Eric Dumazet <edumazet@...gle.com>
Date:   Wed Jun 8 23:34:09 2022 -0700

    net: implement per-cpu reserves for memory_allocated

and I suspect that the regression I'm chasing is here:

commit 4890b686f4088c90432149bd6de567e621266fa2
Author: Eric Dumazet <edumazet@...gle.com>
Date:   Wed Jun 8 23:34:11 2022 -0700

    net: keep sk->sk_forward_alloc as small as possible

(My hypothesis is that, before this change, there would frequently be
enough sk_forward_alloc left to at least drastically reduce the
frequency of transient failures due to protocol memory limits.)
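
The per-cpu reserve trick in that first commit is basically batched
updates to one shared counter.  A userspace sketch of the idea (with
_Thread_local standing in for per-CPU data, and the batch size made
up):

    #include <stdatomic.h>

    #define MEM_BATCH (1 << 16)     /* made-up batch size */

    static atomic_long memory_allocated;        /* shared, deliberately approximate */
    static _Thread_local long local_reserve;    /* stand-in for per-CPU data */

    static void memory_allocated_add(long amt)
    {
        local_reserve += amt;
        if (local_reserve >= MEM_BATCH || local_reserve <= -MEM_BATCH) {
            /* Fold the batch into the shared counter; this is the
             * only time the hot path touches shared cache lines. */
            atomic_fetch_add(&memory_allocated, local_reserve);
            local_reserve = 0;
        }
    }

The shared counter can be off by up to nr_cpus * MEM_BATCH at any
instant, which is fine for a limit check against multi-GB totals.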

2. If a socket wants to use memory in excess of sk_forward_alloc
(which is now very small) and rcvbuf is sufficient, it goes through
__sk_mem_raise_allocated, and that has a whole lot of complex rules.

2a. It checks memcg.  This does page_counter_try_charge, which does an
unconditional atomic add.  Possibly more than one (the charge walks
every level of the memcg hierarchy).  Ouch.

2b. It checks *global* per-protocol memory limits (net.ipv4.udp_mem
for UDP, counted in pages), and the defaults are *small* by modern
standards.  One slow program with a UDP socket and a big SO_RCVBUF can
easily use all the UDP memory, for example.
Also, why on Earth are we using global limits in a memcg world?
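
That scenario needs nothing exotic; a sketch (port and buffer size
made up):

    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* One slow reader pinning UDP memory: a big SO_RCVBUF (clamped
     * to net.core.rmem_max unless privileged) and a read loop that
     * never runs, so queued datagrams sit against the global pool. */
    static void slow_reader(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int big = 64 * 1024 * 1024;     /* 64 MB */
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port = htons(9999),    /* made-up port */
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };

        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &big, sizeof(big));
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        for (;;)
            pause();    /* never drain the queue */
    }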

2c. The per-protocol limits really look buggy.  The code says:

    if (sk_has_memory_pressure(sk)) {
        u64 alloc;

        if (!sk_under_memory_pressure(sk))
            return 1;
        alloc = sk_sockets_allocated_read_positive(sk);
        if (sk_prot_mem_limits(sk, 2) > alloc *
            sk_mem_pages(sk->sk_wmem_queued +
                 atomic_read(&sk->sk_rmem_alloc) +
                 sk->sk_forward_alloc))
            return 1;
    }

<-- Surely there should be a return 1 here?!?

suppress_allocation:

That goes all the way back to:

commit 3ab224be6d69de912ee21302745ea45a99274dbc
Author: Hideo Aoki <haoki@...hat.com>
Date:   Mon Dec 31 00:11:19 2007 -0800

    [NET] CORE: Introducing new memory accounting interface.

But it wasn't used for UDP back then, and I don't think the code path
in question is or was reachable for TCP.

3. A failure in any of this stuff gives the same drop_reason.  IMO it
would be really nice if the drop reason were split up into, say,
RCVBUFF_MEMCG, RCVBUFF_PROTO_HARDLIMIT, RCVBUFF_PROTO_PRESSURE or
similar.
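
Spelled out (hypothetical names, matching the failure modes above;
today all three land on a single reason):

    /* Hypothetical finer-grained reasons; these symbols don't
     * exist in enum skb_drop_reason today, which is the point. */
    enum hypothetical_rcvbuff_drop_reason {
        RCVBUFF_MEMCG,              /* memcg charge refused (2a) */
        RCVBUFF_PROTO_HARDLIMIT,    /* over sk_prot_mem_limits(sk, 2) (2b) */
        RCVBUFF_PROTO_PRESSURE,     /* suppressed under memory pressure (2c) */
    };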


And maybe there should be a way (a memcg option?) to just turn the
protocol limits off entirely within a memcg.  Or to have per-memcg
protocol limits.  Or something.  A single UDP socket in a different
memcg should not be able to starve the whole system such that even
just two (!) queued UDP datagrams don't fit in a receive queue.


Anyway, SO_RESERVE_MEM looks like it ought to make enqueueing on that
socket faster and more scalable, and exempt from random failures due
to protocol memory limits.
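
i.e., with a big enough reservation the charge should reduce to a
purely per-socket check.  A sketch of that fast path with stand-in
types (not the kernel's):

    /* Stand-in type, not the kernel's struct sock. */
    struct sock_sketch { long forward_alloc; };

    /* If the reservation covers the skb, charging is a plain local
     * subtraction: no shared counters, no memcg, no protocol limits. */
    static int charge_fast(struct sock_sketch *sk, int size)
    {
        if (size > sk->forward_alloc)
            return 0;   /* slow path: __sk_mem_raise_allocated() & co. */
        sk->forward_alloc -= size;
        return 1;
    }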


--Andy
