[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALCETrVwAT39fM89O0BqW9KAVfOFQo590g-Zs6mt+yAkoCvZZQ@mail.gmail.com>
Date: Fri, 16 Feb 2024 10:00:17 -0800
From: Andy Lutomirski <luto@...capital.net>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Wei Wang <weiwan@...gle.com>, Network Development <netdev@...r.kernel.org>,
Eric Dumazet <edumazet@...gle.com>, "David S. Miller" <davem@...emloft.net>
Subject: Re: SO_RESERVE_MEM doesn't quite work, at least on UDP
On Fri, Feb 16, 2024 at 12:11 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On Thu, 2024-02-15 at 13:17 -0800, Andy Lutomirski wrote:
> > With SO_RESERVE_MEM, I can reserve memory, and it gets credited to
> > sk_forward_alloc. But, as far as I can tell, nothing keeps it from
> > getting "reclaimed" (i.e. un-credited from sk_forward_alloc and
> > uncharged). So, for UDP at least, it basically doesn't work.
>
> SO_RESERVE_MEM is basically not implemented (yet) for UDP. Patches are
> welcome - even if I would be curious about the use-case.
I've been chasing UDP packet drops under circumstances where they
really should not be happening. I *think* something regressed between
6.2 and 6.5 (as I was seeing a lot of drops on a 6.5 machine and not
on a 6.2 machine, and they're both under light load and subscribed to
the same multicast group). But regardless of whether there's an
actual regression, the logic seems rather ancient and complex. All I
want is a reasonable size buffer, and I have plenty of memory. (It's
not 1995 any more -- I have many GB of RAM and I need a few tens of kB
of buffer.)
And, on inspection of the code, so far I've learned, in no particular order:
1. __sk_raise_mem_allocated() is called very frequently (as the
sk_forward_alloc mechanism is not particularly effective at reserving
memory). And it calls sk_memory_allocated_add(), which looked
suspiciously like a horrible scalability problem until this patch:
commit 3cd3399dd7a84ada85cb839989cdf7310e302c7d
Author: Eric Dumazet <edumazet@...gle.com>
Date: Wed Jun 8 23:34:09 2022 -0700
net: implement per-cpu reserves for memory_allocated
and I suspect that the regression I'm chasing is here:
commit 4890b686f4088c90432149bd6de567e621266fa2
Author: Eric Dumazet <edumazet@...gle.com>
Date: Wed Jun 8 23:34:11 2022 -0700
net: keep sk->sk_forward_alloc as small as possible
(My hypothesis is that, before this change, there would frequently be
enough sk_forward_alloc left to at least drastically reduce the
frequency of transient failures due to protocol memory limits.)
2. If a socket wants to use memory in excess of sk_forward_alloc
(which is now very small) and rcvbuf is sufficient, it goes through
__sk_mem_raise_allocated, and that has a whole lot of complex rules.
2a. It checks memcg. This does page_counter_try_charge, which does an
unconditional atomic add. Possibly more than one. Ouch.
2b. It checks *global* per-protocol memory limits, and the defaults
are *small* by modern standards. One slow program with a UDP socket
and a big SO_RCVBUF can easily use all the UDP memory, for example.
Also, why on Earth are we using global limits in a memcg world?
2c. The per-protocol limits really look buggy. The code says:
if (sk_has_memory_pressure(sk)) {
u64 alloc;
if (!sk_under_memory_pressure(sk))
return 1;
alloc = sk_sockets_allocated_read_positive(sk);
if (sk_prot_mem_limits(sk, 2) > alloc *
sk_mem_pages(sk->sk_wmem_queued +
atomic_read(&sk->sk_rmem_alloc) +
sk->sk_forward_alloc))
return 1;
}
<-- Surely there should be a return 1 here?!?
suppress_allocation:
That goes all the way back to:
commit 3ab224be6d69de912ee21302745ea4
5a99274dbc
Author: Hideo Aoki <haoki@...hat.com>
Date: Mon Dec 31 00:11:19 2007 -0800
[NET] CORE: Introducing new memory accounting interface.
But it wasn't used for UDP back then, and I don't think the code path
in question is or was reachable for TCP.
3. A failure in any of this stuff gives the same drop_reason. IMO it
would be really nice if the drop reason were split up into, say,
RCVBUFF_MEMCG, RCVBUF_PROTO_HARDLIMIT, RCVBUFF_PROTO_PRESSURE or
similar.
And maybe there should be a way (a memcg option?) to just turn the
protocol limits off entirely within a memcg. Or to have per-memcg
protocol limits. Or something. A single UDP socket in a different
memcg should not be able to starve the whole system such that even
just two (!) queued UDP datagrams don't fit in a receive queue.
Anyway, SO_RESERVE_MEM looks like it ought to make enqueueing on that
socket faster and more scalable. And exempt from random failures due
to protocol memory limits.
--Andy
Powered by blists - more mailing lists