Message-ID: <CAAVpQUBmhMzESPMOBzr=e+kORPZi5XCHR7_DgqtijsyKsmPviw@mail.gmail.com>
Date: Thu, 18 Sep 2025 19:43:34 -0700
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Alexei Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>,
Stanislav Fomichev <sdf@...ichev.me>, Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>, Shakeel Butt <shakeel.butt@...ux.dev>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Neal Cardwell <ncardwell@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
Mina Almasry <almasrymina@...gle.com>, Kuniyuki Iwashima <kuni1840@...il.com>, bpf@...r.kernel.org,
netdev@...r.kernel.org
Subject: Re: [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
On Thu, Sep 18, 2025 at 7:28 PM Kuniyuki Iwashima <kuniyu@...gle.com> wrote:
>
> On Thu, Sep 18, 2025 at 6:14 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> >
> > On 9/17/25 6:17 PM, Kuniyuki Iwashima wrote:
> > >>> +
> > >>> +close:
> > >>> +        for (i = 0; i < ARRAY_SIZE(sk); i++)
> > >>> +                close(sk[i]);
> > >>> +
> > >>> +        if (test_case->type == SOCK_DGRAM) {
> > >>> +                /* UDP recv queue is destroyed after RCU grace period.
> > >>> +                 * With one kern_sync_rcu(), memory_allocated[0] of the
> > >>> +                 * isolated case often matches with memory_allocated[1]
> > >>> +                 * of the preceding non-exclusive case.
> > >>> +                 */
> > >> I don't think I understand the double kern_sync_rcu() below.
> > > With one kern_sync_rcu(), when I added bpf_printk() for memory_allocated,
> > > I sometimes saw two consecutive non-zero values, meaning memory_allocated[0]
> > > still sees the previous test case's result (memory_allocated[1]).
> > > ASSERT_LE() succeeds as expected, but somewhat by accident.
> > >
> > > bpf_trace_printk: memory_allocated: 0 <-- non exclusive case
> > > bpf_trace_printk: memory_allocated: 4160
> > > bpf_trace_printk: memory_allocated: 4160 <-- exclusive case's memory_allocated[0]
> > > bpf_trace_printk: memory_allocated: 0
> > > bpf_trace_printk: memory_allocated: 0
> > > bpf_trace_printk: memory_allocated: 0
> > >
> > > One kern_sync_rcu() is enough to kick call_rcu() + sk_destruct(), but it
> > > does not guarantee that the callback completes, so if the queue is too long,
> > > memory_allocated does not go down fast enough.
> > >
> > > But now I don't see the flakiness with NR_SEND 32, and one
> > > kern_sync_rcu() might be enough unless the env is too slow...?
> >
> > Ah, got it. I led you down the wrong path. It needs rcu_barrier() instead.
> >
> > Is recv() enough? Maybe just recv(MSG_DONTWAIT) to consume it, only for the
> > UDP socket? That will slow down the UDP test... only read 1 byte and the
> > remaining can be discarded with MSG_TRUNC?
>
> recv() before close() seems to work with fewer sockets (with the same
> total bytes sent and without kern_sync_rcu()).
>
> And the test time was mostly the same as the "no recv() + 1 kern_sync_rcu()"
> case (2~2.5s on my qemu), so draining seems better.
Oh no, with NR_PAGES 128, "no recv() + 1 kern_sync_rcu()" didn't
show the leftover at all, regardless of whether NR_SOCKETS is 2 or 64.
But this might be only on my setup, and recv() before close()
should theoretically be stable on any setup.
fwiw, I put the latest code here.
https://github.com/q2ven/linux/commits/514_tcp_decouple_memcg
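
The drain part is roughly like this (drain_sk() is just an illustrative
name here and it needs <sys/socket.h>; the branch above has the actual
helper):

static void drain_sk(int fd)
{
        char byte;

        /* Each recv() consumes one whole datagram: MSG_TRUNC lets the
         * 1-byte buffer drop the rest of it, and MSG_DONTWAIT stops the
         * loop once the receive queue is empty.
         */
        while (recv(fd, &byte, 1, MSG_DONTWAIT | MSG_TRUNC) > 0)
                ;
}

Calling it on each socket right before close() should bring memory_allocated
down without waiting for the RCU-deferred queue purge.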
>
> #define NR_PAGES 128
> #define NR_SOCKETS 2
> #define BUF_TOTAL (NR_PAGES * 4096 / (NR_SOCKETS / 2))
> #define BUF_SINGLE 1024
> #define NR_SEND (BUF_TOTAL / BUF_SINGLE)
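
(With NR_SOCKETS == 2 that works out to BUF_TOTAL = 128 * 4096 / 1 = 524288
bytes, i.e. NR_SEND = 512 sends of 1024 bytes; with NR_SOCKETS == 64 it is
BUF_TOTAL = 16384 bytes, i.e. NR_SEND = 16.)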
>
> NR_SOCKETS==64
> test_progs-1380 [008] ...11 2121.731176: bpf_trace_printk: memory_allocated: 0
> test_progs-1380 [008] ...11 2121.737700: bpf_trace_printk: memory_allocated: 576
> test_progs-1380 [008] ...11 2121.743436: bpf_trace_printk: memory_allocated: 64
> test_progs-1380 [008] ...11 2121.749984: bpf_trace_printk: memory_allocated: 64
> test_progs-1380 [008] ...11 2121.755763: bpf_trace_printk: memory_allocated: 64
> test_progs-1380 [008] ...11 2121.762171: bpf_trace_printk: memory_allocated: 64
>
> NR_SOCKETS==2
> test_progs-1424 [009] ...11 2352.990520: bpf_trace_printk: memory_allocated: 0
> test_progs-1424 [009] ...11 2352.997395: bpf_trace_printk: memory_allocated: 514
> test_progs-1424 [009] ...11 2353.001910: bpf_trace_printk: memory_allocated: 2
> test_progs-1424 [009] ...11 2353.008947: bpf_trace_printk: memory_allocated: 2
> test_progs-1424 [009] ...11 2353.012988: bpf_trace_printk: memory_allocated: 2
> test_progs-1424 [009] ...11 2353.019988: bpf_trace_printk: memory_allocated: 0
>
> While NR_PAGES is set to 128, the actual number of allocated pages was around
> 300 for TCP and 500 for UDP. I reduced NR_PAGES to 64; then TCP consumed 160
> pages and UDP consumed 250, but the test time didn't change.
>
> >
> > btw, does the test need 64 sockets? Is it because of the per-socket snd/rcvbuf
> > limitation?
>
> I chose 64 for UDP, which does not tune rcvbuf automatically, but I saw an error
> when I increased the number of send() bytes, so I set SO_RCVBUF anyway, and
> any number of sockets is fine now.
>
> I'll use 2 sockets but will keep the for-loops so that we can change NR_SOCKETS
> easily in case the test gets flaky.
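
The send side stays shaped roughly like this (the sk[] indexing and the error
handling are illustrative only; the actual test is in the branch above), so
only NR_SOCKETS needs to change:

        for (i = 0; i < NR_SOCKETS / 2; i++)
                for (j = 0; j < NR_SEND; j++)
                        if (send(sk[i], buf, BUF_SINGLE, 0) != BUF_SINGLE)
                                goto close;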
>
>
> >
> > Another option is to trace SEC("fexit/__sk_destruct") to ensure all the cleanup
> > is done, but it seems overkill if recv() can do the job.
>
> Agreed, draining before close() would be enough.
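
For the record, the fexit option would look roughly like the below (untested
sketch, the counter name is made up), but as agreed above it's not needed if
draining works:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

int destruct_cnt;

SEC("fexit/__sk_destruct")
int BPF_PROG(count_sk_destruct, struct sock *sk)
{
        /* Count completed __sk_destruct() calls so userspace can wait
         * until the count matches the number of closed sockets.
         */
        __sync_fetch_and_add(&destruct_cnt, 1);
        return 0;
}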