Message-ID: <CAAVpQUBmhMzESPMOBzr=e+kORPZi5XCHR7_DgqtijsyKsmPviw@mail.gmail.com>
Date: Thu, 18 Sep 2025 19:43:34 -0700
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Alexei Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>,
Stanislav Fomichev <sdf@...ichev.me>, Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>, Shakeel Butt <shakeel.butt@...ux.dev>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Neal Cardwell <ncardwell@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
Mina Almasry <almasrymina@...gle.com>, Kuniyuki Iwashima <kuni1840@...il.com>, bpf@...r.kernel.org,
netdev@...r.kernel.org
Subject: Re: [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
On Thu, Sep 18, 2025 at 7:28 PM Kuniyuki Iwashima <kuniyu@...gle.com> wrote:
>
> On Thu, Sep 18, 2025 at 6:14 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> >
> > On 9/17/25 6:17 PM, Kuniyuki Iwashima wrote:
> > >>> +
> > >>> +close:
> > >>> +        for (i = 0; i < ARRAY_SIZE(sk); i++)
> > >>> +                close(sk[i]);
> > >>> +
> > >>> +        if (test_case->type == SOCK_DGRAM) {
> > >>> +                /* UDP recv queue is destroyed after RCU grace period.
> > >>> +                 * With one kern_sync_rcu(), memory_allocated[0] of the
> > >>> +                 * isolated case often matches with memory_allocated[1]
> > >>> +                 * of the preceding non-exclusive case.
> > >>> +                 */
> > >> I don't think I understand the double kern_sync_rcu() below.
> > > With one kern_sync_rcu(), when I added bpf_printk() for memory_allocated,
> > > I sometimes saw two consecutive non-zero values, meaning memory_allocated[0]
> > > still sees the previous test case's result (memory_allocated[1]).
> > > ASSERT_LE() succeeds as expected, but somewhat by accident.
> > >
> > > bpf_trace_printk: memory_allocated: 0 <-- non exclusive case
> > > bpf_trace_printk: memory_allocated: 4160
> > > bpf_trace_printk: memory_allocated: 4160 <-- exclusive case's memory_allocated[0]
> > > bpf_trace_printk: memory_allocated: 0
> > > bpf_trace_printk: memory_allocated: 0
> > > bpf_trace_printk: memory_allocated: 0
> > >
> > > One kern_sync_rcu() is enough to kick call_rcu() + sk_destruct(), but it
> > > does not guarantee that the callback completes, so if the queue is too long,
> > > memory_allocated does not go down fast enough.
> > >
> > > But now I don't see the flakiness with NR_SEND 32, and one
> > > kern_sync_rcu() might be enough unless the env is too slow...?
> >
> > Ah, got it. I led you down the wrong path. It needs rcu_barrier() instead.
> >
> > Is recv() enough? Maybe just recv(MSG_DONTWAIT) to consume it, only for the
> > UDP socket? That will slow down the UDP test... only read 1 byte and the
> > remaining can be discarded with MSG_TRUNC?
>
> recv() before close() seems to work with fewer sockets (with the same
> total bytes sent and without kern_sync_rcu()).
>
> And the test time was mostly the same as the "no recv() + 1 kern_sync_rcu()"
> case (2~2.5s on my qemu), so draining seems better.
Oh no, with NR_PAGES 128, "no recv() + 1 kern_sync_rcu()" didn't
show the leftover at all, regardless of whether NR_SOCKETS is 2 or 64.
But this might be only on my setup, and recv() before close()
should theoretically be stable on any setup.
fwiw, I put the latest code here.
https://github.com/q2ven/linux/commits/514_tcp_decouple_memcg
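
The drain part is roughly like this (drain_sk() is just an illustrative
name here and it needs <sys/socket.h>; the branch above has the actual
helper):

static void drain_sk(int fd)
{
        char byte;

        /* Each recv() consumes one whole datagram: MSG_TRUNC lets the
         * 1-byte buffer drop the rest of it, and MSG_DONTWAIT stops the
         * loop once the receive queue is empty.
         */
        while (recv(fd, &byte, 1, MSG_DONTWAIT | MSG_TRUNC) > 0)
                ;
}

Calling it on each socket right before close() should bring memory_allocated
down without waiting for the RCU-deferred queue purge.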
>
> #define NR_PAGES 128
> #define NR_SOCKETS 2
> #define BUF_TOTAL (NR_PAGES * 4096 / (NR_SOCKETS / 2))
> #define BUF_SINGLE 1024
> #define NR_SEND (BUF_TOTAL / BUF_SINGLE)
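
(With NR_SOCKETS == 2 that works out to BUF_TOTAL = 128 * 4096 / 1 = 524288
bytes, i.e. NR_SEND = 512 sends of 1024 bytes; with NR_SOCKETS == 64 it is
BUF_TOTAL = 16384 bytes, i.e. NR_SEND = 16.)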
>
> NR_SOCKETS==64
> test_progs-1380 [008] ...11 2121.731176: bpf_trace_printk: memory_allocated: 0
> test_progs-1380 [008] ...11 2121.737700: bpf_trace_printk: memory_allocated: 576
> test_progs-1380 [008] ...11 2121.743436: bpf_trace_printk: memory_allocated: 64
> test_progs-1380 [008] ...11 2121.749984: bpf_trace_printk: memory_allocated: 64
> test_progs-1380 [008] ...11 2121.755763: bpf_trace_printk: memory_allocated: 64
> test_progs-1380 [008] ...11 2121.762171: bpf_trace_printk: memory_allocated: 64
>
> NR_SOCKETS==2
> test_progs-1424 [009] ...11 2352.990520: bpf_trace_printk: memory_allocated: 0
> test_progs-1424 [009] ...11 2352.997395: bpf_trace_printk: memory_allocated: 514
> test_progs-1424 [009] ...11 2353.001910: bpf_trace_printk: memory_allocated: 2
> test_progs-1424 [009] ...11 2353.008947: bpf_trace_printk: memory_allocated: 2
> test_progs-1424 [009] ...11 2353.012988: bpf_trace_printk: memory_allocated: 2
> test_progs-1424 [009] ...11 2353.019988: bpf_trace_printk: memory_allocated: 0
>
> While NR_PAGES is set to 128, the actual number of allocated pages was around
> 300 for TCP and 500 for UDP. I reduced NR_PAGES to 64; then TCP consumed 160
> pages and UDP consumed 250, but the test time didn't change.
>
> >
> > btw, does the test need 64 sockets? Is it because of the per-socket snd/rcvbuf
> > limitation?
>
> I chose 64 for UDP, which does not tune rcvbuf automatically, but I saw an error
> when I increased the number of send() bytes, so I set SO_RCVBUF anyway, and
> any number of sockets is fine now.
>
> I'll use 2 sockets but will keep the for-loops so that we can change NR_SOCKETS
> easily in case the test gets flaky.
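
The send side stays shaped roughly like this (the sk[] indexing and the error
handling are illustrative only; the actual test is in the branch above), so
only NR_SOCKETS needs to change:

        for (i = 0; i < NR_SOCKETS / 2; i++)
                for (j = 0; j < NR_SEND; j++)
                        if (send(sk[i], buf, BUF_SINGLE, 0) != BUF_SINGLE)
                                goto close;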
>
>
> >
> > Another option is to trace SEC("fexit/__sk_destruct") to ensure all the cleanup
> > is done, but it seems overkill if recv() can do the job.
>
> Agreed, draining before close() would be enough.
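
For the record, the fexit option would look roughly like the below (untested
sketch, the counter name is made up), but as agreed above it's not needed if
draining works:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

int destruct_cnt;

SEC("fexit/__sk_destruct")
int BPF_PROG(count_sk_destruct, struct sock *sk)
{
        /* Count completed __sk_destruct() calls so userspace can wait
         * until the count matches the number of closed sockets.
         */
        __sync_fetch_and_add(&destruct_cnt, 1);
        return 0;
}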