Message-ID: <CAAVpQUCoMkkGrPfqiL4C_3i5EG_THaYb0gT+qF7jyxreBJTSZw@mail.gmail.com>
Date: Wed, 17 Sep 2025 18:17:13 -0700
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Alexei Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>, 
	Stanislav Fomichev <sdf@...ichev.me>, Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>, 
	Roman Gushchin <roman.gushchin@...ux.dev>, Shakeel Butt <shakeel.butt@...ux.dev>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Neal Cardwell <ncardwell@...gle.com>, Willem de Bruijn <willemb@...gle.com>, 
	Mina Almasry <almasrymina@...gle.com>, Kuniyuki Iwashima <kuni1840@...il.com>, bpf@...r.kernel.org, 
	netdev@...r.kernel.org
Subject: Re: [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.

On Wed, Sep 17, 2025 at 4:38 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>
> On 9/17/25 12:14 PM, Kuniyuki Iwashima wrote:
> > The test does the following for IPv4/IPv6 x TCP/UDP sockets
> > with/without SK_MEMCG_EXCLUSIVE, which can be turned on by
> > net.core.memcg_exclusive or bpf_setsockopt(SK_BPF_MEMCG_EXCLUSIVE).
> >
> >    1. Create socket pairs
> >    2. Send a bunch of data that requires more than 1024 pages
> >    3. Read memory_allocated from sk->sk_prot->memory_allocated and
> >       sk->sk_prot->memory_per_cpu_fw_alloc
> >    4. Check if unread data is charged to memory_allocated
> >
> > If SK_MEMCG_EXCLUSIVE is set, memory_allocated should not be
> > changed, but we allow a small error (up to 10 pages) in case
> > other processes on the host use some amount of TCP/UDP memory.
> >
> > The amount of allocated pages is buffered in the per-cpu variable
> > {tcp,udp}_memory_per_cpu_fw_alloc up to +/- net.core.mem_pcpu_rsv
> > before being reported to {tcp,udp}_memory_allocated.
> >
> > At 3., memory_allocated is calculated from the two variables twice,
> > at fentry and fexit of the socket create function, to check if the
> > per-cpu value is drained during the calculation.  In that case,
> > 3. is retried.
> >
> > We use kern_sync_rcu() for UDP because the UDP recv queue is
> > destroyed after an RCU grace period.
> >
> > The test takes ~2s on QEMU (64 CPUs) w/ KVM but takes 6s w/o KVM.
> >
> >    # time ./test_progs -t sk_memcg
> >    #370/1   sk_memcg/TCP  :OK
> >    #370/2   sk_memcg/UDP  :OK
> >    #370/3   sk_memcg/TCPv6:OK
> >    #370/4   sk_memcg/UDPv6:OK
> >    #370     sk_memcg:OK
> >    Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> >
> >    real       0m1.623s
> >    user       0m0.165s
> >    sys        0m0.366s
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@...gle.com>
> > ---
> > v7:
> >    * Add test for sysctl
> >
> > v6:
> >    * Trace sk_prot->memory_allocated + sk_prot->memory_per_cpu_fw_alloc
> >
> > v5:
> >    * Use kern_sync_rcu()
> >    * Double NR_SEND to 128
> >
> > v4:
> >    * Only use inet_create() hook
> >    * Test bpf_getsockopt()
> >    * Add serial_ prefix
> >    * Reduce sleep() and the amount of sent data
> > ---
> >   .../selftests/bpf/prog_tests/sk_memcg.c       | 261 ++++++++++++++++++
> >   tools/testing/selftests/bpf/progs/sk_memcg.c  | 146 ++++++++++
> >   2 files changed, 407 insertions(+)
> >   create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/sk_memcg.c b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
> > new file mode 100644
> > index 000000000000..777fb81e9365
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
> > @@ -0,0 +1,261 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright 2025 Google LLC */
> > +
> > +#include <test_progs.h>
> > +#include "sk_memcg.skel.h"
> > +#include "network_helpers.h"
> > +
> > +#define NR_SOCKETS   64
> > +#define NR_SEND              128
> > +#define BUF_SINGLE   1024
> > +#define BUF_TOTAL    (BUF_SINGLE * NR_SEND)
> > +
> > +struct test_case {
> > +     char name[8];
> > +     int family;
> > +     int type;
> > +     int (*create_sockets)(struct test_case *test_case, int sk[], int len);
> > +     long (*get_memory_allocated)(struct test_case *test_case, struct sk_memcg *skel);
> > +};
> > +
> > +static int tcp_create_sockets(struct test_case *test_case, int sk[], int len)
> > +{
> > +     int server, i;
> > +
> > +     server = start_server(test_case->family, test_case->type, NULL, 0, 0);
> > +     ASSERT_GE(server, 0, "start_server_str");
> > +
> > +     for (i = 0; i < len / 2; i++) {
> > +             sk[i * 2] = connect_to_fd(server, 0);
> > +             if (!ASSERT_GE(sk[i * 2], 0, "connect_to_fd"))
> > +                     return sk[i * 2];
> > +
> > +             sk[i * 2 + 1] = accept(server, NULL, NULL);
> > +             if (!ASSERT_GE(sk[i * 2 + 1], 0, "accept"))
> > +                     return sk[i * 2 + 1];
> > +     }
> > +
> > +     close(server);
> > +
> > +     return 0;
> > +}
> > +
> > +static int udp_create_sockets(struct test_case *test_case, int sk[], int len)
> > +{
> > +     int i, err, rcvbuf = BUF_TOTAL;
> > +
> > +     for (i = 0; i < len / 2; i++) {
>
> nit. How about "for (i = 0; i < len; i += 2) {" once here instead of "i * 2"
> below. Same for the tcp_create_sockets() above.

Will do.
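
Something like this, I suppose (untested sketch of the restructured
loop; same pattern for udp_create_sockets()):

	for (i = 0; i < len; i += 2) {
		sk[i] = connect_to_fd(server, 0);
		if (!ASSERT_GE(sk[i], 0, "connect_to_fd"))
			return sk[i];

		sk[i + 1] = accept(server, NULL, NULL);
		if (!ASSERT_GE(sk[i + 1], 0, "accept"))
			return sk[i + 1];
	}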


>
> > +             sk[i * 2] = start_server(test_case->family, test_case->type, NULL, 0, 0);
> > +             if (!ASSERT_GE(sk[i * 2], 0, "start_server"))
> > +                     return sk[i * 2];
> > +
> > +             sk[i * 2 + 1] = connect_to_fd(sk[i * 2], 0);
> > +             if (!ASSERT_GE(sk[i * 2 + 1], 0, "connect_to_fd"))
> > +                     return sk[i * 2 + 1];
> > +
> > +             err = connect_fd_to_fd(sk[i * 2], sk[i * 2 + 1], 0);
> > +             if (!ASSERT_EQ(err, 0, "connect_fd_to_fd"))
> > +                     return err;
> > +
> > +             err = setsockopt(sk[i * 2], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
> > +             if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
> > +                     return err;
> > +
> > +             err = setsockopt(sk[i * 2 + 1], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
> > +             if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
> > +                     return err;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static long get_memory_allocated(struct test_case *test_case,
> > +                              bool *activated, bool *stable,
> > +                              long *memory_allocated)
> > +{
> > +     *stable = false;
> > +
> > +     do {
> > +             *activated = true;
> > +
> > +             /* AF_INET and AF_INET6 share the same memory_allocated.
> > +              * tcp_init_sock() is called by AF_INET and AF_INET6,
> > +              * but udp_lib_init_sock() is inline.
> > +              */
> > +             socket(AF_INET, test_case->type, 0);
>
> fd is leaked.

Will close().
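
i.e. something like:

	int fd = socket(AF_INET, test_case->type, 0);

	if (fd >= 0)
		close(fd);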


>
> > +     } while (!*stable);
>
> This cannot be allowed to loop forever. The test needs to assume the machine is
> relatively network-quiet anyway (so serial_). Things can still change after the
> stable check also. I think having a way (the fentry in progs/sk_memcg.c) to
> account for the percpu fw alloc is good enough, and this should help if there
> is some light background traffic that suddenly flushes the hidden +255 percpu
> counter to the global one while another percpu counter still has a -254, for
> example.

Okay, I'll remove fexit progs.
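
Then the userspace helper should roughly collapse to a single read,
e.g. (untested sketch; assumes the stable/retry machinery goes away
together with the fexit prog):

static long get_memory_allocated(struct test_case *test_case,
				 bool *activated, long *memory_allocated)
{
	int fd;

	*activated = true;

	/* The fentry prog snapshots memory_allocated once per
	 * activation when the socket is created.
	 */
	fd = socket(AF_INET, test_case->type, 0);
	if (fd >= 0)
		close(fd);

	return *memory_allocated;
}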


>
> > +
> > +     return *memory_allocated;
> > +}
> > +
> > +static long tcp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
> > +{
> > +     return get_memory_allocated(test_case,
> > +                                 &skel->bss->tcp_activated,
> > +                                 &skel->bss->tcp_stable,
> > +                                 &skel->bss->tcp_memory_allocated);
> > +}
> > +
> > +static long udp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
> > +{
> > +     return get_memory_allocated(test_case,
> > +                                 &skel->bss->udp_activated,
> > +                                 &skel->bss->udp_stable,
> > +                                 &skel->bss->udp_memory_allocated);
> > +}
> > +
> > +static int check_exclusive(struct test_case *test_case,
> > +                        struct sk_memcg *skel, bool exclusive)
> > +{
> > +     char buf[BUF_SINGLE] = {};
> > +     long memory_allocated[2];
> > +     int sk[NR_SOCKETS] = {};
> > +     int err, i, j;
> > +
> > +     err = test_case->create_sockets(test_case, sk, ARRAY_SIZE(sk));
> > +     if (err)
> > +             goto close;
> > +
> > +     memory_allocated[0] = test_case->get_memory_allocated(test_case, skel);
> > +
> > +     /* allocate pages >= 1024 */
> > +     for (i = 0; i < ARRAY_SIZE(sk); i++) {
> > +             for (j = 0; j < NR_SEND; j++) {
> > +                     int bytes = send(sk[i], buf, sizeof(buf), 0);
> > +
> > +                     /* Avoid too noisy logs when something failed. */
> > +                     if (bytes != sizeof(buf)) {
> > +                             ASSERT_EQ(bytes, sizeof(buf), "send");
> > +                             if (bytes < 0) {
> > +                                     err = bytes;
> > +                                     goto close;
> > +                             }
> > +                     }
> > +             }
> > +     }
> > +
> > +     memory_allocated[1] = test_case->get_memory_allocated(test_case, skel);
> > +
> > +     if (exclusive)
> > +             ASSERT_LE(memory_allocated[1], memory_allocated[0] + 10, "exclusive");
> > +     else
> > +             ASSERT_GT(memory_allocated[1], memory_allocated[0] + 1024, "not exclusive");
>
> The test is taking >10s in my environment. Although it has kasan and other dbg
> turned on, my environment is not a slow one tbh. The WATCHDOG > 10s warning is
> hit pretty often. The exclusive case is expecting +10. Maybe we just need to
> check +128 for non-exclusive, which should be enough to contrast with the
> exclusive case? With +128, NR_SEND 32 is more than enough?

I think I was too conservative, and the number should be
fine with fentry progs.
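
So, tentatively (sketch with your suggested numbers; I'll verify the
thresholds actually hold before posting):

#define NR_SEND		32

	...
	if (exclusive)
		ASSERT_LE(memory_allocated[1], memory_allocated[0] + 10, "exclusive");
	else
		ASSERT_GT(memory_allocated[1], memory_allocated[0] + 128, "not exclusive");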


>
> > +
> > +close:
> > +     for (i = 0; i < ARRAY_SIZE(sk); i++)
> > +             close(sk[i]);
> > +
> > +     if (test_case->type == SOCK_DGRAM) {
> > +             /* UDP recv queue is destroyed after RCU grace period.
> > +              * With one kern_sync_rcu(), memory_allocated[0] of the
> > +              * exclusive case often matches memory_allocated[1]
> > +              * of the preceding non-exclusive case.
> > +              */
>
> I don't think I understand the double kern_sync_rcu() below.

With one kern_sync_rcu(), when I added bpf_printk() for memory_allocated,
I sometimes saw two consecutive non-zero values, meaning memory_allocated[0]
still sees the previous test case's result (memory_allocated[1]).
ASSERT_LE() succeeds as expected, but somewhat unintentionally.

bpf_trace_printk: memory_allocated: 0    <-- non-exclusive case
bpf_trace_printk: memory_allocated: 4160
bpf_trace_printk: memory_allocated: 4160 <-- exclusive case's memory_allocated[0]
bpf_trace_printk: memory_allocated: 0
bpf_trace_printk: memory_allocated: 0
bpf_trace_printk: memory_allocated: 0

One kern_sync_rcu() is enough to kick call_rcu() + sk_destruct() but
does not guarantee that they complete, so if the recv queue was too
long, memory_allocated does not go down fast enough.

But now I don't see the flakiness with NR_SEND 32, and one
kern_sync_rcu() might be enough unless the env is too slow...?
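
If it still turns out to be flaky on a slow env, one hypothetical
fallback would be to poll memory_allocated back toward a saved
baseline instead of hardcoding the number of grace periods, roughly:

	int i;

	for (i = 0; i < 16; i++) {
		kern_sync_rcu();
		/* baseline: pre-test value saved by the caller */
		if (test_case->get_memory_allocated(test_case, skel) <=
		    baseline + 10)
			break;
	}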



>
> > +             kern_sync_rcu();
> > +             kern_sync_rcu();
> > +     }
> > +
> > +     return err;
> > +}
> > +
> > +void run_test(struct test_case *test_case)
>
> static

Will add.


>
> > +{
> > +     struct nstoken *nstoken;
> > +     struct sk_memcg *skel;
> > +     int cgroup, err;
> > +
> > +     skel = sk_memcg__open_and_load();
> > +     if (!ASSERT_OK_PTR(skel, "open_and_load"))
> > +             return;
> > +
> > +     skel->bss->nr_cpus = libbpf_num_possible_cpus();
> > +
> > +     err = sk_memcg__attach(skel);
> > +     if (!ASSERT_OK(err, "attach"))
> > +             goto destroy_skel;
> > +
> > +     cgroup = test__join_cgroup("/sk_memcg");
> > +     if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
> > +             goto destroy_skel;
> > +
> > +     err = make_netns("sk_memcg");
> > +     if (!ASSERT_EQ(err, 0, "make_netns"))
> > +             goto close_cgroup;
> > +
> > +     nstoken = open_netns("sk_memcg");
> > +     if (!ASSERT_OK_PTR(nstoken, "open_netns"))
> > +             goto remove_netns;
> > +
> > +     err = check_exclusive(test_case, skel, false);
> > +     if (!ASSERT_EQ(err, 0, "test_exclusive(false)"))
> > +             goto close_netns;
> > +
> > +     err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "1");
> > +     if (!ASSERT_EQ(err, 0, "write_sysctl(1)"))
> > +             goto close_netns;
> > +
> > +     err = check_exclusive(test_case, skel, true);
> > +     if (!ASSERT_EQ(err, 0, "test_exclusive(true by sysctl)"))
> > +             goto close_netns;
> > +
> > +     err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "0");
> > +     if (!ASSERT_EQ(err, 0, "write_sysctl(0)"))
> > +             goto close_netns;
> > +
> > +     skel->links.sock_create = bpf_program__attach_cgroup(skel->progs.sock_create, cgroup);
> > +     if (!ASSERT_OK_PTR(skel->links.sock_create, "attach_cgroup(sock_create)"))
> > +             goto close_netns;
> > +
> > +     err = check_exclusive(test_case, skel, true);
> > +     ASSERT_EQ(err, 0, "test_exclusive(true by bpf)");
> > +
> > +close_netns:
> > +     close_netns(nstoken);
> > +remove_netns:
> > +     remove_netns("sk_memcg");
> > +close_cgroup:
> > +     close(cgroup);
> > +destroy_skel:
> > +     sk_memcg__destroy(skel);
> > +}
> > +
> > +struct test_case test_cases[] = {
> > +     {
> > +             .name = "TCP  ",
> > +             .family = AF_INET,
> > +             .type = SOCK_STREAM,
> > +             .create_sockets = tcp_create_sockets,
> > +             .get_memory_allocated = tcp_get_memory_allocated,
> > +     },
> > +     {
> > +             .name = "UDP  ",
> > +             .family = AF_INET,
> > +             .type = SOCK_DGRAM,
> > +             .create_sockets = udp_create_sockets,
> > +             .get_memory_allocated = udp_get_memory_allocated,
> > +     },
> > +     {
> > +             .name = "TCPv6",
> > +             .family = AF_INET6,
> > +             .type = SOCK_STREAM,
> > +             .create_sockets = tcp_create_sockets,
> > +             .get_memory_allocated = tcp_get_memory_allocated,
> > +     },
> > +     {
> > +             .name = "UDPv6",
> > +             .family = AF_INET6,
> > +             .type = SOCK_DGRAM,
> > +             .create_sockets = udp_create_sockets,
> > +             .get_memory_allocated = udp_get_memory_allocated,
> > +     },
> > +};
> > +
> > +void serial_test_sk_memcg(void)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
> > +             test__start_subtest(test_cases[i].name);
>
> This is not doing anything without an "if".

Ah, will move run_test() inside the if.
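
i.e.:

	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
		if (test__start_subtest(test_cases[i].name))
			run_test(&test_cases[i]);
	}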

>
> > +             run_test(&test_cases[i]);
> > +     }
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/sk_memcg.c b/tools/testing/selftests/bpf/progs/sk_memcg.c
> > new file mode 100644
> > index 000000000000..6b1a928a0c90
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/sk_memcg.c
> > @@ -0,0 +1,146 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright 2025 Google LLC */
> > +
> > +#include "bpf_tracing_net.h"
> > +#include <bpf/bpf_helpers.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include <errno.h>
> > +
> > +extern int tcp_memory_per_cpu_fw_alloc __ksym;
> > +extern int udp_memory_per_cpu_fw_alloc __ksym;
> > +
> > +int nr_cpus;
> > +bool tcp_activated, tcp_stable, udp_activated, udp_stable;
> > +long tcp_memory_allocated, udp_memory_allocated;
> > +static struct sock *tcp_sk_tracing, *udp_sk_tracing;
> > +
> > +struct sk_prot {
> > +     long *memory_allocated;
> > +     int *memory_per_cpu_fw_alloc;
> > +};
> > +
> > +static int drain_memory_per_cpu_fw_alloc(__u32 i, struct sk_prot *sk_prot_ctx)
> > +{
> > +     int *memory_per_cpu_fw_alloc;
> > +
> > +     memory_per_cpu_fw_alloc = bpf_per_cpu_ptr(sk_prot_ctx->memory_per_cpu_fw_alloc, i);
> > +     if (memory_per_cpu_fw_alloc)
> > +             *sk_prot_ctx->memory_allocated += *memory_per_cpu_fw_alloc;
> > +
> > +     return 0;
> > +}
> > +
> > +static long get_memory_allocated(struct sock *_sk, int *memory_per_cpu_fw_alloc)
> > +{
> > +     struct sock *sk = bpf_core_cast(_sk, struct sock);
> > +     struct sk_prot sk_prot_ctx;
> > +     long memory_allocated;
> > +
> > +     /* net_aligned_data.{tcp,udp}_memory_allocated was not available. */
> > +     memory_allocated = sk->__sk_common.skc_prot->memory_allocated->counter;
> > +
> > +     sk_prot_ctx.memory_allocated = &memory_allocated;
> > +     sk_prot_ctx.memory_per_cpu_fw_alloc = memory_per_cpu_fw_alloc;
> > +
> > +     bpf_loop(nr_cpus, drain_memory_per_cpu_fw_alloc, &sk_prot_ctx, 0);
> > +
> > +     return memory_allocated;
> > +}
> > +
> > +static void fentry_init_sock(struct sock *sk, struct sock **sk_tracing,
> > +                          long *memory_allocated, int *memory_per_cpu_fw_alloc,
> > +                          bool *activated)
> > +{
> > +     if (!*activated)
> > +             return;
> > +
> > +     if (__sync_val_compare_and_swap(sk_tracing, NULL, sk))
> > +             return;
> > +
> > +     *activated = false;
> > +     *memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
> > +}
> > +
> > +static void fexit_init_sock(struct sock *sk, struct sock **sk_tracing,
> > +                         long *memory_allocated, int *memory_per_cpu_fw_alloc,
> > +                         bool *stable)
> > +{
> > +     long new_memory_allocated;
> > +
> > +     if (sk != *sk_tracing)
> > +             return;
> > +
> > +     new_memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
> > +     if (new_memory_allocated == *memory_allocated)
> > +             *stable = true;
>
> I am not sure that help. The total memory_allocated can still change after this.
> I would just grab the total in fentry once and then move on without confirming
> in fexit.

Will remove fexit stuff.
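
Then fentry_init_sock() shrinks to something like this (sketch; the
sk_tracing compare-and-swap might also become unnecessary once there
is no fexit to pair with):

static void fentry_init_sock(struct sock *sk, long *memory_allocated,
			     int *memory_per_cpu_fw_alloc, bool *activated)
{
	if (!*activated)
		return;

	*activated = false;
	*memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
}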

Thanks!
