netdev - Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250826002410.2608702-1-kuniyu@google.com>
Date: Tue, 26 Aug 2025 00:23:39 +0000
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: martin.lau@...ux.dev
Cc: almasrymina@...gle.com, andrii@...nel.org, ast@...nel.org, 
	bpf@...r.kernel.org, daniel@...earbox.net, davem@...emloft.net, 
	edumazet@...gle.com, hannes@...xchg.org, john.fastabend@...il.com, 
	kuba@...nel.org, kuni1840@...il.com, kuniyu@...gle.com, mhocko@...nel.org, 
	ncardwell@...gle.com, netdev@...r.kernel.org, pabeni@...hat.com, 
	roman.gushchin@...ux.dev, sdf@...ichev.me, shakeel.butt@...ux.dev, 
	willemb@...gle.com
Subject: Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().

From: Martin KaFai Lau <martin.lau@...ux.dev>
Date: Mon, 25 Aug 2025 16:14:35 -0700
> On 8/25/25 11:14 AM, Kuniyuki Iwashima wrote:
> > On Mon, Aug 25, 2025 at 10:57 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> >>
> >> On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote:
> >>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> >>> index ae83ecda3983..ab613abdfaa4 100644
> >>> --- a/net/ipv4/af_inet.c
> >>> +++ b/net/ipv4/af_inet.c
> >>> @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
> >>>                kmem_cache_charge(newsk, gfp);
> >>>        }
> >>>
> >>> +     BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
> >>> +
> >>>        if (mem_cgroup_sk_enabled(newsk)) {
> >>>                int amt;
> >>>
> >>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> >>> index 233de8677382..80df246d4741 100644
> >>> --- a/tools/include/uapi/linux/bpf.h
> >>> +++ b/tools/include/uapi/linux/bpf.h
> >>> @@ -1133,6 +1133,7 @@ enum bpf_attach_type {
> >>>        BPF_NETKIT_PEER,
> >>>        BPF_TRACE_KPROBE_SESSION,
> >>>        BPF_TRACE_UPROBE_SESSION,
> >>> +     BPF_CGROUP_INET_SOCK_ACCEPT,
> >>
> >> Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be
> >> inherited from the listener?
> > 
> > Since e876ecc67db80 and d752a4986532c , we defer memcg allocation to
> > accept() because the child socket could be created during irq context with
> > unrelated cgroup.  This had another reason; if the listener was created in the
> > root cgroup and passed to a process under cgroup, child sockets would never
> > have sk_memcg if sk_memcg was inherited.
> > 
> > So, the child's memcg is not always the same one with the listener's, and
> > we cannot rely on the listener's sk_memcg.
> 
> I didn't mean to inherit the entire sk_memcg pointer. I meant to only inherit 
> the SK_BPF_MEMCG_SOCK_ISOLATED bit.

I didn't want to let the flag remain alone without accept() (error-prone
but works because we always check mem_cgroup_from_sk() before the bit)
and wanted to check mem_cgroup_sk_enabled() in setsockopt(), but if we
don't care, it will be doable with other hooks, PASSIVE_ESTABLISHED_CB
or bpf_iter etc.

---8<---
diff --git a/net/core/filter.c b/net/core/filter.c
index a78356682442..9ef458fe706e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5269,7 +5269,7 @@ static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
 
 static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt)
 {
-	if (!mem_cgroup_sk_enabled(sk))
+	if (!sk_has_account(sk))
 		return -EOPNOTSUPP;
 
 	if (getopt) {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e92dfca0a0ff..efae15d04306 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -760,7 +760,10 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
 	if (mem_cgroup_sockets_enabled &&
 	    (!IS_ENABLED(CONFIG_IP_SCTP) ||
 	     sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
+		unsigned short flags = mem_cgroup_sk_get_flags(newsk);
+
 		mem_cgroup_sk_alloc(newsk);
+		mem_cgroup_sk_set_flags(newsk, flags);
 		kmem_cache_charge(newsk, gfp);
 	}
 
---8<---


> 
> If it can only be done at accept, there is already an existing 
> SEC("lsm_cgroup/socket_accept") hook. Take a look at 
> tools/testing/selftests/bpf/progs/lsm_cgroup.c. The lsm socket_accept doesn't 
> have access to the "newsock->sk" but it should have access to the "sock->sk", do 
> bpf_setsockopt and then inherit by the newsock->sk (?)
> 
> There are already quite enough cgroup-sk style hooks. I would prefer not to add 
> another cgroup attach_type and instead see if some of the existing ones can be 
> reused. There is also SEC("lsm/sock_graft").

We need to do fixup below, so lsm_cgroup/socket_accept won't work
if we keep the code in __inet_accept().  We can move this after
lsm/sock_graft within __inet_accept().

if (mem_cgroup_sk_isolated(newsk))
	sk_memory_allocated_sub(newsk, amt);

But then, we cannot distinguish it with other hooks (lock_sock() &&
sk->sk_socket != NULL), and finally the fixup must be done dynamically
in setsockopt().

But I didn't want to place that in setsockopt() to avoid caring about
weird scenario that introduces unnecessary complexity.

e.g. bpf_setsockopt() fixes up once and a new data comes in before
accept() and we need to handle another fixup in accept() or close(),
in the latter case, we need to check the bit only even if sk_memcg
itself is NULL.