netdev - Re: [PATCH v3 bpf-next/net 3/5] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_SOCK

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <9919daa2-b6e0-4d5c-b349-39b055eaaed2@linux.dev>
Date: Wed, 27 Aug 2025 16:48:07 -0700
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Kuniyuki Iwashima <kuniyu@...gle.com>
Cc: Alexei Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>,
 John Fastabend <john.fastabend@...il.com>,
 Stanislav Fomichev <sdf@...ichev.me>, Johannes Weiner <hannes@...xchg.org>,
 Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
 Shakeel Butt <shakeel.butt@...ux.dev>, "David S. Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Neal Cardwell <ncardwell@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
 Mina Almasry <almasrymina@...gle.com>, Kuniyuki Iwashima
 <kuni1840@...il.com>, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH v3 bpf-next/net 3/5] bpf: Introduce SK_BPF_MEMCG_FLAGS and
 SK_BPF_MEMCG_SOCK_ISOLATED.

On 8/26/25 11:38 AM, Kuniyuki Iwashima wrote:
> The main targets are BPF_CGROUP_INET_SOCK_CREATE and
> BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB as demonstrated in the selftest.
> 
> Note that we do not support modifying the flag once sk->sk_memcg is set
> especially because UDP charges memory under sk->sk_receive_queue.lock
> instead of lock_sock().
> 

[ ... ]

> diff --git a/include/net/sock.h b/include/net/sock.h
> index 63a6a48afb48..d41a2f8f8b30 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2596,10 +2596,39 @@ static inline gfp_t gfp_memcg_charge(void)
>   	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
>   }
>   
> +#define SK_BPF_MEMCG_FLAG_MASK	(SK_BPF_MEMCG_FLAG_MAX - 1)
> +#define SK_BPF_MEMCG_PTR_MASK	~SK_BPF_MEMCG_FLAG_MASK
> +
>   #ifdef CONFIG_MEMCG
> +static inline void mem_cgroup_sk_set_flags(struct sock *sk, unsigned short flags)
> +{
> +	unsigned long val = (unsigned long)sk->sk_memcg;
> +
> +	val |= flags;
> +	sk->sk_memcg = (struct mem_cgroup *)val;
> +}
> +

[ ... ]

> +static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt)
> +{
> +	if (!sk_has_account(sk))
> +		return -EOPNOTSUPP;
> +
> +	if (getopt) {
> +		*optval = mem_cgroup_sk_get_flags(sk);
> +		return 0;
> +	}
> +
> +	if (sock_owned_by_user_nocheck(sk) && mem_cgroup_from_sk(sk))

I still don't see how this can stop some of the bh bpf prog to set/clear the bit 
after mem_cgroup_sk_alloc() is done. For example, in BPF_SOCK_OPS_RETRANS_CB, 
bh_lock_sock() is held but not owned by user, so the bpf prog can continue to 
change the SK_BPF_MEMCG_FLAGS. What am I missing?

Passing some more args to __bpf_setsockopt() for caller context purpose is quite 
ugly which I think you also mentioned/tried earlier. May be it can check "if 
(in_softirq() && mem_cgroup_from_sk(sk)) return -EBUSY" but looks quite tricky 
together with the sock_owned_by_user_nocheck().

It seems this sk_memcg bit change is only needed for sock_ops hook and create 
hook? sock_ops already has its only func_proto bpf_sock_ops_setsockopt(). 
bpf_sock_ops_setsockopt can directly call sk_bpf_set_get_memcg_flags() but limit 
it to the BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB. Is it also safe to allow it for 
BPF_SOCK_OPS_TCP_LISTEN_CB? For the create hook, create a new func proto which 
can directly call sk_bpf_set_get_memcg_flags(). wdyt?

> +		return -EBUSY;
> +
> +	if (*optval <= 0 || *optval >= SK_BPF_MEMCG_FLAG_MAX)

It seems the bit cannot be cleared (the *optval <= 0 check). There could be 
multiple bpf progs running in a cgroup which want to clear this bit. e.g. the 
parent cgroup's prog may want to ensure this bit is not set.