[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2bac5d14-6927-4915-b1a8-e6301603e663@linux.dev>
Date: Tue, 26 Aug 2025 13:06:40 -0700
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Kuniyuki Iwashima <kuniyu@...gle.com>
Cc: almasrymina@...gle.com, andrii@...nel.org, ast@...nel.org,
bpf@...r.kernel.org, daniel@...earbox.net, davem@...emloft.net,
edumazet@...gle.com, hannes@...xchg.org, john.fastabend@...il.com,
kuba@...nel.org, kuni1840@...il.com, mhocko@...nel.org,
ncardwell@...gle.com, netdev@...r.kernel.org, pabeni@...hat.com,
roman.gushchin@...ux.dev, sdf@...ichev.me, shakeel.butt@...ux.dev,
willemb@...gle.com
Subject: Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in
__inet_accept().
On 8/25/25 5:23 PM, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <martin.lau@...ux.dev>
> Date: Mon, 25 Aug 2025 16:14:35 -0700
>> On 8/25/25 11:14 AM, Kuniyuki Iwashima wrote:
>>> On Mon, Aug 25, 2025 at 10:57 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>>>>
>>>> On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote:
>>>>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>>>>> index ae83ecda3983..ab613abdfaa4 100644
>>>>> --- a/net/ipv4/af_inet.c
>>>>> +++ b/net/ipv4/af_inet.c
>>>>> @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
>>>>> kmem_cache_charge(newsk, gfp);
>>>>> }
>>>>>
>>>>> + BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
>>>>> +
>>>>> if (mem_cgroup_sk_enabled(newsk)) {
>>>>> int amt;
>>>>>
>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>> index 233de8677382..80df246d4741 100644
>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>> @@ -1133,6 +1133,7 @@ enum bpf_attach_type {
>>>>> BPF_NETKIT_PEER,
>>>>> BPF_TRACE_KPROBE_SESSION,
>>>>> BPF_TRACE_UPROBE_SESSION,
>>>>> + BPF_CGROUP_INET_SOCK_ACCEPT,
>>>>
>>>> Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be
>>>> inherited from the listener?
>>>
>>> Since e876ecc67db80 and d752a4986532c , we defer memcg allocation to
>>> accept() because the child socket could be created during irq context with
>>> unrelated cgroup. This had another reason; if the listener was created in the
>>> root cgroup and passed to a process under cgroup, child sockets would never
>>> have sk_memcg if sk_memcg was inherited.
>>>
>>> So, the child's memcg is not always the same one with the listener's, and
>>> we cannot rely on the listener's sk_memcg.
>>
>> I didn't mean to inherit the entire sk_memcg pointer. I meant to only inherit
>> the SK_BPF_MEMCG_SOCK_ISOLATED bit.
>
> I didn't want to let the flag remain alone without accept() (error-prone
> but works because we always check mem_cgroup_from_sk() before the bit)
> and wanted to check mem_cgroup_sk_enabled() in setsockopt(), but if we
> don't care, it will be doable with other hooks, PASSIVE_ESTABLISHED_CB
> or bpf_iter etc.
I think this could be a surprise to the user. imo, this is the implementation
details that a bit of a pointer is used for the setsockopt purpose and a right
one for perf reason. It does not necessary need to affect what the user can
expect from setsockopt in listener. From the user pov, what the user can usually
expect from setsockopt in the listener and gets copied to child? Beside, the
listener and the accept-or on different processes is one of the use case but not
the only use case.
>
> ---8<---
> diff --git a/net/core/filter.c b/net/core/filter.c
> index a78356682442..9ef458fe706e 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5269,7 +5269,7 @@ static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
>
> static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt)
> {
> - if (!mem_cgroup_sk_enabled(sk))
> + if (!sk_has_account(sk))
> return -EOPNOTSUPP;
>
> if (getopt) {
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index e92dfca0a0ff..efae15d04306 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -760,7 +760,10 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
> if (mem_cgroup_sockets_enabled &&
> (!IS_ENABLED(CONFIG_IP_SCTP) ||
> sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
> + unsigned short flags = mem_cgroup_sk_get_flags(newsk);
> +
> mem_cgroup_sk_alloc(newsk);
> + mem_cgroup_sk_set_flags(newsk, flags);
> kmem_cache_charge(newsk, gfp);
> }
>
> ---8<---
>
>
>>
>> If it can only be done at accept, there is already an existing
>> SEC("lsm_cgroup/socket_accept") hook. Take a look at
>> tools/testing/selftests/bpf/progs/lsm_cgroup.c. The lsm socket_accept doesn't
>> have access to the "newsock->sk" but it should have access to the "sock->sk", do
>> bpf_setsockopt and then inherit by the newsock->sk (?)
>>
>> There are already quite enough cgroup-sk style hooks. I would prefer not to add
>> another cgroup attach_type and instead see if some of the existing ones can be
>> reused. There is also SEC("lsm/sock_graft").
>
> We need to do fixup below, so lsm_cgroup/socket_accept won't work
> if we keep the code in __inet_accept(). We can move this after
> lsm/sock_graft within __inet_accept().
>
> if (mem_cgroup_sk_isolated(newsk))
> sk_memory_allocated_sub(newsk, amt);
If I read it correctly, lsm_cgroup/sock_graft should work but need to do the
above sk_memory_allocated_sub() after the sock_graft and ...
>
> But then, we cannot distinguish it with other hooks (lock_sock() &&
> sk->sk_socket != NULL), and finally the fixup must be done dynamically
> in setsockopt().
... need a way to disallow this SK_BPF_MEMCG_SOCK_ISOLATED bit being changed
once the socket fd is visible to the user. The current approach is to use the
observation in the owned_by_user and sk->sk_socket in the create and accept
hook. [ unrelated, I am not sure about the owned_by_user check considering
sol_socket_sockopt can be called from bh ].
If it is needed, there are other ways to stop the SK_BPF_MEMCG_SOCK_ISOLATED
being changed again once the fd is visible to user. e.g. there are still bits
left in the sk_memcg pointer to freeze it at runtime. If doing it statically
(i.e. at prog load time), it can probably return a different setsockopt_proto
that can understand the SK_BPF_MEMCG_FLAGS.
Powered by blists - more mailing lists