[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <99681548-fa79-0607-d574-db61818cab78@iogearbox.net>
Date: Thu, 25 May 2023 22:01:08 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: Kuniyuki Iwashima <kuniyu@...zon.com>, lmb@...valent.com
Cc: andrii@...nel.org, ast@...nel.org, bpf@...r.kernel.org,
davem@...emloft.net, dsahern@...nel.org, edumazet@...gle.com,
haoluo@...gle.com, joe@...ium.io, joe@...d.net.nz,
john.fastabend@...il.com, jolsa@...nel.org, kafai@...com,
kpsingh@...nel.org, kuba@...nel.org, linux-kernel@...r.kernel.org,
martin.lau@...ux.dev, netdev@...r.kernel.org, pabeni@...hat.com,
sdf@...gle.com, song@...nel.org, willemdebruijn.kernel@...il.com,
yhs@...com
Subject: Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with
bpf_sk_assign
On 5/25/23 7:41 PM, Kuniyuki Iwashima wrote:
> From: Lorenz Bauer <lmb@...valent.com>
> Date: Thu, 25 May 2023 09:19:22 +0100
>> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
>> sockets. This means we can't use the helper to steer traffic to Envoy, which
>> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
>> TPROXY from our setup.
>>
>> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
>> helpers don't execute SK_REUSEPORT programs. Instead, one of the
>> reuseport sockets is selected by hash. This could cause dispatch to the
>> "wrong" socket:
>>
>> sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>> bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>>
>> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
>> helpers unfortunately. In the tc context, L2 headers are at the start
>> of the skb, while SK_REUSEPORT expects L3 headers instead.
>>
>> Instead, we execute the SK_REUSEPORT program when the assigned socket
>> is pulled out of the skb, further up the stack. This creates some
>> trickiness with regards to refcounting as bpf_sk_assign will put both
>> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
>> freed. We can infer that the sk_assigned socket is RCU freed if the
>> reuseport lookup succeeds, but convincing yourself of this fact isn't
>> straight forward. Therefore we defensively check refcounting on the
>> sk_assign sock even though it's probably not required in practice.
>>
>> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
>> Fixes: cf7fbe6 ("bpf: Add socket assign support")
>
> Please use 12 chars of hash.
>
> $ cat ~/.gitconfig
> [core]
> abbrev = 12
> [pretty]
> fixes = Fixes: %h (\"%s\")
>
> $ git show 8e368dc --pretty=fixes | head -n 1
> Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Yeap, not quite sure what happened here but the 12 chars is clear. Will
be fixed up in v2, too, ofc.
>> Co-developed-by: Daniel Borkmann <daniel@...earbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
>> Signed-off-by: Lorenz Bauer <lmb@...valent.com>
>> Cc: Joe Stringer <joe@...ium.io>
>> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
>> ---
>> include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>> include/net/inet_hashtables.h | 27 +++++++++++++++++++++++--
>> include/net/sock.h | 7 +++++--
>> include/uapi/linux/bpf.h | 3 ---
>> net/core/filter.c | 2 --
>> net/ipv4/inet_hashtables.c | 15 +++++++-------
>> net/ipv4/udp.c | 23 +++++++++++++++++++---
>> net/ipv6/inet6_hashtables.c | 19 +++++++++---------
>> net/ipv6/udp.c | 23 +++++++++++++++++++---
>> tools/include/uapi/linux/bpf.h | 3 ---
>> 10 files changed, 119 insertions(+), 39 deletions(-)
>>
[...]
>> @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
>> int iif, int sdif,
>> bool *refcounted)
>> {
>> - struct sock *sk = skb_steal_sock(skb, refcounted);
>> -
>> + bool prefetched;
>> + struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
>> + struct net *net = dev_net(skb_dst(skb)->dev);
>> + const struct ipv6hdr *ip6h = ipv6_hdr(skb);
>
> nit: Reverse Xmas Tree order. Same for other chunks.
It is, the prefetched bool is simply used one line below. I don't think
this is much different than most other code from style pov..
>> +
>> + if (prefetched) {
>> + struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
>> + &ip6h->saddr, sport,
>> + &ip6h->daddr, ntohs(dport));
>> + if (reuse_sk) {
>> + if (reuse_sk != sk) {
>> + if (*refcounted) {
>> + sock_put(sk);
>> + *refcounted = false;
>> + }
>> + if (IS_ERR(reuse_sk))
>> + return NULL;
>> + }
>> + return reuse_sk;
>> + }
>
> Maybe we can add a hepler to avoid this duplication ?
We'll check if it can be made a bit nicer and integrate this into the v2.
Thanks,
Daniel
Powered by blists - more mailing lists