[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a71b8941-1ffc-37fc-6676-d3b4cf44f149@iogearbox.net>
Date: Thu, 25 May 2023 21:51:14 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: Eric Dumazet <edumazet@...gle.com>, Lorenz Bauer <lmb@...valent.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski
<kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Alexei Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Song Liu <song@...nel.org>,
Yonghong Song <yhs@...com>, John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...gle.com>,
Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
David Ahern <dsahern@...nel.org>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Joe Stringer <joe@...d.net.nz>, Joe Stringer <joe@...ium.io>,
Martin KaFai Lau <kafai@...com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, bpf@...r.kernel.org
Subject: Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with
bpf_sk_assign
On 5/25/23 3:24 PM, Eric Dumazet wrote:
> On Thu, May 25, 2023 at 10:19 AM Lorenz Bauer <lmb@...valent.com> wrote:
>>
>> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
>> sockets. This means we can't use the helper to steer traffic to Envoy, which
>> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
>> TPROXY from our setup.
>>
>> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
>> helpers don't execute SK_REUSEPORT programs. Instead, one of the
>> reuseport sockets is selected by hash. This could cause dispatch to the
>> "wrong" socket:
>>
>> sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>> bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>>
>> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
>> helpers unfortunately. In the tc context, L2 headers are at the start
>> of the skb, while SK_REUSEPORT expects L3 headers instead.
>>
>> Instead, we execute the SK_REUSEPORT program when the assigned socket
>> is pulled out of the skb, further up the stack. This creates some
>> trickiness with regards to refcounting as bpf_sk_assign will put both
>> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
>> freed. We can infer that the sk_assigned socket is RCU freed if the
>> reuseport lookup succeeds, but convincing yourself of this fact isn't
>> straight forward. Therefore we defensively check refcounting on the
>> sk_assign sock even though it's probably not required in practice.
>>
>> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
>> Fixes: cf7fbe6 ("bpf: Add socket assign support")
>> Co-developed-by: Daniel Borkmann <daniel@...earbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
>> Signed-off-by: Lorenz Bauer <lmb@...valent.com>
>> Cc: Joe Stringer <joe@...ium.io>
>> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
>> ---
>> include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>> include/net/inet_hashtables.h | 27 +++++++++++++++++++++++--
>> include/net/sock.h | 7 +++++--
>> include/uapi/linux/bpf.h | 3 ---
>> net/core/filter.c | 2 --
>> net/ipv4/inet_hashtables.c | 15 +++++++-------
>> net/ipv4/udp.c | 23 +++++++++++++++++++---
>> net/ipv6/inet6_hashtables.c | 19 +++++++++---------
>> net/ipv6/udp.c | 23 +++++++++++++++++++---
>> tools/include/uapi/linux/bpf.h | 3 ---
>> 10 files changed, 119 insertions(+), 39 deletions(-)
>
>
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index e7391bf310a7..920131e4a65d 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
>> return score;
>> }
>>
>> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>> - struct sk_buff *skb, int doff,
>> - __be32 saddr, __be16 sport,
>> - __be32 daddr, unsigned short hnum)
>> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
>> + struct sk_buff *skb, int doff,
>> + __be32 saddr, __be16 sport,
>> + __be32 daddr, unsigned short hnum)
>> {
>> struct sock *reuse_sk = NULL;
>> u32 phash;
>> @@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>> }
>> return reuse_sk;
>> }
>> +EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
>>
>> /*
>> * Here are some nice properties to exploit here. The BSD API
>> @@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
>> sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>> score = compute_score(sk, net, hnum, daddr, dif, sdif);
>> if (score > hiscore) {
>> - result = lookup_reuseport(net, sk, skb, doff,
>> - saddr, sport, daddr, hnum);
>> + result = inet_lookup_reuseport(net, sk, skb, doff,
>> + saddr, sport, daddr, hnum);
>> if (result)
>> return result;
>>
>
> Please split in a series.
>
> First a patch renaming lookup_reuseport() to inet_lookup_reuseport()
> and inet6_lookup_reuseport()
> (cleanup, no change in behavior)
>
> This would ease review and future bug hunting quite a bit.
Makes sense and should reduce the churn on the actual change.
I think Lorenz is planning to flush out a v2 next week with this split.
Thanks,
Daniel
Powered by blists - more mailing lists