Message-ID: <875zh230bh.fsf@cloudflare.com>
Date: Thu, 23 Jan 2020 11:59:46 +0100
From: Jakub Sitnicki <jakub@...udflare.com>
To: Martin Lau <kafai@...com>
Cc: "bpf\@vger.kernel.org" <bpf@...r.kernel.org>,
"netdev\@vger.kernel.org" <netdev@...r.kernel.org>,
"kernel-team\@cloudflare.com" <kernel-team@...udflare.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
Daniel Borkmann <daniel@...earbox.net>,
"John Fastabend" <john.fastabend@...il.com>,
Lorenz Bauer <lmb@...udflare.com>
Subject: Re: [PATCH bpf-next v3 10/12] net: Generate reuseport group ID on group creation
On Wed, Jan 22, 2020 at 11:53 PM CET, Martin Lau wrote:
> On Wed, Jan 22, 2020 at 02:05:47PM +0100, Jakub Sitnicki wrote:
>> Commit 736b46027eb4 ("net: Add ID (if needed) to sock_reuseport and expose
>> reuseport_lock") has introduced lazy generation of reuseport group IDs that
>> survive group resize.
>>
>> By comparing the identifier, we check that the BPF reuseport program is
>> not trying to select a socket from a BPF map that belongs to a different
>> reuseport group than the one the packet is destined for.
>>
>> Because SOCKARRAY used to be the only BPF map type that could be used
>> with reuseport BPF, it was possible to delay the generation of the
>> reuseport group ID until a socket from the group was inserted into a
>> BPF map for the first time.
>>
>> Now that SOCKMAP can be used with reuseport BPF, we have two options:
>> either generate the reuseport ID on map update, as SOCKARRAY does, or
>> allocate an ID up front when the reuseport group gets created.
>>
>> This patch takes the latter approach to keep SOCKMAP free of calls into
>> reuseport code. This streamlines access to reuseport_id, as its lifetime
>> now matches that of the reuseport object.
>>
>> The cost of this simplification, however, is that we allocate reuseport
>> IDs for all SO_REUSEPORT users, even those that don't use SOCKARRAY in
>> their setups. With the way identifiers are currently generated, we can
>> have at most S32_MAX reuseport groups, which hopefully is sufficient.
> Not sure if it would be a concern. I think it is good as is.
> For TCP, that would mean a billion different ip:port listening socks
> in inet_hashinfo.
>
> If it came to that, another idea is to use a 64-bit reuseport_id, which
> practically won't wrap around. It could use the very first sk->sk_cookie
> as the reuseport_id. All the ida logic would also go away, at the
> expense of +4 bytes.
Thanks for the idea. I'll add it to the patch description.
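
Just so the idea is captured here too, a rough sketch of how I picture
it. Untested, and reuseport_init_id() is a name I made up; the cookie
itself would presumably come from sock_gen_cookie():

  /* Hypothetical sketch: reuseport_id becomes a u64 in struct
   * sock_reuseport and is seeded from the first socket's cookie when
   * the group gets created. Socket cookies are unique 64-bit values
   * that practically never wrap, so both the ida allocator and the
   * S32_MAX limit on the number of groups go away.
   */
  static void reuseport_init_id(struct sock *sk,
                                struct sock_reuseport *reuse)
  {
          reuse->reuseport_id = sock_gen_cookie(sk);
  }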
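
The consumer side wouldn't have to change much either. If memory
serves, the identifier check lives in the sk_select_reuseport() helper
in net/core/filter.c and boils down to something like this (simplified,
with the more specific error codes elided):

  /* Reject the socket picked from the BPF map if it belongs to a
   * different reuseport group than the one the packet arrived at.
   */
  reuse = rcu_dereference(selected_sk->sk_reuseport_cb);
  if (!reuse)
          return -ENOENT; /* selected_sk was unhashed in the meantime */
  if (reuse->reuseport_id != reuse_kern->reuseport_id)
          return -ENOENT; /* socket from a different reuseport group */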
>
>>
>> Another change is that we now always call into SOCKARRAY logic to unlink
>> the socket from the map when unhashing or closing the socket. Previously we
>> did it only when at least one socket from the group was in a BPF map.
>>
>> It is worth noting that this doesn't conflict with SOCKMAP tear-down in
>> case a socket is in a SOCKMAP and belongs to a reuseport group. SOCKMAP
>> tear-down happens first:
>>
>> prot->unhash
>> `- tcp_bpf_unhash
>> |- tcp_bpf_remove
>> | `- while (sk_psock_link_pop(psock))
>> | `- sk_psock_unlink
>> | `- sock_map_delete_from_link
>> | `- __sock_map_delete
>> | `- sock_map_unref
>> | `- sk_psock_put
>> | `- sk_psock_drop
>> | `- rcu_assign_sk_user_data(sk, NULL)
>> `- inet_unhash
>> `- reuseport_detach_sock
>> `- bpf_sk_reuseport_detach
>> `- WRITE_ONCE(sk->sk_user_data, NULL)
> Thanks for the details.
>
> [ ... ]
>
>> @@ -200,12 +189,10 @@ void reuseport_detach_sock(struct sock *sk)
>> reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
>> lockdep_is_held(&reuseport_lock));
>>
>> - /* At least one of the sk in this reuseport group is added to
>> - * a bpf map. Notify the bpf side. The bpf map logic will
>> - * remove the sk if it is indeed added to a bpf map.
>> + /* Notify the bpf side. The sk may be added to bpf map. The
>> + * bpf map logic will remove the sk from the map if indeed.
> s/indeed/needed/ ?
>
> I think it would be good to have a few words here saying that this is
> needed by sockarray, but not necessary for sockmap, which has its own
> ->unhash to remove itself from the map.
Yeah, I didn't do it justice. Will expand the comment.
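
Something along these lines, perhaps (wording tentative):

  /* Notify the bpf side. The sk may be added to a sockarray map. If
   * so, the sockarray logic will remove the sk from the map. This is
   * not needed for sockmap, which removes the sk from the map via its
   * own ->unhash callback before we get here.
   */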
[...]