Message-ID: <87zgvirj6g.fsf@toke.dk>
Date: Mon, 21 Jun 2021 23:39:51 +0200
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Daniel Borkmann <daniel@...earbox.net>, bpf@...r.kernel.org,
netdev@...r.kernel.org
Cc: Martin KaFai Lau <kafai@...com>,
Hangbin Liu <liuhangbin@...il.com>,
Jesper Dangaard Brouer <brouer@...hat.com>,
Magnus Karlsson <magnus.karlsson@...il.com>,
"Paul E . McKenney" <paulmck@...nel.org>,
Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH bpf-next v3 03/16] xdp: add proper __rcu annotations to
redirect map entries

Daniel Borkmann <daniel@...earbox.net> writes:
> On 6/17/21 11:27 PM, Toke Høiland-Jørgensen wrote:
>> XDP_REDIRECT works by a three-step process: the bpf_redirect() and
>> bpf_redirect_map() helpers will look up the target of the redirect and store
>> it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
>> Next, when the program returns the XDP_REDIRECT return code, the driver
>> will call xdp_do_redirect() which will use the information thus stored to
>> actually enqueue the frame into a bulk queue structure (that differs
>> slightly by map type, but shares the same principle). Finally, before
>> exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
>> flush all the different bulk queues, thus completing the redirect.
>>
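For illustration, the driver-side pattern looks roughly like this (a
minimal sketch: the example_*() names are hypothetical, while
bpf_prog_run_xdp(), xdp_do_redirect() and xdp_do_flush() are the real
kernel entry points):

	static int example_napi_poll(struct napi_struct *napi, int budget)
	{
		struct bpf_prog *prog = example_get_xdp_prog(napi);
		struct xdp_buff xdp;
		int work_done = 0;

		while (work_done < budget && example_rx_frame(napi, &xdp)) {
			/* step 1 happens inside the program, via the
			 * bpf_redirect()/bpf_redirect_map() helpers
			 */
			u32 act = bpf_prog_run_xdp(prog, &xdp);

			switch (act) {
			case XDP_REDIRECT:
				/* step 2: enqueue into the map type-specific
				 * bulk queue
				 */
				if (xdp_do_redirect(napi->dev, &xdp, prog) < 0)
					example_drop_frame(&xdp);
				break;
			/* ... other verdicts elided ... */
			}
			work_done++;
		}

		/* step 3: flush all bulk queues before leaving the poll loop */
		xdp_do_flush();

		return work_done;
	}
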
>> Pointers to the map entries will be kept around for this whole sequence of
>> steps, protected by RCU. However, there is no top-level rcu_read_lock() in
>> the core code; instead drivers add their own rcu_read_lock() around the XDP
>> portions of the code, but somewhat inconsistently as Martin discovered[0].
>> Even so, things still work because everything happens inside a single NAPI
>> poll sequence, which means it's between a pair of calls to
>> local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
>> document this intention by using rcu_dereference_check() with
>> rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
>> lockdep to verify that everything is done correctly.
>>
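Concretely, the resulting lookup pattern is sketched below (field names
follow devmap; rcu_dereference_check() OR's the extra condition with
rcu_read_lock_held() internally, so this accepts both RCU flavours):

	struct bpf_dtab_netdev *obj;

	/* Valid both under an explicit rcu_read_lock() and from a NAPI
	 * poll loop, where BH is disabled:
	 */
	obj = rcu_dereference_check(dtab->netdev_map[idx],
				    rcu_read_lock_bh_held());
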
>> This patch does just that: we add an __rcu annotation to the map entry
>> pointers and remove the various comments explaining the NAPI poll assurance
>> strewn through devmap.c in favour of a longer explanation in filter.c. The
>> goal is to have one coherent documentation of the entire flow, and rely on
>> the RCU annotations as a "standard" way of communicating the flow in the
>> map code (which can additionally be understood by sparse and lockdep).
>>
>> The conversion itself is fairly straightforward: READ_ONCE() becomes
>> rcu_dereference_check(), WRITE_ONCE() becomes rcu_assign_pointer(), and
>> xchg() and cmpxchg() get wrapped in the proper constructs to cast the
>> pointer back and forth between __rcu and __kernel address space (for the
>> benefit of sparse). The one complication is that xskmap has a few
>> constructions where double pointers are passed back and forth; these
>> simply all gain __rcu annotations, and only the final
>> reference/dereference of the innermost pointer is changed.
>>
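In sketch form (again borrowing the devmap field names, and assuming the
unrcu_pointer()/RCU_INITIALIZER() helpers from rcupdate.h for the casts
in and out of the __rcu address space):

	/* lookup: READ_ONCE() becomes rcu_dereference_check() */
	obj = rcu_dereference_check(dtab->netdev_map[idx],
				    rcu_read_lock_bh_held());

	/* update: WRITE_ONCE() becomes rcu_assign_pointer() */
	rcu_assign_pointer(dtab->netdev_map[idx], dev);

	/* exchange: xchg()/cmpxchg() get casts so sparse can follow the
	 * pointer out of (and back into) the __rcu address space
	 */
	odev = unrcu_pointer(xchg(&dtab->netdev_map[idx], NULL));
	old = unrcu_pointer(cmpxchg(map_entry, NULL, RCU_INITIALIZER(xs)));
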
>> With this, everything can be run through sparse without eliciting
>> complaints, and lockdep can verify correctness even without the use of
>> rcu_read_lock() in the drivers. Subsequent patches will clean these up from
>> the drivers.
>>
>> [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
>> [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/
>>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@...hat.com>
>> ---
>>   include/net/xdp_sock.h |  2 +-
>>   kernel/bpf/cpumap.c    | 13 +++++++----
>>   kernel/bpf/devmap.c    | 49 ++++++++++++++++++------------------------
>>   net/core/filter.c      | 28 ++++++++++++++++++++++++
>>   net/xdp/xsk.c          |  4 ++--
>>   net/xdp/xsk.h          |  4 ++--
>>   net/xdp/xskmap.c       | 29 ++++++++++++++-----------
>>   7 files changed, 80 insertions(+), 49 deletions(-)
> [...]
>> __dev_map_entry_free);
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index caa88955562e..0b7db5c70385 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -3922,6 +3922,34 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>> .arg2_type = ARG_ANYTHING,
>> };
>>
>> +/* XDP_REDIRECT works by a three-step process, implemented in the functions
>> + * below:
>> + *
>> + * 1. The bpf_redirect() and bpf_redirect_map() helpers will look up the target
>> + * of the redirect and store it (along with some other metadata) in a per-CPU
>> + * struct bpf_redirect_info.
>> + *
>> + * 2. When the program returns the XDP_REDIRECT return code, the driver will
>> + * call xdp_do_redirect() which will use the information in struct
>> + * bpf_redirect_info to actually enqueue the frame into a map type-specific
>> + * bulk queue structure.
>> + *
>> + * 3. Before exiting its NAPI poll loop, the driver will call xdp_do_flush(),
>> + * which will flush all the different bulk queues, thus completing the
>> + * redirect.
>> + *
>> + * Pointers to the map entries will be kept around for this whole sequence of
>> + * steps, protected by RCU. However, there is no top-level rcu_read_lock() in
>> + * the core code; instead, the RCU protection relies on everything happening
>> + * inside a single NAPI poll sequence, which means it's between a pair of calls
>> + * to local_bh_disable()/local_bh_enable().
>> + *
>> + * The map entries are marked as __rcu and the map code makes sure to
>> + * dereference those pointers with rcu_dereference_check() in a way that works
>> + * for both sections that hold an rcu_read_lock() and sections that are
>> + * called from NAPI without a separate rcu_read_lock(). The code below does not
>> + * use RCU annotations, but relies on those in the map code.
>
> One more follow-up question related to tc BPF: given we do use rcu_read_lock_bh()
> in case of sch_handle_egress(), could we also remove the rcu_read_lock() pair
> from cls_bpf_classify() then?

I believe so, yeah. Patch 2 in this series should even make lockdep stop
complaining about it :)

I can add a patch removing the rcu_read_lock() from cls_bpf in the next
version.
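
Roughly like this, I think (completely untested sketch, and the exact
context may differ):

	static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
				    struct tcf_result *res)
	{
		struct cls_bpf_head *head = rcu_dereference_bh(tp->root);
		struct cls_bpf_prog *prog;
		int ret = -1;

		/* No explicit rcu_read_lock() needed: both the ingress and
		 * egress hooks run with BH disabled, which patch 2 teaches
		 * the map code to accept.
		 */
		list_for_each_entry_rcu(prog, &head->plist, link) {
			/* ... run prog->filter and map the verdict to ret ... */
		}

		return ret;
	}
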
> It would also be great if this scenario in general could be placed
> under Documentation/RCU/whatisRCU.rst as an example, so we could
> refer to the official doc on this, too, if Paul is good with this.

I'll take a look and see if I can find a way to fit it in there...
> Could you also update the RCU comment in bpf_prog_run_xdp()? Or
> alternatively move all the below driver comments in there as a single
> location?
>
> /* This code is invoked within a single NAPI poll cycle and thus under
> * local_bh_disable(), which provides the needed RCU protection.
> */

Sure, can do. And yeah, I do agree that moving the comment in there
makes more sense than scattering it over all the drivers, even if that
means I have to go back and edit all the drivers again :P
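
I.e., roughly this for bpf_prog_run_xdp() in include/linux/filter.h
(sketch; comment wording borrowed from the driver comment you quoted):

	static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
						    struct xdp_buff *xdp)
	{
		/* Driver XDP hooks are invoked within a single NAPI poll
		 * cycle and thus under local_bh_disable(), which provides
		 * the needed RCU protection for accessing map entries.
		 */
		return BPF_PROG_RUN(prog, xdp);
	}
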
-Toke