netdev - Re: [PATCH bpf-next v3 03/16] xdp: add proper __rcu annotations to redirect map entries

Message-ID: <2f11d71d-0298-4177-6ac6-4483adf36ed9@iogearbox.net>
Date:   Tue, 22 Jun 2021 10:50:36 +0200
From:   Daniel Borkmann <daniel@...earbox.net>
To:     Toke Høiland-Jørgensen <toke@...hat.com>,
        bpf@...r.kernel.org, netdev@...r.kernel.org
Cc:     Martin KaFai Lau <kafai@...com>,
        Hangbin Liu <liuhangbin@...il.com>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        Magnus Karlsson <magnus.karlsson@...il.com>,
        "Paul E . McKenney" <paulmck@...nel.org>,
        Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH bpf-next v3 03/16] xdp: add proper __rcu annotations to
 redirect map entries

On 6/22/21 12:35 AM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@...earbox.net> writes:
>> On 6/21/21 11:39 PM, Toke Høiland-Jørgensen wrote:
>>> Daniel Borkmann <daniel@...earbox.net> writes:
>>>> On 6/17/21 11:27 PM, Toke Høiland-Jørgensen wrote:
>>>>> XDP_REDIRECT works by a three-step process: the bpf_redirect() and
>>>>> bpf_redirect_map() helpers will look up the target of the redirect and store
>>>>> it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
>>>>> Next, when the program returns the XDP_REDIRECT return code, the driver
>>>>> will call xdp_do_redirect() which will use the information thus stored to
>>>>> actually enqueue the frame into a bulk queue structure (that differs
>>>>> slightly by map type, but shares the same principle). Finally, before
>>>>> exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
>>>>> flush all the different bulk queues, thus completing the redirect.
>>>>>
>>>>> Pointers to the map entries will be kept around for this whole sequence of
>>>>> steps, protected by RCU. However, there is no top-level rcu_read_lock() in
>>>>> the core code; instead drivers add their own rcu_read_lock() around the XDP
>>>>> portions of the code, but somewhat inconsistently as Martin discovered[0].
>>>>> However, things still work because everything happens inside a single NAPI
>>>>> poll sequence, which means it's between a pair of calls to
>>>>> local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
>>>>> document this intention by using rcu_dereference_check() with
>>>>> rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
>>>>> lockdep to verify that everything is done correctly.
>>>>>
>>>>> This patch does just that: we add an __rcu annotation to the map entry
>>>>> pointers and remove the various comments explaining the NAPI poll assurance
>>>>> strewn through devmap.c in favour of a longer explanation in filter.c. The
>>>>> goal is to have one coherent documentation of the entire flow, and rely on
>>>>> the RCU annotations as a "standard" way of communicating the flow in the
>>>>> map code (which can additionally be understood by sparse and lockdep).
>>>>>
>>>>> The RCU annotation replacements result in a fairly straight-forward
>>>>> replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE()
>>>>> becomes rcu_assign_pointer() and xchg() and cmpxchg() get wrapped in the
>>>>> proper constructs to cast the pointer back and forth between __rcu and
>>>>> __kernel address space (for the benefit of sparse). The one complication is
>>>>> that xskmap has a few constructions where double-pointers are passed back
>>>>> and forth; these simply all gain __rcu annotations, and only the final
>>>>> reference/dereference to the inner-most pointer gets changed.
>>>>>
>>>>> With this, everything can be run through sparse without eliciting
>>>>> complaints, and lockdep can verify correctness even without the use of
>>>>> rcu_read_lock() in the drivers. Subsequent patches will clean these up from
>>>>> the drivers.
>>>>>
>>>>> [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
>>>>> [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/
>>>>>
>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@...hat.com>
>>>>> ---
>>>>>     include/net/xdp_sock.h |  2 +-
>>>>>     kernel/bpf/cpumap.c    | 13 +++++++----
>>>>>     kernel/bpf/devmap.c    | 49 ++++++++++++++++++------------------------
>>>>>     net/core/filter.c      | 28 ++++++++++++++++++++++++
>>>>>     net/xdp/xsk.c          |  4 ++--
>>>>>     net/xdp/xsk.h          |  4 ++--
>>>>>     net/xdp/xskmap.c       | 29 ++++++++++++++-----------
>>>>>     7 files changed, 80 insertions(+), 49 deletions(-)
>>>> [...]
>>>>>     						 __dev_map_entry_free);
>>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>>> index caa88955562e..0b7db5c70385 100644
>>>>> --- a/net/core/filter.c
>>>>> +++ b/net/core/filter.c
>>>>> @@ -3922,6 +3922,34 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>>>>>     	.arg2_type	= ARG_ANYTHING,
>>>>>     };
>>>>>     
>>>>> +/* XDP_REDIRECT works by a three-step process, implemented in the functions
>>>>> + * below:
>>>>> + *
>>>>> + * 1. The bpf_redirect() and bpf_redirect_map() helpers will look up the target
>>>>> + *    of the redirect and store it (along with some other metadata) in a per-CPU
>>>>> + *    struct bpf_redirect_info.
>>>>> + *
>>>>> + * 2. When the program returns the XDP_REDIRECT return code, the driver will
>>>>> + *    call xdp_do_redirect() which will use the information in struct
>>>>> + *    bpf_redirect_info to actually enqueue the frame into a map type-specific
>>>>> + *    bulk queue structure.
>>>>> + *
>>>>> + * 3. Before exiting its NAPI poll loop, the driver will call xdp_do_flush(),
>>>>> + *    which will flush all the different bulk queues, thus completing the
>>>>> + *    redirect.
>>>>> + *
>>>>> + * Pointers to the map entries will be kept around for this whole sequence of
>>>>> + * steps, protected by RCU. However, there is no top-level rcu_read_lock() in
>>>>> + * the core code; instead, the RCU protection relies on everything happening
>>>>> + * inside a single NAPI poll sequence, which means it's between a pair of calls
>>>>> + * to local_bh_disable()/local_bh_enable().
>>>>> + *
>>>>> + * The map entries are marked as __rcu and the map code makes sure to
>>>>> + * dereference those pointers with rcu_dereference_check() in a way that works
>>>>> + * for both sections that hold an rcu_read_lock() and sections that are
>>>>> + * called from NAPI without a separate rcu_read_lock(). The code below does not
>>>>> + * use RCU annotations, but relies on those in the map code.
>>>>
>>>> One more follow-up question related to tc BPF: given we do use rcu_read_lock_bh()
>>>> in case of sch_handle_egress(), could we also remove the rcu_read_lock() pair
>>>> from cls_bpf_classify() then?
>>>
>>> I believe so, yeah. Patch 2 in this series should even make lockdep stop
>>> complaining about it :)
>>
>> Btw, I was wondering whether we should just get rid of all the WARN_ON_ONCE()s
>> from those map helpers given in most situations these are not triggered anyway
>> due to retpoline avoidance where the verifier rewrites the calls to jump to the map
>> backend implementation directly. One alternative could be to have an extension
>> to the bpf prologue generation under CONFIG_DEBUG_LOCK_ALLOC and call the lockdep
>> checks from there, but it's probably not worth the effort. (In the trampoline
>> case we have those __bpf_prog_enter()/__bpf_prog_enter_sleepable() where the
>> latter in particular has asserts like might_fault(), fwiw.)
> 
> I agree that it's probably overkill to amend the prologue. No strong
> opinion on whether removing the checks entirely is a good idea; I guess
> they at least serve as documentation even if they're not actually called
> that often?

Ack, that's okay with me, and if we find a better solution, we can always change it
later on.

Thanks,
Daniel
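
For reference, the three-step XDP_REDIRECT flow quoted above, and the idea
of checking "explicit rcu_read_lock() or BH disabled" at dereference time,
can be sketched in plain userspace C. This is only an illustration under
stated assumptions, not the kernel implementation: every identifier below
(deref_check(), sketch_redirect_map(), sketch_do_redirect(),
sketch_do_flush(), sketch_napi_poll(), the bh_disabled and rcu_read_locked
flags) is hypothetical, and the two flags merely stand in for the context
that lockdep tracks in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch only: all names are hypothetical and merely mimic the
 * flow described in the quoted commit message. */
#define MAP_SIZE  4
#define BULK_SIZE 8

struct target { int ifindex; int delivered; };

/* Stand-ins for the context that lockdep would check in the kernel. */
static int bh_disabled;     /* models local_bh_disable()/local_bh_enable() */
static int rcu_read_locked; /* models rcu_read_lock()/rcu_read_unlock() */

/* Models rcu_dereference_check(p, rcu_read_lock_bh_held()): the load is
 * accepted under an explicit read lock OR inside a BH-disabled section. */
static struct target *deref_check(struct target *const *slot)
{
	assert(rcu_read_locked || bh_disabled);
	return *slot;
}

static struct target *map[MAP_SIZE];              /* redirect map entries */
static struct { struct target *tgt; } redir_info; /* "per-CPU" redirect info */
static struct { struct target *q[BULK_SIZE]; int n; } bulk;

/* Step 1: the helper looks up the target and stores it. */
static int sketch_redirect_map(unsigned int key)
{
	struct target *t = key < MAP_SIZE ? deref_check(&map[key]) : NULL;

	if (!t)
		return -1;
	redir_info.tgt = t;
	return 0;
}

/* Step 2: the driver enqueues the frame into the bulk queue. */
static int sketch_do_redirect(void)
{
	if (!redir_info.tgt || bulk.n == BULK_SIZE)
		return -1;
	bulk.q[bulk.n++] = redir_info.tgt;
	redir_info.tgt = NULL;
	return 0;
}

/* Step 3: flush the bulk queue before leaving the poll loop. */
static int sketch_do_flush(void)
{
	int i, sent = bulk.n;

	for (i = 0; i < bulk.n; i++)
		bulk.q[i]->delivered++;
	bulk.n = 0;
	return sent;
}

/* One "NAPI poll": all three steps run inside one BH-disabled section,
 * so no separate read lock is taken around the map lookups. */
static int sketch_napi_poll(const unsigned int *keys, int nkeys)
{
	int i, sent;

	bh_disabled = 1;
	for (i = 0; i < nkeys; i++)
		if (sketch_redirect_map(keys[i]) == 0)
			sketch_do_redirect();
	sent = sketch_do_flush();
	bh_disabled = 0;
	return sent;
}
```

Calling sketch_redirect_map() outside sketch_napi_poll() without setting
either flag trips the assert in deref_check(), which is roughly the kind of
misuse the annotations are meant to let lockdep catch.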