[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzZt5nRsfCiqGkJxW2b-==AYEVCiz6jHC-FrXKkPF=Qj7w@mail.gmail.com>
Date: Wed, 2 Jun 2021 14:09:15 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Kumar Kartikeya Dwivedi <memxor@...il.com>
Cc: bpf <bpf@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <kafai@...com>,
Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>,
Jamal Hadi Salim <jhs@...atatu.com>,
Vlad Buslov <vladbu@...dia.com>,
Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Joe Stringer <joe@...ium.io>,
Quentin Monnet <quentin@...valent.com>,
Jesper Dangaard Brouer <brouer@...hat.com>,
Toke Høiland-Jørgensen <toke@...hat.com>,
Networking <netdev@...r.kernel.org>
Subject: Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
<memxor@...il.com> wrote:
>
> This is the first RFC version.
>
> This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> introduces fd based ownership for such TC filters. Netlink cannot delete or
> replace such filters, but the bpf_link is severed on indirect destruction of the
> filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> that filters remain attached beyond process lifetime, the usual bpf_link fd
> pinning approach can be used.
>
> The individual patches contain more details and comments, but the overall kernel
> API and libbpf helper mirrors the semantics of the netlink based TC-BPF API
> merged recently. This means that we start by always setting direct action mode,
> protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> options in the future, they can be easily exposed through the bpf_link API in
> the future.
>
> Patch 1 refactors cls_bpf change function to extract two helpers that will be
> reused in bpf_link creation.
>
> Patch 2 exports some bpf_link management functions to modules. This is needed
> because our bpf_link object is tied to the cls_bpf_prog object. Tying it to
> tcf_proto would be weird, because the update path has to replace offloaded bpf
> prog, which happens using internal cls_bpf helpers, and would in general be more
> code to abstract over an operation that is unlikely to be implemented for other
> filter types.
>
> Patch 3 adds the main bpf_link API. A function in cls_api takes care of
> obtaining block reference, creating the filter object, and then calls the
> bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after
> setting up the internal structures. An optimization is made to not keep around
> resources for extended actions, which is explained in a code comment as it wasn't
> immediately obvious.
>
> Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports
> replacing the bpf_prog, we can skip tc filter's change path by reusing the
> filter object but swapping its bpf_prog. This takes care of replacing the
> offloaded prog as well (if that fails, update is aborted). So far however,
> tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter
> would never be modified concurrently. This is no longer true, and to not
> penalize the classify hot path, we also cannot impose serialization around
> its load. Hence the load is changed to READ_ONCE, so that the pointer value is
> always consistent. Due to invocation in a RCU critical section, the lifetime of
> the prog is guaranteed for the duration of the call.
>
> Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning
> function to libbpf.
>
> Patch 7 adds a selftest that exercises all possible problematic interactions
> that I could think of.
>
> Design:
>
> This is where in the object hierarchy our bpf_link object is attached.
>
> ┌─────┐
> │ │
> │ BPF │
> program
> │ │
> └──▲──┘
> ┌───────┐ │
> │ │ ┌──────┴───────┐
> │ mod ├─────────► cls_bpf_prog │
> ┌────────────────┐ │cls_bpf│ └────┬───▲─────┘
> │ tcf_block │ │ │ │ │
> └────────┬───────┘ └───▲───┘ │ │
> │ ┌─────────────┐ │ ┌─▼───┴──┐
> └──────────► tcf_chain │ │ │bpf_link│
> └───────┬─────┘ │ └────────┘
> │ ┌─────────────┐ │
> └──────────► tcf_proto ├────┘
> └─────────────┘
>
> The bpf_link is detached on destruction of the cls_bpf_prog. Doing it this way
> allows us to implement update in a lightweight manner without having to recreate
> a new filter, where we can just replace the BPF prog attached to cls_bpf_prog.
>
> The other way to do it would be to link the bpf_link to tcf_proto, there are
> numerous downsides to this:
>
> 1. All filters have to embed the pointer even though they won't be using it when
> cls_bpf is compiled in.
> 2. This probably won't make sense to be extended to other filter types anyway.
> 3. We aren't able to optimize the update case without adding another bpf_link
> specific update operation to tcf_proto ops.
>
> The downside with tying this to the module is having to export bpf_link
> management functions and introducing a tcf_proto op. Hopefully the cost of
> another operation func pointer is not big enough (as there is only one ops
> struct per module).
>
> This first version is to collect feedback on the approach and get ideas if there
> is a better way to do this.
Bpf_link-based TC API is a long time coming, so it's great to see
someone finally working on this. Thanks!
I briefly skimmed through the patch set, noticed a few generic
bpf_link problems. But I think main feedback will come from Cilium
folks and others that heavily rely on TC APIs. I wonder if there is an
opportunity to simplify the API further given we have a new
opportunity here. I don't think we are constrained to follow legacy TC
API exactly.
The problem is that your patch set was marked as spam by Google, so I
suspect a bunch of folks haven't gotten it. I suggest re-sending it
again but trimming down the CC list, leaving only bpf@...r,
netdev@...r, and BPF maintainers CC'ed directly.
>
> Kumar Kartikeya Dwivedi (7):
> net: sched: refactor cls_bpf creation code
> bpf: export bpf_link functions for modules
> net: sched: add bpf_link API for bpf classifier
> net: sched: add lightweight update path for cls_bpf
> tools: bpf.h: sync with kernel sources
> libbpf: add bpf_link based TC-BPF management API
> libbpf: add selftest for bpf_link based TC-BPF management API
>
> include/linux/bpf_types.h | 3 +
> include/net/pkt_cls.h | 13 +
> include/net/sch_generic.h | 6 +-
> include/uapi/linux/bpf.h | 15 +
> kernel/bpf/syscall.c | 14 +-
> net/sched/cls_api.c | 138 ++++++-
> net/sched/cls_bpf.c | 386 ++++++++++++++++--
> tools/include/uapi/linux/bpf.h | 15 +
> tools/lib/bpf/bpf.c | 5 +
> tools/lib/bpf/bpf.h | 8 +-
> tools/lib/bpf/libbpf.c | 59 ++-
> tools/lib/bpf/libbpf.h | 17 +
> tools/lib/bpf/libbpf.map | 1 +
> tools/lib/bpf/netlink.c | 5 +-
> tools/lib/bpf/netlink.h | 8 +
> .../selftests/bpf/prog_tests/tc_bpf_link.c | 285 +++++++++++++
> 16 files changed, 934 insertions(+), 44 deletions(-)
> create mode 100644 tools/lib/bpf/netlink.h
> create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c
>
> --
> 2.31.1
>
Powered by blists - more mailing lists