Message-ID: <20210602214513.eprgwxqgqruliqeu@apollo>
Date: Thu, 3 Jun 2021 03:15:13 +0530
From: Kumar Kartikeya Dwivedi <memxor@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: bpf <bpf@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <kafai@...com>,
Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>,
Jamal Hadi Salim <jhs@...atatu.com>,
Vlad Buslov <vladbu@...dia.com>,
Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Joe Stringer <joe@...ium.io>,
Quentin Monnet <quentin@...valent.com>,
Jesper Dangaard Brouer <brouer@...hat.com>,
Toke Høiland-Jørgensen <toke@...hat.com>,
Networking <netdev@...r.kernel.org>
Subject: Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
On Thu, Jun 03, 2021 at 02:39:15AM IST, Andrii Nakryiko wrote:
> On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
> <memxor@...il.com> wrote:
> >
> > This is the first RFC version.
> >
> > This adds a bpf_link path to create TC filters tied to the cls_bpf classifier,
> > and introduces fd-based ownership for such TC filters. Netlink cannot delete or
> > replace such filters, but the bpf_link is severed on indirect destruction of the
> > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> > that filters remain attached beyond process lifetime, the usual bpf_link fd
> > pinning approach can be used.
> >
> > The individual patches contain more details and comments, but the overall kernel
> > API and libbpf helper mirror the semantics of the netlink based TC-BPF API
> > merged recently. This means that we start by always setting direct action mode,
> > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> > options, they can easily be exposed through the bpf_link API in the future.
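(For reference, this is roughly what those defaults look like with the existing
netlink-based libbpf TC API that the link API mirrors; ifindex and prog_fd are
assumed to come from the caller:)

  #include <bpf/libbpf.h>

  /* Attach prog_fd on ifindex ingress via the recently merged netlink-based
   * TC API. libbpf always uses direct-action mode, protocol ETH_P_ALL and the
   * default chain here, which is also what the bpf_link API starts with.
   */
  static int attach_tc_netlink(int ifindex, int prog_fd)
  {
          DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = ifindex,
                              .attach_point = BPF_TC_INGRESS);
          DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = prog_fd);
          int err;

          err = bpf_tc_hook_create(&hook);
          if (err && err != -EEXIST)      /* hook may already exist */
                  return err;
          return bpf_tc_attach(&hook, &opts);
  }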
> >
> > Patch 1 refactors the cls_bpf change function to extract two helpers that will be
> > reused in bpf_link creation.
> >
> > Patch 2 exports some bpf_link management functions to modules. This is needed
> > because our bpf_link object is tied to the cls_bpf_prog object. Tying it to
> > tcf_proto would be awkward: the update path has to replace the offloaded bpf
> > prog, which happens using internal cls_bpf helpers, and it would in general mean
> > more code to abstract over an operation that is unlikely to be implemented for
> > other filter types.
> >
> > Patch 3 adds the main bpf_link API. A function in cls_api takes care of
> > obtaining block reference, creating the filter object, and then calls the
> > bpf_link_change tcf_proto op (only supported by cls_bpf), which returns an fd
> > after setting up the internal structures. An optimization is made to not keep
> > around resources for extended actions; this is explained in a code comment since
> > it wasn't immediately obvious.
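(To make the shape of that hook concrete, the new optional op on struct
tcf_proto_ops could look roughly like the sketch below; the exact signature is
whatever patch 3 defines:)

  /* Sketch only, not the signature from the patch. Optional op on
   * struct tcf_proto_ops; only cls_bpf implements it, and it returns a
   * new bpf_link fd on success.
   */
  int (*bpf_link_change)(struct net *net, struct tcf_proto *tp,
                         struct bpf_prog *prog, u32 flags);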
> >
> > Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports
> > replacing the bpf_prog, we can skip the tc filter's change path by reusing the
> > filter object and just swapping its bpf_prog. This also takes care of replacing
> > the offloaded prog (if that fails, the update is aborted). So far, tcf_classify
> > could do a normal (possibly torn) load, as cls_bpf_prog->filter was never
> > modified concurrently. This is no longer true, and to avoid penalizing the
> > classify hot path, we also cannot impose serialization around its load. Hence
> > the load is changed to READ_ONCE, so that the pointer value is always
> > consistent. Because the classifier runs in an RCU critical section, the lifetime
> > of the prog is guaranteed for the duration of the call.
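(A minimal sketch of that hot-path change, not the literal patch hunk:)

  /* Assumed context: called under rcu_read_lock() from tcf_classify(). */
  static int cls_bpf_classify_one(struct cls_bpf_prog *prog, struct sk_buff *skb)
  {
          /* ->filter may now be swapped concurrently by BPF_LINK_UPDATE, so
           * load the pointer exactly once; a plain load could be torn or
           * re-done by the compiler.
           */
          struct bpf_prog *filter = READ_ONCE(prog->filter);

          /* The update path only releases the old prog after a grace period,
           * so filter stays valid for the duration of this call.
           */
          return BPF_PROG_RUN(filter, skb);
  }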
> >
> > Patches 5 and 6 update the userspace bits and add a bpf_link-returning
> > function to libbpf.
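(Hypothetical usage of that helper; the attach function name and arguments are
placeholders, not the final API, while libbpf_get_error() and bpf_link__pin()
are existing libbpf calls:)

  static int attach_and_pin(struct bpf_program *prog, int ifindex)
  {
          struct bpf_link *link;

          link = bpf_program__attach_tc(prog, ifindex);   /* placeholder name */
          if (libbpf_get_error(link))
                  return -1;

          /* Optionally pin the link so the filter stays attached after this
           * process exits (the usual bpf_link pinning approach).
           */
          return bpf_link__pin(link, "/sys/fs/bpf/my_tc_link");
  }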
> >
> > Patch 7 adds a selftest that exercises all the problematic interactions I could
> > think of.
> >
> > Design:
> >
> > This is where in the object hierarchy our bpf_link object is attached.
> >
> >                                            ┌─────────┐
> >                                            │   BPF   │
> >                                            │ program │
> >                                            └────▲────┘
> >                             ┌─────────┐         │
> >                             │   mod   │  ┌──────┴───────┐
> >                             │ cls_bpf ├──► cls_bpf_prog │
> >                             └────▲────┘  └────┬────▲────┘
> > ┌───────────┐                    │            │    │
> > │ tcf_block │                    │            │    │
> > └─────┬─────┘                    │          ┌─▼────┴───┐
> >       │  ┌───────────┐           │          │ bpf_link │
> >       └──► tcf_chain │           │          └──────────┘
> >          └─────┬─────┘           │
> >                │  ┌───────────┐  │
> >                └──► tcf_proto ├──┘
> >                   └───────────┘
> >
> > The bpf_link is detached on destruction of the cls_bpf_prog. Doing it this way
> > allows us to implement update in a lightweight manner: instead of recreating
> > the filter, we simply replace the BPF prog attached to the cls_bpf_prog.
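(A sketch of what that update-side swap might look like; illustrative only, not
the actual patch, and error handling for the offload replace is omitted:)

  static void cls_bpf_link_swap_prog(struct cls_bpf_prog *cls_prog,
                                     struct bpf_prog *new_prog)
  {
          struct bpf_prog *old_prog;

          /* The real update path first replaces the offloaded prog and
           * aborts if that fails; only then is the new prog published.
           */
          old_prog = xchg(&cls_prog->filter, new_prog);

          /* Readers in the classify path did READ_ONCE() under RCU, so they
           * see either the old or the new prog; drop the reference on the
           * old one.
           */
          bpf_prog_put(old_prog);
  }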
> >
> > The other way to do it would be to link the bpf_link to tcf_proto, but there
> > are several downsides to this:
> >
> > 1. Every filter would have to embed the pointer whenever cls_bpf is compiled
> > in, even though only cls_bpf uses it.
> > 2. It probably doesn't make sense to extend this to other filter types anyway.
> > 3. We wouldn't be able to optimize the update case without adding another
> > bpf_link-specific update operation to tcf_proto ops.
> >
> > The downside of tying this to the module is having to export bpf_link
> > management functions and introduce a tcf_proto op. Hopefully the cost of
> > another operation func pointer is acceptable, as there is only one ops struct
> > per module.
> >
> > This first version is meant to collect feedback on the approach and ideas on
> > whether there is a better way to do this.
>
> Bpf_link-based TC API is a long time coming, so it's great to see
> someone finally working on this. Thanks!
>
> I briefly skimmed through the patch set, noticed a few generic
> bpf_link problems.

Thanks for the review. I'll fix both of these in the resend (also have a couple
of private reports from the kernel test robot).

> But I think main feedback will come from Cilium
> folks and others that heavily rely on TC APIs. I wonder if there is an
> opportunity to simplify the API further given we have a new
> opportunity here. I don't think we are constrained to follow legacy TC
> API exactly.
>
I tried to keep it simple by going for the defaults we agreed upon for the new
netlink based libbpf API (always setting direct action mode, etc.). It is still
in a position to be extended in the future to allow full TC filter setup like
netlink does, if someone ever happens to need that.

As for the implementation, I did notice that there has been discussion around
this (though I could only find [0]), but I think doing it the way this patch
does is more flexible, as you can attach the bpf filter to an arbitrary
parent/class, not just ingress and egress, and it can coexist with a
conventional TC setup.

[0]: https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html
("Note that there is ongoing work ...")
> The problem is that your patch set was marked as spam by Google, so I
> suspect a bunch of folks haven't gotten it. I suggest re-sending it
> again but trimming down the CC list, leaving only bpf@...r,
> netdev@...r, and BPF maintainers CC'ed directly.
>
Thanks for the heads up, I'll resend tomorrow.
--
Kartikeya