[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAM0EoMkzjomkpYQ5XZQLsrhuT2ZDhJy_GwiMJmnxL4eJ-s+v1g@mail.gmail.com>
Date: Wed, 21 Feb 2024 09:51:29 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: netdev@...r.kernel.org, deb.chatterjee@...el.com, anjali.singhai@...el.com,
namrata.limaye@...el.com, tom@...anda.io, mleitner@...hat.com,
Mahesh.Shirshyad@....com, tomasz.osinski@...el.com, jiri@...nulli.us,
xiyou.wangcong@...il.com, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com, vladbu@...dia.com, horms@...nel.org,
khalidm@...dia.com, toke@...hat.com, mattyk@...dia.com, bpf@...r.kernel.org,
Victor Nogueira <victor@...atatu.com>, Pedro Tammela <pctammela@...atatu.com>
Subject: Re: [PATCH v10 net-next 15/15] p4tc: add P4 classifier
On Tue, Feb 20, 2024 at 10:49 AM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> On 2/16/24 10:18 PM, Jamal Hadi Salim wrote:
> > On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> >> On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@...earbox.net> wrote:
> >>> On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
> >>>> On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@...earbox.net> wrote:
> >>>>> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> >>> [...]
> >>>>>>
> >>>>>> It should also be noted that it is feasible to split some of the ingress
> >>>>>> datapath into XDP first and more into TC later (as was shown above for
> >>>>>> example where the parser runs at XDP level). YMMV.
> >>>>>> Regardless of choice of which scheme to use, none of these will affect
> >>>>>> UAPI. It will all depend on whether you generate code to load on XDP vs
> >>>>>> tc, etc.
> >>>>>>
> >>>>>> Co-developed-by: Victor Nogueira <victor@...atatu.com>
> >>>>>> Signed-off-by: Victor Nogueira <victor@...atatu.com>
> >>>>>> Co-developed-by: Pedro Tammela <pctammela@...atatu.com>
> >>>>>> Signed-off-by: Pedro Tammela <pctammela@...atatu.com>
> >>>>>> Signed-off-by: Jamal Hadi Salim <jhs@...atatu.com>
> >>>>>
> >>>>> My objections from last iterations still stand, and I also added a nak,
> >>>>> so please do not just drop it with new revisions.. from the v10 as you
> >>>>> wrote you added further code but despite the various community feedback
> >>>>> the design still stands as before, therefore:
> >>>>>
> >>>>> Nacked-by: Daniel Borkmann <daniel@...earbox.net>
> >>>>
> >>>> We didnt make code changes - but did you read the cover letter and the
> >>>> extended commentary in this patch's commit log? We should have
> >>>> mentioned it in the changes log. It did respond to your comments.
> >>>> There's text that says "the filter manages the lifetime of the
> >>>> pipeline" - which in the future could include not only tc but XDP but
> >>>> also the hardware path (in the form of a file that gets loaded). I am
> >>>> not sure if that message is clear. Your angle being this is layer
> >>>> violation. In the last discussion i asked you for suggestions and we
> >>>> went the tcx route, which didnt make sense, and then you didnt
> >>>> respond.
> >>> [...]
> >>>
> >>>>> Also as mentioned earlier I don't think tc should hold references on
> >>>>> XDP programs in here. It doesn't make any sense aside from the fact
> >>>>> that the cls_p4 is also not doing anything with it. This is something
> >>>>> that a user space control plane should be doing i.e. managing a XDP
> >>>>> link on the target device.
> >>>>
> >>>> This is the same argument about layer violation that you made earlier.
> >>>> The filter manages the p4 pipeline - i.e it's not just about the ebpf
> >>>> blob(s) but for example in the future (discussions are still ongoing
> >>>> with vendors who have P4 NICs) a filter could be loaded to also
> >>>> specify the location of the hardware blob.
> >>>
> >>> Ah, so there is a plan to eventually add HW offload support for cls_p4?
> >>> Or is this only specifiying a location of a blob through some opaque
> >>> cookie value from user space?
> >>
> >> Current thought process is it will be something along these lines (the
> >> commit provides more details):
> >>
> >> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
> >> prog type hw filename "mypnameprog.o" ... \
> >> prog type xdp obj $PARSER.o section parser/xdp pinned_link
> >> /sys/fs/bpf/mylink \
> >> action bpf obj $PROGNAME.o section prog/tc-ingress
> >>
> >> These discussions are still ongoing - but that is the current
> >> consensus. Note: we are not pushing any code for that, but hope it
> >> paints the bigger picture....
> >> The idea is the cls p4 owns the lifetime of the pipeline. Installing
> >> the filter instantiates the p4 pipeline "simple_l3" and triggers a lot
> >> of the refcounts to make sure the pipeline and its components stays
> >> alive.
> >> There could be multiple such filters - when someone deletes the last
> >> filter, then it is safe to delete the pipeline.
> >> Essentially the filter manages the lifetime of the pipeline.
> >>
> >>>> I would be happy with a suggestion that gets us moving forward with
> >>>> that context in mind.
> >>>
> >>> My question on the above is mainly what does it bring you to hold a
> >>> reference on the XDP program? There is no guarantee that something else
> >>> will get loaded onto XDP, and then eventually the cls_p4 is the only
> >>> entity holding the reference but w/o 'purpose'. We do have BPF links
> >>> and the user space component orchestrating all this needs to create
> >>> and pin the BPF link in BPF fs, for example. An artificial reference
> >>> on XDP prog feels similar as if you'd hold a reference on an inode
> >>> out of tc.. Again, that should be delegated to the control plane you
> >>> have running interacting with the compiler which then manages and
> >>> loads its artifacts. What if you would also need to set up some
> >>> netfilter rules for the SW pipeline, would you then embed this too?
> >>
> >> Sorry, a slight tangent first:
> >> P4 is self-contained, there are a handful of objects that are defined
> >> by the spec (externs, actions, tables, etc) and we model them in the
> >> patchset, so that part is self-contained. For the extra richness such
> >> as the netfilter example you quoted - based on my many years of
> >> experience deploying SDN - using daemons(sorry if i am reading too
> >> much in what I think you are implying) for control is not the best
> >> option i.e you need all kinds of coordination - for example where do
> >> you store state, what happens when the daemon dies, how do you
> >> graceful restarts etc. Based on that, if i can put things in the
> >> kernel (which is essentially a "perpetual daemon", unless the kernel
> >> crashes) it's a lot simpler to manage as a source of truth especially
> >> when there is not that much info. There is a limit when there are
> >> multiple pieces (to use your netfilter example) because you need
> >> another layer to coordinate things.
>
> 'source of truth' for the various attach points or BPF links, yes, but in
> this case here it is not, since the source of truth on what is attached
> is not in cls_p4 but rather on the XDP link. How do you handle the case
> when cls_p4 says something different to what is /actually/ attached? Why
> is it not enough to establish some convention in user space, to pin the
> link and retrieve/update from there when needed? Like everyone else does.
> ... even if you consider iproute2 your "control plane" (which I have the
> feeling you do)?
>
> >> Re: the XDP part - our key reason is mostly managerial, in that the
> >> filter is the lifetime manager of the pipeline; and that if i dump
>
> This is imho the problematic part which feels like square peg in round
> hole, trying to fit this whole lifetime manager of the pipeline into
> the cls_p4 filter. We agree to disagree here. Instead of reusing
> individual building blocks from user space, this tries to cramp control
> plane parts into the kernel for which its not a great fit with what is
> build here as-is.
>
> >> that filter i can see all the details in regards to the pipeline(tc,
> >> XDP and in future hw, etc) in one spot. You are right, the link
> >> pinning is our protection from someone replacing the XDP prog (this
> >> was a tip from Toke in the early days) and the comparison of tc
> >> holding inode is apropos.
> >> There's some history: in the early days we were also using metadata
> >> which comes from the XDP program at the tc layer if more processing
> >> was to be done (and there was extra metadata which told us which XDP
> >> prog produced it which we would vet before trusting the metadata).
> >> Given all the above, we should still be able to hold this info without
> >> necessarily holding the extra refcount and be able to see this detail.
> >> So we can remove the refcounting.
> >
> > Daniel?
>
> The refcount should definitely be removed, but then again, see the point
> above in that it is inconsistent information. Why can't this be done in
> user space with some convention in your user space control plane - if you
> take iproute2, then why it cannot pin the link in a bpf fs instance and
> retrieve it from there?
Ok, Daniel - let's do this so we can move forward. I am getting
exhausted, we've been going at this for a year now. As a compromise: I
will remove the support for XDP altogether from the filter. We will
still reference the XDP program in the CLI and infact load and pin it
that way but the filter will not be adding a refcount in the kernel as
in the posted patch.
cheers,
jamal
Powered by blists - more mailing lists