Message-ID: <CAM0EoMkBHqRU9tprJ-SK3tKMfcGsnydp0UA9cH2ALjpSNyJhig@mail.gmail.com>
Date: Mon, 20 Nov 2023 17:56:50 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: Jiri Pirko <jiri@...nulli.us>, John Fastabend <john.fastabend@...il.com>, netdev@...r.kernel.org,
deb.chatterjee@...el.com, anjali.singhai@...el.com, Vipin.Jain@....com,
namrata.limaye@...el.com, tom@...anda.io, mleitner@...hat.com,
Mahesh.Shirshyad@....com, tomasz.osinski@...el.com, xiyou.wangcong@...il.com,
davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com,
vladbu@...dia.com, horms@...nel.org, bpf@...r.kernel.org, khalidm@...dia.com,
toke@...hat.com, mattyk@...dia.com, dan.daly@...el.com,
chris.sommers@...sight.com, john.andy.fingerhut@...el.com
Subject: Re: [PATCH net-next v8 00/15] Introducing P4TC
On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@...nulli.us> wrote:
> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@...atatu.com wrote:
> >>> On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@...nulli.us> wrote:
> >>>> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@...atatu.com wrote:
> >>>>> On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@...il.com> wrote:
> >>>>>> Jamal Hadi Salim wrote:
> >>>>>>> On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@...il.com> wrote:
> >>>>>>>> Jamal Hadi Salim wrote:
> >>>>
> >>>> [...]
> >>>>
> >>>>>> I think I'm judging the technical work here. Bullet points.
> >>>>>>
> >>>>>> 1. The p4c-tc implementation looks like it should be slower,
> >>>>>> in terms of pkts/sec, than a bpf implementation. Meaning
> >>>>>> I suspect a pipeline and objects laid out like this will lose
> >>>>>> to a BPF program with a parser and a single lookup. The p4c-ebpf
> >>>>>> compiler should look to create optimized eBPF code, not some
> >>>>>> emulated switch topology.
> >>>>>
> >>>>> The parser is ebpf-based. The other objects, which require control
> >>>>> plane interaction, are not - those interact via netlink.
> >>>>> We published perf data a while back - presented at the P4 workshop
> >>>>> back in April (it was in the cover letter):
> >>>>> https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> >>>>> But do note: the correct abstraction is the first priority.
> >>>>> Optimization is something we can teach the compiler over time. But
> >>>>> even with the minimalist code generation you can see that our approach
> >>>>> always beats ebpf in LPM and ternary. The other ones I am pretty sure
> >>>>
> >>>> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> >>>> these kinds of lookups? I mean, in theory, eBPF should always be faster.
> >>>
> >>> We didn't look closely; however, that is not the point - the point is
> >>> that the perf difference, if there is one, is not big, with the big win
> >>> being proper P4 abstraction. For LPM, our algorithmic approach is for
> >>> sure better. For ternary, the compute intensity in looping is better
> >>> done in C. And for exact match, I believe that ebpf uses better hashing.
> >>> Again, that is not the point we were trying to validate in those experiments.
> >>>
> >>> On your point of "maps are not that suitable": P4 tables tend to have
> >>> very specific attributes (for example, associated meters, counters,
> >>> default hit and miss actions, etc.).
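To make that concrete, here is a hypothetical sketch - the names are
made up for illustration and are not the actual P4TC structs - of what
a single P4 table carries beyond what a plain key/value map models:

    #include <linux/types.h>

    /* Hypothetical illustration only - NOT the real P4TC structures. */
    struct example_p4_action;   /* an action instance (e.g. drop, redirect) */
    struct example_p4_meter;    /* a direct meter bound to the table */
    struct example_p4_counter;  /* a direct counter bound to the table */

    struct example_p4_table {
            const char *name;
            __u32 keysz;        /* key width in bits */
            __u32 max_entries;
            /* attributes a plain key/value map does not model: */
            struct example_p4_action  *default_hit_action;
            struct example_p4_action  *default_miss_action;
            struct example_p4_meter   *meter;
            struct example_p4_counter *counter;
            /* ... permissions, per-entry timers, const entries, etc. */
    };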
> >>>
> >>>>> we can optimize over time.
> >>>>> Your view of "single lookup" is true for simple programs, but if you
> >>>>> have 10 tables trying to model a 5G function then it doesn't make sense
> >>>>> (and I think the data we published was clear that you gain no
> >>>>> advantage using ebpf - as a matter of fact there was no perf
> >>>>> difference between XDP and tc in such cases).
> >>>>>
> >>>>>> 2. The p4c-tc control plane looks slower than a directly mmaped bpf
> >>>>>> map. Doing a simple update vs a netlink msg. The argument
> >>>>>> that BPF can't do CRUD (which we had offlist) seems incorrect
> >>>>>> to me. Correct me if I'm wrong with details about why.
> >>>>>
> >>>>> So let me see....
> >>>>> you want me to replace netlink and all its features and rewrite it
> >>>>> using the ebpf system calls? Congestion control, event handling,
> >>>>> arbitrary message crafting, etc., and the years of work that went into
> >>>>> netlink? Hell, no.
> >>>>
> >>>> Wait, I don't think John suggests anything like that. He just suggests
> >>>> having the tables as eBPF maps.
> >>>
> >>> What's the difference? Unless maps can do netlink.
> >>>
> >>>> Honestly, I don't understand the
> >>>> fixation on netlink. Its socket messaging, memcpies, processing
> >>>> overhead, etc can't keep up with mmaped memory access at scale. Measure
> >>>> that and I bet you'll get drastically different results.
> >>>>
> >>>> I mean, netlink is good for a lot of things, but that does not mean it
> >>>> is a universal answer to userspace<->kernel data passing.
> >>>
> >>> Here's a small sample of our requirements that are satisfied by
> >>> netlink for P4 object hierarchy[1]:
> >>> 1. Msg construction/parsing
> >>> 2. Multi-user request/response messaging
> >>
> >> What is actually a use case for having multiple users program the p4
> >> pipeline in parallel?
> >
> > First of all - this is Linux; multiple users are a way of life, and you
> > shouldn't have to ask that question unless you are trying to be
> > Socratic. Meaning multiple control plane apps can be allowed to
> > program different parts and even different tables - think multi-tier
> > pipeline.
> >
> >>> 3. Multi-user event subscribe/publish messaging
> >>
> >> Same here. What is the use case for multiple users receiving p4 events?
> >
> > Same thing.
> > Note: events are not really part of P4, but we added them for
> > flexibility - and, as you well know, they are useful.
> >
> >>> I don't think I need to provide an explanation of the differences here
> >>> vis-a-vis what ebpf system calls provide vs what netlink provides and
> >>> how netlink is a clear fit. If it is not clear I can give more
> >>
> >> It is not :/
> >
> > I thought it was obvious to someone like you, but fine - here goes for those 3:
> >
> > 1. Msg construction/parsing: a lot of infra for sending attributes
> > back and forth is already built into netlink. I would have to create
> > mine from scratch for ebpf. This would include not just the
> > construction/parsing but all the detailed attribute content policy
> > validations (even in the presence of hierarchies) that come with it.
> > And let's not forget the state transformation between kernel and user space.
> >
> > 2. Multi-user request/response messaging
> > If you can write all the code for #1 above, then this should work fine for ebpf.
> >
> > 3. Event publish/subscribe
> > You would have to create mechanisms for ebpf which are either
> > non-trivial or incomplete. Example 1: you could put surgeries in the
> > ebpf code to look at map manipulations and then interface that to some
> > event management scheme which checks for subscribed users. Example 2:
> > it may also be feasible to create your own map for subscription vs
> > something like a perf ring for event publication (something I have
> > done in the past), but that is also limited in many ways.
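Just to make #3 above concrete: with netlink, event subscription is
simply joining a multicast group on the socket, and the kernel fans
notifications out to every listener. A minimal userspace sketch - using
the rtnetlink TC group here purely as an illustration; P4TC events
would arrive on their own group:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/rtnetlink.h>

    int main(void)
    {
            /* Subscribing to kernel events == joining a netlink multicast
             * group. Any number of listeners can join; each gets a copy.
             */
            struct sockaddr_nl addr = {
                    .nl_family = AF_NETLINK,
                    .nl_groups = RTMGRP_TC,  /* illustrative group choice */
            };
            char buf[8192];
            int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

            if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                    return 1;

            for (;;) {
                    ssize_t len = recv(fd, buf, sizeof(buf), 0);

                    if (len <= 0)
                            break;
                    printf("received %zd bytes of tc notifications\n", len);
            }
            close(fd);
            return 0;
    }

None of this has to be hand-built per subsystem; it comes with netlink.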
>
> I still don't think this answers all the questions on why the netlink
> shim layer. The kfuncs are essentially available to all of tc BPF, and
> I don't think there was a discussion of why they cannot be made generic
> in a way that would benefit all tc/XDP BPF users. With patch
> 14 you are more or less copying what already exists with {cls,act}_bpf,
> just that you also allow XDP loading from tc(?). We do have existing
> interfaces for XDP program management.
>
I am not sure I followed - but we are open to suggestions to improve
operational usability.
> tc BPF and XDP already have widely used infrastructure and can be developed
> against libbpf or other user space libraries for a user space control plane.
> With 'control plane' you refer here to the tc / netlink shim you've built,
> but looking at the tc command line examples, this doesn't really provide a
> good user experience (you call it p4 but people load bpf obj files). If the
> expectation is that an operator should run tc commands, then it's a nice
> experience neither for p4 nor for BPF folks. From a BPF PoV, we moved over
> to bpf_mprog and plan to also extend this for XDP to have a common look and
> feel wrt networking for developers. Why can't this be reused?
The filter loading which loads the program is considered pipeline
instantiation - consider it "provisioning" more than "control",
which runs at runtime. "Control" is purely netlink based. The iproute2
code we use links libbpf, for example, for the filter (a rough
libbpf-level sketch of that provisioning step is below). If we can
achieve the same with bpf_mprog then sure - we just don't want to lose
functionality though. Off the top of my head, some sample space:
- we could have multiple pipelines with different priorities (which tc
provides to us) - and each pipeline may have its own logic with many
tables etc. (and the choice to iterate to the next one is essentially
encoded in the tc action codes)
- we use a tc block to map groups of ports (which I don't think bpf has
internal access to)
In regards to usability: no, I don't expect someone doing things at
scale to use the tc command line. The APIs are via netlink. But the tc
cli is a must for the rest of the masses, per our traditions. Also, I
really didn't even want to use ebpf at all, for operator experience
reasons - it requires a compilation of the code and an extra loading
step compared to what our original u32/pedit code offered.
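For clarity on the provisioning step mentioned above: it is essentially
"load the compiled object and attach it at tc ingress". A rough
libbpf-level sketch, assuming a compiled object called parser.o with a
program named p4_parser (illustrative names only; the real flow goes
through the iproute2 tc filter code):

    #include <errno.h>
    #include <net/if.h>
    #include <bpf/libbpf.h>

    /* Rough sketch of the one-time "provisioning" step: load the compiled
     * object and attach it at tc ingress. "parser.o" and "p4_parser" are
     * illustrative names, not the actual artifacts.
     */
    int provision(const char *ifname)
    {
            struct bpf_object *obj;
            struct bpf_program *prog;
            int ifindex = if_nametoindex(ifname);
            int err;

            if (!ifindex)
                    return -1;

            obj = bpf_object__open_file("parser.o", NULL);
            if (!obj || bpf_object__load(obj))
                    return -1;

            prog = bpf_object__find_program_by_name(obj, "p4_parser");
            if (!prog)
                    return -1;

            LIBBPF_OPTS(bpf_tc_hook, hook,
                        .ifindex = ifindex,
                        .attach_point = BPF_TC_INGRESS);
            LIBBPF_OPTS(bpf_tc_opts, opts,
                        .prog_fd = bpf_program__fd(prog));

            err = bpf_tc_hook_create(&hook);  /* creates clsact if needed */
            if (err && err != -EEXIST)
                    return err;
            return bpf_tc_attach(&hook, &opts);
    }

After that one-time step, all runtime table manipulation stays on netlink.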
> I don't quite follow why most of this could not be implemented entirely in
> user space, without this detour, with you providing a developer
> library which could then be integrated into a p4 runtime/frontend. This
> way users never interface with the ebpf parts nor with tc, given they also
> shouldn't have to - it's an implementation detail. This is what John was
> also pointing out earlier.
>
Netlink is the API. We will provide a library for object manipulation
which abstracts away the need to know netlink (a rough sketch of the
shape of such a library is below). Someone who for their own reasons
wants to use p4runtime or TDI could write on top of this. I would not
design a kernel interface just to meet p4runtime (we already have TDI,
which came later and does things differently). So I expect us to
support both of those two. And if I were to do something on SDN that
was more robust, I would write my own that still uses these netlink
interfaces.
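To be clear about what I mean by such a library, here is a purely
hypothetical sketch of its shape - the names and signatures are made
up; each call would construct and send the corresponding netlink
messages underneath:

    /* Purely hypothetical sketch of an object-manipulation library that
     * hides netlink; names and signatures are illustrative only.
     */
    struct p4lib_pipeline;
    struct p4lib_table;

    struct p4lib_pipeline *p4lib_pipeline_open(const char *name);
    void p4lib_pipeline_close(struct p4lib_pipeline *p);

    struct p4lib_table *p4lib_table_get(struct p4lib_pipeline *p,
                                        const char *table_name);

    /* CRUD on table entries; key/param blobs laid out per the P4 program */
    int p4lib_entry_create(struct p4lib_table *t, const void *key,
                           unsigned int keylen, const char *action,
                           const void *params, unsigned int paramslen);
    int p4lib_entry_update(struct p4lib_table *t, const void *key,
                           unsigned int keylen, const char *action,
                           const void *params, unsigned int paramslen);
    int p4lib_entry_delete(struct p4lib_table *t, const void *key,
                           unsigned int keylen);

    /* Event subscription: cb fires on entry create/update/delete events */
    typedef void (*p4lib_event_cb)(struct p4lib_table *t, int event, void *ctx);
    int p4lib_subscribe(struct p4lib_pipeline *p, p4lib_event_cb cb, void *ctx);

A p4runtime or TDI frontend would sit on top of calls like these rather
than talking netlink directly.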
> If you need a notification/subscribe mechanism for map updates, then this
> could be extended... the same way BPF internals got extended along with the
> sched_ext work, making the core pieces more useful outside of the latter as well.
>
Why? I already have this working great right now with netlink.
> The slides linked below are not public, so it's hard to see what is really
> meant here, but I have also never seen an email from the speaker on the BPF
> mailing list providing concrete feedback(?). People do build control planes
> around BPF in the wild. I'm not sure where you take 'flash LEDs' from; to
> me this all sounds rather hand-wavy and like an attempt to brute-force the
> netlink fixation you went with, which is raising questions. I don't think
> there was an objection to going with eBPF, but rather to all this infra for
> the former for a SW-only extension.
There are a handful of people who are holding up the release of the
slides (I will go and chase them after this).
BTW, our experience in regards to usability of the eBPF control plane
is the same as Ivan's. I was listening to the talk and just nodding
along. You focused too much on the datapath and did a good job there,
but I am afraid not so much on the usability of the control path. My
view is: to create a back and forth with the kernel for something as
complex as what we have, using the ebpf system calls vs netlink, you
would need to spend a lot more developer resources in the ebpf case.
If you want to call what I have "the fixation on netlink", maybe you
are fixated on the ebpf syscall? ;->
cheers,
jamal
> [...]
> >>>>> I should note that there was an interesting talk at netdevconf 0x17
> >>>>> where the speaker showed the challenges of dealing with ebpf on "day
> >>>>> two" - slides or videos are not up yet, but the link is:
> >>>>> https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> >>>>> The point the speaker was making is that it's always easy to whip up
> >>>>> an ebpf program that can slice and dice packets and maybe even flash
> >>>>> LEDs, but the real work and challenge is in the control plane. I agree
> >>>>> with the speaker based on my experiences. This discussion of replacing
> >>>>> netlink with ebpf system calls is absolutely a non-starter. Let's just
> >>>>> end the discussion and agree to disagree if you are going to keep
> >>>>> insisting on that.