Date:   Mon, 30 Jan 2023 09:06:50 -0500
From:   Jamal Hadi Salim <jhs@...atatu.com>
To:     Toke Høiland-Jørgensen <toke@...hat.com>
Cc:     Jiri Pirko <jiri@...nulli.us>,
        John Fastabend <john.fastabend@...il.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Stanislav Fomichev <sdf@...gle.com>,
        Jamal Hadi Salim <hadi@...atatu.com>,
        Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
        kernel@...atatu.com, deb.chatterjee@...el.com,
        anjali.singhai@...el.com, namrata.limaye@...el.com,
        khalidm@...dia.com, tom@...anda.io, pratyush@...anda.io,
        xiyou.wangcong@...il.com, davem@...emloft.net, edumazet@...gle.com,
        pabeni@...hat.com, vladbu@...dia.com, simon.horman@...igine.com,
        stefanc@...vell.com, seong.kim@....com, mattyk@...dia.com,
        dan.daly@...el.com, john.andy.fingerhut@...el.com
Subject: Re: [PATCH net-next RFC 00/20] Introducing P4TC

So I don't have to respond to each email individually, I will respond
here in no particular order. First let me provide some context; if
that is already clear, please skip ahead. Hopefully the context will
help us focus, otherwise that bikeshed's color and shape will take
forever to settle on.

__Context__

I hope we all agree that when you have a 2x100G NIC (and I have seen
people asking for 2x800G NICs), no XDP or DPDK is going to save you.
To visualize: one 25G port is about 35Mpps unidirectional. So a
"software stack" is not the answer; you need to offload. I would argue
further that in the near future a lot of the stack, including
transport, will eventually have to move partially or fully into
hardware (see the HOMA keynote for a sample space[0]). CPUs are not
going to keep up with the massive IO requirements. By offload I do not
mean NIC vendors providing you checksum, clever RSS, basic metadata or
timestamp offload; those will continue to be needed, but they are a
different scope altogether. Neither are we trying to address transport
offload in P4TC.
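(A quick back-of-envelope check of that number, assuming minimum-size
64-byte frames, which occupy 84 bytes on the wire once preamble and
inter-frame gap are counted:

  25 Gbit/s / (84 bytes * 8 bits/byte) ~= 37 Mpps unidirectional

so ~35Mpps is the right ballpark, and a 2x100G NIC is roughly an order
of magnitude beyond that.)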

I hope we also agree that the MAT (match-action table) construct is
well understood and that we have many years of good experience with it
in both software (TC) and hardware deployments. P4 is a _standardized_
specification for describing these constructs. P4 is by no means
perfect, but it is an established standard, and it is already being
used to provide requirements to NIC vendors today (regardless of the
underlying implementation).
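For concreteness, here is a single MAT entry in today's TC, expressed
with the (kernel-hard-coded) flower classifier; the device names and
address below are made up:

  # match on IP destination (the "match"), redirect (the "action")
  tc filter add dev eth0 ingress protocol ip flower \
     dst_ip 192.0.2.1 \
     action mirred egress redirect dev eth1

P4 describes tables of exactly this match/action shape, except the
keys and actions are defined by the program rather than baked into the
classifier.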

So what are we trying to achieve with P4TC? John, I could have done a
better job of describing the goals in the cover letter: we are going
for MAT software equivalence to what is in hardware, a two-fer that is
already provided by the existing TC infrastructure. Scriptability is
not a new idea in TC (see u32, pedit and others). IOW, we are reusing
and plugging into a proven, deployed mechanism with a built-in,
policy-driven, transparent symbiosis between hardware offload and
software that has matured over time: you can take a pipeline, a table
or actions and split them between hardware and software transparently
(sketched below), etc. This hammer already meets our goals; it is
about using the appropriate tool for the problem at hand. We are not
going to rewrite that infra in rust or ebpf just because. If the
argument is about performance (see the point above on 2x100G ports):
we care about software performance, but we care more about
equivalence. I will put it this way: if we are confronted with a
design choice between forgoing equivalence and getting better software
performance, we will trade off the performance. If you want wire-speed
performance, offload.
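As a hedged sketch of that policy-driven split, using today's flower
offload knobs (device and addresses made up): per-filter flags pin an
entry to hardware or to software, and with no flag the kernel programs
the hardware and keeps the software datapath as its twin:

  # hardware only: insertion fails if the NIC cannot take it
  tc filter add dev eth0 ingress prio 1 protocol ip flower skip_sw \
     dst_ip 198.51.100.1 action drop

  # software only: never offloaded
  tc filter add dev eth0 ingress prio 2 protocol ip flower skip_hw \
     dst_ip 198.51.100.2 action drop

The point of P4TC is to keep those same semantics for tables and
actions that are not hard-coded in the kernel.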

__End Context__

So now let me respond to the points raised.

Jiri, I think one of the concerns you have is that there is no way to
generalize the different hardware by using a single abstraction, since
all hardware may have different architectures (e.g. RMT vs dRMT, a
mesh processing xbar, TCAM, SRAM, host DRAM, etc.) which may
necessitate things like underlying table reordering, merging, sorting,
etc. The consensus is that the vendor driver is responsible for
"transforming" the P4 abstractions into whatever your hardware does.
The standardized abstraction is P4. Each P4 object (match or action)
has an ID and attributes - just as we do today with flower, with the
exception that it is not hard-coded in the kernel. So if the tc ndo
gets a callback to add an entry that will match header and/or metadata
X on table Y and execute action Z, the driver should take care of
figuring out how that transforms into its appropriate hardware layout.
IOW, your hardware does not have to be P4-native; it just has to
accept the constructs.
To emphasize that folks are already doing this: see the MS DASH
project, where many NIC vendors (if I am not mistaken xilinx,
pensando, intel mev, nvidia bluefield, some startups, etc.) all
consume P4 and may implement it differently.

The next question is how you teach the driver what the different P4
object IDs mean and how you load the P4 objects into the hardware. We
need to reach consensus on that for sure; there are multiple
approaches we explored: you could go directly from netlink using the
templating DSL, you could go via devlink, or you could have a hybrid
of the two. Initially different vendors thought differently, but they
seem to be settling on devlink. From a TC perspective, the ndo
callbacks for runtime do not change.
Toke, I labelled that one option as IMpossible as a parody - it is
what the vendors are saying today, and my play on words is "even
impossible says IM possible". The challenge we have is that while some
vendor may have a driver and an approach that works, we need to reach
a consensus instead of one vendor dictating the approach we use.

John, I hope I have addressed some of your commentary above. The
current approach vendors are taking is a total bypass of the kernel
for offload (we are getting our asses handed to us). The kernel is
used to configure control, then it punts to user space, and then you
invoke a vendor-proprietary API - and every vendor has their own API.
If you are sourcing NICs from multiple vendors, this is bad for the
consumer (unless you are a hyperscaler, in which case almost all are
writing their own proprietary user space stacks). Are you pitching
that model? The synced hardware + software is already provided by TC -
why punt to user space?

cheers,
jamal

[0] https://netdevconf.info/0x16/session.html?keynote-ousterhout

On Mon, Jan 30, 2023 at 6:27 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>
> Jiri Pirko <jiri@...nulli.us> writes:
>
> >>P4TC as SW/HW running same P4:
> >>
> >>1. This doesn't need to be done in kernel. If one compiler runs
> >>   P4 into XDP or TC-BPF that is good and another compiler runs
> >>   it into hw specific backend. This satisfies having both
> >>   software and hardware implementation.
> >>
> >>Extra commentary: I agree we've been chatting about this for a long
> >>time but until some vendor (Intel?) will OSS and support a linux
> >>driver and hardware with open programmable parser and MAT. I'm not
> >>sure how we get P4 for Linux users. Does it exist and I missed it?
> >
> >
> > John, I think that your summary is quite accurate. Regarding SW
> > implementation, I have to admit I also fail to see motivation to have P4
> > specific datapath instead of having XDP/eBPF one, that could run P4
> > compiled program. The only motivation would be that if somehow helps to
> > offload to HW. But can it?
>
> According to the slides from the netdev talk[0], it seems that
> offloading will have to have a component that goes outside of TC anyway
> (see "Model 3: Joint loading" where it says "this is impossible"). So I
> don't really see why having this interpreter in TC help any.
>
> Also, any control plane management feature specific to managing P4 state
> in hardware could just as well manage a BPF-based software path on the
> kernel side instead of the P4 interpreter stuff...
>
> -Toke
>
> [0] https://netdevconf.info/0x16/session.html?Your-Network-Datapath-Will-Be-P4-Scripted
>
