Date: Tue, 5 Dec 2023 23:32:05 +0100
From: Daniel Borkmann <daniel@...earbox.net>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: John Fastabend <john.fastabend@...il.com>, netdev@...r.kernel.org,
 deb.chatterjee@...el.com, anjali.singhai@...el.com,
 namrata.limaye@...el.com, mleitner@...hat.com, Mahesh.Shirshyad@....com,
 tomasz.osinski@...el.com, jiri@...nulli.us, xiyou.wangcong@...il.com,
 davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
 pabeni@...hat.com, vladbu@...dia.com, horms@...nel.org, khalidm@...dia.com,
 toke@...hat.com, bpf@...r.kernel.org
Subject: Re: [PATCH net-next v9 15/15] p4tc: add P4 classifier

On 12/5/23 5:23 PM, Jamal Hadi Salim wrote:
> On Tue, Dec 5, 2023 at 8:43 AM Daniel Borkmann <daniel@...earbox.net> wrote:
>> On 12/5/23 1:32 AM, John Fastabend wrote:
>>> Jamal Hadi Salim wrote:
>>>> Introduce P4 tc classifier. A tc filter instantiated on this classifier
>>>> is used to bind a P4 pipeline to one or more netdev ports. To use P4
>>>> classifier you must specify a pipeline name that will be associated to
>>>> this filter, a s/w parser and datapath ebpf program. The pipeline must have
>>>> already been created via a template.
>>>> For example, if we were to add a filter to ingress of network interface
>>>> device $P0 and associate it to P4 pipeline simple_l3 we'd issue the
>>>> following command:
>>>
>>> In addition to my comments from last iteration.
>>>
>>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
>>>>       action bpf obj $PARSER.o section prog/tc-parser \
>>>>       action bpf obj $PROGNAME.o section prog/tc-ingress
>>>
>>> Having multiple object files is a mistake IMO and will cost
>>> performance. A single object file avoids stitching together
>>> metadata and can run to completion. And then run entirely from XDP -
>>> this is how we have been getting good performance numbers.
>>
>> +1, fully agree.
> 
> As I stated earlier: while performance is important, it is not the
> highest priority for what we are doing; correctness is. We don't
> want to be wrestling with the verifier or some other limitation like
> tail call limits just to gain an increase of a few kpps. We are taking a
> gamble with the parser, which is not using any kfuncs at the moment.
> Putting them all in one program will increase the risk.

I don't think this is a good reason; it corners you into UAPI which
cannot be changed later on. If you encounter such issues, why not
bring the actual concrete examples / limitations you run into to the
BPF community and help one way or another to get the verifier
improved? (Again, see sched_ext as one example of improving the
verifier; concrete bug reports etc. could also help.)

> As I responded to you earlier, we just don't want to lose
> functionality; some sample space:
> - we could have multiple pipelines with different priorities - and
> each pipeline may have its own logic with many tables etc (and the
> choice to iterate the next one is essentially encoded in the tc action
> codes)
> - we want to be able to split the pipeline into parts that can run _in
> unison_ in h/w, xdp, and tc

So the parser runs at XDP, but then you push the packet up the stack
(instead of staying only at the XDP layer) just to reach into the tc
layer to perform a corresponding action.. and this just to work around
the verifier, as you say?

> - we use tc block to map groups of ports heavily
> - we use netlink as our control API
> 
>>>> $PROGNAME.o and $PARSER.o is a compilation of the eBPF programs generated
>>>> by the P4 compiler and will be the representation of the P4 program.
>>>> Note that filter understands that $PARSER.o is a parser to be loaded
>>>> at the tc level. The datapath program is merely an eBPF action.
>>>>
>>>> Note we do support a distinct way of loading the parser as opposed to
>>>> making it be an action, the above example would be:
>>>>
>>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
>>>>       prog type tc obj $PARSER.o ... \
>>>>       action bpf obj $PROGNAME.o section prog/tc-ingress
>>>>
>>>> We support two types of loadings of these initial programs in the pipeline
>>>> and differentiate between what gets loaded at tc vs xdp by using syntax of
>>>>
>>>> either "prog type tc obj" or "prog type xdp obj"
>>>>
>>>> For XDP:
>>>>
>>>> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
>>>>       prog type xdp obj $PARSER.o section parser/xdp \
>>>>       pinned_link /sys/fs/bpf/mylink \
>>>>       action bpf obj $PROGNAME.o section prog/tc-ingress
>>>
>>> I don't think tc should be loading xdp programs. XDP is not 'tc'.
>>
>> For XDP, we do have a separate attach API, for BPF links we have bpf_xdp_link_attach()
>> via bpf(2) and regular progs we have the classic way via dev_change_xdp_fd() with
>> IFLA_XDP_* attributes. Mid-term we'll also add bpf_mprog support for XDP to allow
>> multi-user attachment. tc kernel code should not add yet another way of attaching XDP,
>> this should just reuse existing uapi infra instead from userspace control plane side.
> 
> I am probably missing something. We are not loading the XDP program -
> it is preloaded, the only thing the filter does above is grabbing a
> reference to it. The P4 pipeline in this case is split into a piece
> (the parser) that runs on XDP and some that runs on tc. And as I
> mentioned earlier, we could go further: another piece of the
> pipeline may run in hw. In fact, in the future a compiler will
> be able to generate code that is split across machines. For our s/w
> datapath on the same node, the only split is between tc and XDP.

So it is even worse from a design PoV. The kernel side allows an XDP
program to be passed to cls_p4, but then does nothing with it other
than holding a reference to that BPF program. Iow, you anyway need to
go the regular way of bpf_xdp_link_attach() or dev_change_xdp_fd() to
install the XDP program. Why is the reference even needed here; why
can it not be done in user space from your control plane? This, again,
feels like a shim layer which should live in user space instead.
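To illustrate, a minimal sketch of the existing user space attach paths
I mean (interface name and object/pin paths are made up here):

```shell
# Classic attach path (dev_change_xdp_fd() with IFLA_XDP_* underneath):
ip link set dev eth0 xdp obj parser.o sec parser/xdp

# Or reference an already-loaded, pinned program via bpftool:
bpftool net attach xdp pinned /sys/fs/bpf/parser_prog dev eth0
```

Either way, the control plane stays in user space and no tc kernel code
needs to hold an XDP program reference.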

>>>> The theory of operations is as follows:
>>>>
>>>> ================================1. PARSING================================
>>>>
>>>> The packet first encounters the parser.
>>>> The parser is implemented in ebpf residing either at the TC or XDP
>>>> level. The parsed header values are stored in a shared eBPF map.
>>>> When the parser runs at XDP level, we load it into XDP using tc filter
>>>> command and pin it to a file.
>>>>
>>>> =============================2. ACTIONS=============================
>>>>
>>>> In the above example, the P4 program (minus the parser) is encoded in an
>>>> action($PROGNAME.o). It should be noted that classical tc actions
>>>> continue to work:
>>>> IOW, someone could decide to add a mirred action to mirror all packets
>>>> after or before the ebpf action.
>>>>
>>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
>>>>       prog type tc obj $PARSER.o section parser/tc-ingress \
>>>>       action bpf obj $PROGNAME.o section prog/tc-ingress \
>>>>       action mirred egress mirror index 1 dev $P1 \
>>>>       action bpf obj $ANOTHERPROG.o section mysect/section-1
>>>>
>>>> It should also be noted that it is feasible to split some of the ingress
>>>> datapath into XDP first and more into TC later (as was shown above for
>>>> example where the parser runs at XDP level). YMMV.
>>>
>>> Is there any performance value in partial XDP and partial TC? The main
>>> wins we see in XDP are when we can drop, redirect, etc the packet
>>> entirely in XDP and avoid skb altogether.
>>>
>>>> Co-developed-by: Victor Nogueira <victor@...atatu.com>
>>>> Signed-off-by: Victor Nogueira <victor@...atatu.com>
>>>> Co-developed-by: Pedro Tammela <pctammela@...atatu.com>
>>>> Signed-off-by: Pedro Tammela <pctammela@...atatu.com>
>>>> Signed-off-by: Jamal Hadi Salim <jhs@...atatu.com>
>>
>> The cls_p4 is roughly a copy of {cls,act}_bpf, and from a BPF community side
>> we moved away from this some time ago for the benefit of a better management
>> API for tc BPF programs via bpf(2) through bpf_mprog (see libbpf and BPF selftests
>> around this), as mentioned earlier. Please use this instead for your userspace
>> control plane, otherwise we are repeating the same mistakes from the past again
>> that were already fixed.
> 
> Sorry, that is your use case for kubernetes and not ours. We want to

There is nothing k8s-specific about it; it's generic infrastructure for
tc BPF, also used outside of the k8s scope. Please double-check the
selftests to get a picture of the API and libbpf integration.

> use the tc infra. We want to use netlink. I could be misreading what
> you are saying, but it seems you are suggesting that the tc infra is
> now obsolete as far as eBPF is concerned? Overall, it is a bit selfish
> to say your use case dictates how other people use eBPF. eBPF is just
> a means to an end for us and _is not the end goal_ - just an infra
> toolset.

Not really; the infrastructure is already there and ready to be used,
and it supports basic building blocks such as BPF links, relative
prog/link dependency resolution, etc., none of which can be found
here. The problem is "we want to use netlink", which is why you need
to push down things like the XDP prog, but that is broken by design,
really. You are trying to push a control plane down into netlink which
should have been a framework in user space.
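As a rough sketch of what the bpf_mprog-era attachment model looks like
from user space via libbpf (program name, interface, and pin path below
are hypothetical; error handling trimmed):

```c
/* Hypothetical sketch: link-based tcx (bpf_mprog) ingress attach via
 * libbpf, done entirely from the user space control plane. Names such
 * as "p4_ingress" and the pin path are made up for illustration.
 */
#include <bpf/libbpf.h>
#include <net/if.h>

int attach_ingress(const char *obj_path, const char *ifname)
{
    struct bpf_object *obj = bpf_object__open_file(obj_path, NULL);
    if (!obj || bpf_object__load(obj))
        return -1;

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "p4_ingress");
    int ifindex = if_nametoindex(ifname);
    if (!prog || !ifindex)
        return -1;

    /* Relative prog/link ordering is expressed through the opts
     * (relative_fd / relative_id / expected_revision). */
    LIBBPF_OPTS(bpf_tcx_opts, opts);
    struct bpf_link *link = bpf_program__attach_tcx(prog, ifindex, &opts);

    /* Pin the link so the attachment outlives this process. */
    return link ? bpf_link__pin(link, "/sys/fs/bpf/p4_ingress_link") : -1;
}
```

This is the kind of building block the selftests exercise; none of it
requires going through cls_bpf-style netlink attach.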

> If you feel we should unify the P4 classifier with the tc eBPF
> classifier, then we are going to need some changes that are not
> going to be useful for other people. And I don't see the point in
> that.
> 
> cheers,
> jamal
> 
>> Therefore, from BPF side:
>>
>> Nacked-by: Daniel Borkmann <daniel@...earbox.net>
>>
>> Cheers,
>> Daniel

