Message-ID: <20161030163836.GC1686@nanopsycho.orion>
Date: Sun, 30 Oct 2016 17:38:36 +0100
From: Jiri Pirko <jiri@...nulli.us>
To: Thomas Graf <tgraf@...g.ch>
Cc: John Fastabend <john.fastabend@...il.com>,
Jakub Kicinski <kubakici@...pl>, netdev@...r.kernel.org,
davem@...emloft.net, jhs@...atatu.com, roopa@...ulusnetworks.com,
simon.horman@...ronome.com, ast@...nel.org, daniel@...earbox.net,
prem@...efootnetworks.com, hannes@...essinduktion.org,
jbenc@...hat.com, tom@...bertland.com, mattyk@...lanox.com,
idosch@...lanox.com, eladr@...lanox.com, yotamg@...lanox.com,
nogahf@...lanox.com, ogerlitz@...lanox.com, linville@...driver.com,
andy@...yhouse.net, f.fainelli@...il.com, dsa@...ulusnetworks.com,
vivien.didelot@...oirfairelinux.com, andrew@...n.ch,
ivecera@...hat.com,
Maciej Żenczykowski <zenczykowski@...il.com>
Subject: Re: Let's do P4
Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@...g.ch wrote:
>On 10/30/16 at 08:44am, Jiri Pirko wrote:
>> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@...il.com wrote:
>> >On 16-10-29 07:49 AM, Jakub Kicinski wrote:
>> >> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>> >>> Hi all.
>> >>>
>> >>> The network world is divided into 2 general types of hw:
>> >>> 1) network ASICs - network specific silicon, containing things like TCAM
>> >>> These ASICs are suitable to be programmed by P4.
>> >>> 2) network processors - basically general purpose CPUs
>> >>> These processors are suitable to be programmed by eBPF.
>> >>>
>> >>> I believe that by now, most people have come to the conclusion that it is
>> >>> very difficult to handle both types by either P4 or eBPF. And since
>> >>> eBPF is part of the kernel, I would like to introduce P4 into kernel
>> >>> as well. Here's a plan:
>> >>>
>> >>> 1) Define P4 intermediate representation
>> >>> I cannot imagine loading a P4 program (c-like syntax text file) into
>> >>> kernel as is. That means that as the first step, we need to find some
>> >>> intermediate representation. I can imagine something in the form of an AST,
>> >>> call it "p4ast". I don't really know how to do this exactly though,
>> >>> it's just an idea.
>> >>>
>> >>> In the end there would be a userspace precompiler for this:
>> >>> $ makep4ast example.p4 example.ast
>> >>
>> >> Maybe stating the obvious, but IMHO defining the IR is the hardest part.
>> >> eBPF *is* the IR, we can compile C, P4 or even JIT Lua to eBPF. The
>> >> AST/IR for switch pipelines should allow for similar flexibility.
>> >> Looser coupling would also protect us from changes in spec of the high
>> >> level language.
>
>My assumption was that a new IR is defined which is easier to parse than
>eBPF which is targeted at execution on a CPU and not intended for pattern
>matching. Just looking at how llvm creates different patterns and reorders
>instructions, I'm not seeing how eBPF can serve as a general purpose IR
>if the objective is to allow fairly flexible generation of the bytecode.
>Hence the alternative IR serving as additional metadata complementing the
>eBPF program.
Agreed.
[...]
>> >... And merging threads here with Jiri's email ...
>> >
>> >> If you do p4>ebpf in userspace, you have 2 apis:
>> >> 1) to setup sw (in-kernel) p4 datapath, you push bpf.o to kernel
>> >> 2) to setup hw p4 datapath, you push program.p4ast to kernel
>> >>
>> >> Those are 2 apis. Both wrapped up by TC, but still 2 apis.
>> >>
>> >> What I believe is correct is to have one api:
>> >> 1) to setup sw (in-kernel) p4 datapath, you push program.p4ast to kernel
>> >> 2) to setup hw p4 datapath, you push program.p4ast to kernel
>
>I understand what you mean with two APIs now. You want a single IR
>block and divide the SW/HW part in the kernel rather than let llvm or
>something else do it.
Exactly. The following drawing shows the p4 pipeline setup for SW and HW:
                                 |
                                 |               +--> ebpf engine
                                 |               |
                                 |               |
                                 |           compilerB
                                 |               ^
                                 |               |
p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
                                 |
                       userspace | kernel
                                 |
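To make the drawing a bit more concrete, here is a toy Python model of the
single entry point. It is only a sketch: ClsP4, compiler_b, ToyDriver and the
dict-based "p4ast" are hypothetical names made up for illustration, not an
existing kernel API, and the real thing would of course be kernel C.

```python
# Toy model of the drawing: one entry point (cls_p4) receives the p4ast and
# fans it out to the SW path (compilerB -> eBPF) and the HW path (driver).
# Every name here is hypothetical; none of this is an existing kernel API.

def compiler_b(p4ast):
    """Stand-in for the in-kernel p4ast -> eBPF translator."""
    return ["ebpf:" + table for table in p4ast["tables"]]


class ToyDriver:
    """Stand-in for a driver with its own compilerC and a limited feature set."""

    def __init__(self, supported_tables):
        self.supported_tables = set(supported_tables)
        self.hw_program = None

    def offload(self, p4ast):
        # compilerC: refuse the whole program if any table is unsupported.
        if not set(p4ast["tables"]) <= self.supported_tables:
            return False
        self.hw_program = ["hw:" + table for table in p4ast["tables"]]
        return True


class ClsP4:
    """The single API: the same p4ast goes in for both SW and HW."""

    def __init__(self, driver):
        self.driver = driver
        self.sw_program = None

    def load(self, p4ast):
        self.sw_program = compiler_b(p4ast)   # SW datapath
        return self.driver.offload(p4ast)     # HW datapath, may be refused


p4ast = {"tables": ["l2_fdb", "acl"]}
cls = ClsP4(ToyDriver(supported_tables=["l2_fdb", "acl"]))
loaded_in_hw = cls.load(p4ast)
```

The point of the sketch is that userspace only ever hands over one object,
and both datapaths are derived from it inside the kernel.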
Now please consider the runtime API for rule insertion/removal/stats/etc.
Also, the single API here is cls_p4:
            |
            |
            |
            |
            |       ebpf map fillup
            |              ^
            |              |
p4 rule --TCNL--> cls_p4 --+-> driver -> HW table fillup
            |
  userspace | kernel
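A minimal sketch of what this means for rule handling, again as a toy Python
model. ClsP4Tables, rule_add and rule_del are hypothetical names, and the two
dicts merely stand in for an eBPF map and a HW table.

```python
# Toy model of runtime rule insertion through one API. Hypothetical names;
# the dicts stand in for an eBPF map (SW) and a driver-managed table (HW).

class ClsP4Tables:
    def __init__(self):
        self.ebpf_map = {}   # SW side: backing store of the table in eBPF
        self.hw_table = {}   # HW side: what the driver would program

    def rule_add(self, table, key, action):
        # One call from userspace updates both datapaths, so the SW and HW
        # tables cannot drift apart.
        self.ebpf_map[(table, key)] = action
        self.hw_table[(table, key)] = action

    def rule_del(self, table, key):
        self.ebpf_map.pop((table, key), None)
        self.hw_table.pop((table, key), None)


tables = ClsP4Tables()
tables.rule_add("l2_fdb", "00:11:22:33:44:55", "fwd port 3")
```

With two separate APIs, the caller would have to issue the map update and the
HW table update itself and keep them consistent.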
>
>> >Couple comments around this, first adding yet another IR in the kernel
>> >and another JIT engine to map that IR on to eBPF or hardware vendor X
>> >doesn't get me excited. It's really much easier to write these as backend
>> >objects in LLVM. Not saying it can't be done just saying it is easier
>> >in LLVM. Also we already have the LLVM code for P4 to LLVM-IR to eBPF.
>> >In the end this would be a reasonably complex bit of code in
>> >the kernel only for hardware offload. I have doubts that folks would
>> >ever use it for software only cases. I'm happy to admit I'm wrong here
>> >though.
>>
>> Well for hw offload, every driver has to parse the IR (whatever it will
>> be in) and program HW accordingly. Similar parsing and translation would
>> be needed for SW path, to translate into eBPF. I don't think it would be
>> more complex than in the drivers. Should be fine.
>
>I'm not sure I see why anyone would ever want to use an IR for SW
>purposes which is restricted to the lowest common denominator of HW.
>A good example here is OpenFlow and how some of its SW consumers
>have evolved with extensions which cannot be mapped to HW easily.
>The same seems to happen with P4 as it introduces the concept of
>state and other concepts which are hard to map for dumb HW. P4 doesn't
>magically solve this problem, the fundamental difference in
>capabilities between HW and SW remains.
>
>> >So yes using llvm backends creates two paths a hardware mgmt and sw
>> >path but in the hardware + software case typical on the edge the
>> >orchestration and management planes have started to manage the hardware
>> >and software as two blocks of logic for performance SLA logic. Even on
>> >the edge it seems in most cases folks are selling SR-IOV ports and
>> >can't fall back to software and charge for the port. But this is just
>> >one use case; I suspect there are others where it does make sense.
>> >
>> >> In case of 1), the program.p4ast will be either interpreted by new p4
>> >> interpreter, or translated to bpf and interpreted by that. But this
>> >> translation code is part of kernel.
>> >
>> >Finally a couple historic bits. The Flow-API proposed in Ottawa was
>> >mechanically generated from an original P4 draft. At the time I was
>> >working fairly closely with both the hardware and compiler folks. If
>> >there is interest we could use that as a base IR for hardware. It has
>> >a simple mapping to/from the original P4 spec. The newer P4 specs are
>> >significantly more complex by the way.
>>
>> Yeah, I was also thinking about something similar to your Flow-API,
>> but we need something more generic I believe.
>>
>> >We also have an emulated path also auto-generated from compiler tools
>> >that creates eBPF code from the IR so this would give you the software
>> >fall-back.
>>
>> Btw, Flow-API was rejected because it was a clean kernel-bypass. In case
>> of p4, if we do what Thomas is suggesting, having x.bpf for SW and
>> x.p4ast for HW, that would be the very same kernel-bypass. Therefore I
>> strongly believe there should be a single kernel API for p4 SW+HW - for
>> both p4 program insertion and runtime configuration.
>
>I think you misunderstand me. This is not what I'm proposing at all.
>In either model, the kernel receives the same IR and can reject.
>
>The rule is very clear: we can't allow to program anything that the
>kernel is not capable of doing in SW, right? That was the key take
>away from that discussion.
***
Exactly. But if you treat p4ast as "metadata" of an ebpf program destined
solely to set up HW, that in my opinion is a bypass, because the ebpf part
and the p4ast part could have no relationship with each other. So I see it as
2 independent APIs: one for SW, one for HW. And having this kind of API
for HW only is a bypass.
Plus, the thing I cannot imagine in the model you propose is table fillup.
For ebpf, you use maps. For p4 you would have to have a separate HW-only
API. This is very similar to John's original Flow-API, and therefore
a kernel bypass.
>
>Let's assume we do cls_p4ast HW+SW with an eBPF translator for SW. As a
>user of this, as my program becomes more complex I will hit the wall of
>HW capabilities at some point and either the IR is not expressive
>enough or the driver will reject the program.
That can certainly happen, no matter what model we choose.
>
>I have at least three options now:
>
> 1) I move everything to SW and forget about HW programmability
>
> 2) I let HW bail out when HW can't support it and start parsing from
> scratch in SW. I don't really care how much of it has been done
> in HW, it's a best effort optimization. A use case for this might
> be dropping of packets. This is easy to do with flow based
> offloads as it can be best effort but already difficult when
> programming based on an IR.
>
> 3) I let HW bail out but carry some metadata trying to preserve
> some of the work done. A use case for this would be a router type
> of work where the lookup itself can be expensive and the majority
> of actions taken on packets are simple forwards but a subset of
> actions performed are too complex for HW. You still want to
> preserve the savings from the expensive lookup already performed.
>
You still have the choice to do this:
  use cls_bpf SKIP_HW for SW processing
  use cls_p4 SKIP_SW for HW processing
That gives you the flexibility to program the pipelines separately. If a
driver is not capable of handling the selected p4 program, it will refuse to
program the HW. Then it is up to the user to change the program according
to the HW features.
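A rough sketch of the flag semantics I have in mind, as a toy Python model.
The flag names are modelled on the existing TCA_CLS_FLAGS_SKIP_HW and
TCA_CLS_FLAGS_SKIP_SW; load_program and hw_accepts are hypothetical, and the
"no flags means best-effort HW" behaviour is an assumption based on how
classifiers like cls_flower behave today.

```python
# Toy model of SKIP_SW / SKIP_HW handling. The values are illustrative,
# modelled on TCA_CLS_FLAGS_SKIP_HW / TCA_CLS_FLAGS_SKIP_SW.
SKIP_HW = 1 << 0
SKIP_SW = 1 << 1


def load_program(flags, hw_accepts):
    """Return the set of datapaths the program ends up in.

    hw_accepts: whether the (hypothetical) driver can handle the program.
    """
    if flags & SKIP_HW and flags & SKIP_SW:
        raise ValueError("SKIP_HW and SKIP_SW are mutually exclusive")

    placed = set()
    if not flags & SKIP_SW:
        placed.add("sw")
    if not flags & SKIP_HW:
        if hw_accepts:
            placed.add("hw")
        elif flags & SKIP_SW:
            # HW-only was requested and the driver refused: hard failure,
            # the user has to adapt the program to the HW features.
            raise RuntimeError("driver refused the program")
    return placed
```

For example, load_program(SKIP_SW, hw_accepts=False) raises, while
load_program(0, hw_accepts=False) quietly ends up in SW only.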
>Even for the simpler 2) I can't just put everything I need into my
>p4ast program because the program will either load in its entirety or
>it won't.
>
>What I would likely end up doing is to write any number of subsets of
>my program which only contain various levels of pieces that are very
>likely to load on my target HW. I then load my full program with tc
>and want a notification if it also loaded into HW. If HW failed,
>then I want tc to load subset programs with SKIP_SW starting from the
>one with most complexity. I need SKIP_SW because I already have the
>full program loaded and I don't want to go through both the partial
>and full program in case of a HW bail out. Is your proposal to not
>allow for the SKIP_SW?
Definitely not. The user should be able to pass SKIP_SW and SKIP_HW, as is
already possible for cls_u32, cls_flower and others.
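The fallback you describe could then live in userspace, roughly like this toy
Python sketch. offload_best_subset, try_hw and the "complexity" field are
hypothetical; try_hw stands in for loading a program with SKIP_SW and
checking whether the driver accepted it.

```python
# Toy sketch of the userspace fallback: the full program is assumed to be
# loaded in SW already; we then look for the most complex program the HW
# accepts. All names are hypothetical.

def offload_best_subset(full_program, subsets, try_hw):
    """Return the program that ended up in HW, or None if none was accepted."""
    if try_hw(full_program):
        return full_program
    # Try the partial programs, most complex first.
    by_complexity = sorted(subsets, key=lambda p: p["complexity"], reverse=True)
    for candidate in by_complexity:
        if try_hw(candidate):
            return candidate
    return None


def toy_driver(program):
    # Pretend the HW can only handle programs with at most two tables.
    return len(program["tables"]) <= 2


full = {"name": "full", "tables": ["parse", "acl", "route", "nat"],
        "complexity": 4}
subsets = [
    {"name": "drop_only", "tables": ["parse"], "complexity": 1},
    {"name": "acl_route", "tables": ["parse", "acl", "route"], "complexity": 3},
    {"name": "acl", "tables": ["parse", "acl"], "complexity": 2},
]
chosen = offload_best_subset(full, subsets, toy_driver)
```

Here the toy driver refuses "full" and "acl_route" and accepts "acl", so that
is what would run in HW while the full program stays in SW.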
>
>A more evolved form of this would be to expose capabilities of the HW
>and have the program writer include logic which results in the split
>based on the capabilities of the hardware.
I wonder why p4 does not handle the HW capabilities. At least I did
not find it. It would certainly be nice to have.
>
>If I understand you correctly, you propose to make this split
>automatically in the kernel somehow.
>
>In either model the kernel receives the same new IR which it can
>reject. No difference. None of the models allow more or less.
>
>In either model, the program can be loaded with SKIP_SW to load a valid
>program into HW only.
>
>In either model, an eBPF program can be loaded at cls_bpf, or a new IR
>can be loaded with SKIP_HW to do SW only.
>
>The only difference I see between the models is whether to include a
>new IR => eBPF compiler in the kernel or not which is going to be
>optional anyway.
The main difference I see is the single API/kernel bypass problem
I described earlier in this email (***)