netdev - Re: Let's do P4

Message-ID: <20161030180103.GD1686@nanopsycho.orion>
Date:   Sun, 30 Oct 2016 19:01:03 +0100
From:   Jiri Pirko <jiri@...nulli.us>
To:     Jakub Kicinski <kubakici@...pl>
Cc:     Thomas Graf <tgraf@...g.ch>,
        John Fastabend <john.fastabend@...il.com>,
        netdev@...r.kernel.org, davem@...emloft.net, jhs@...atatu.com,
        roopa@...ulusnetworks.com, simon.horman@...ronome.com,
        ast@...nel.org, daniel@...earbox.net, prem@...efootnetworks.com,
        hannes@...essinduktion.org, jbenc@...hat.com, tom@...bertland.com,
        mattyk@...lanox.com, idosch@...lanox.com, eladr@...lanox.com,
        yotamg@...lanox.com, nogahf@...lanox.com, ogerlitz@...lanox.com,
        linville@...driver.com, andy@...yhouse.net, f.fainelli@...il.com,
        dsa@...ulusnetworks.com, vivien.didelot@...oirfairelinux.com,
        andrew@...n.ch, ivecera@...hat.com,
        Maciej Żenczykowski <zenczykowski@...il.com>
Subject: Re: Let's do P4

Sun, Oct 30, 2016 at 06:45:26PM CET, kubakici@...pl wrote:
>On Sun, 30 Oct 2016 17:38:36 +0100, Jiri Pirko wrote:
>> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@...g.ch wrote:
>> >On 10/30/16 at 08:44am, Jiri Pirko wrote:  
>> >> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@...il.com wrote:  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>> >
>> >My assumption was that a new IR is defined which is easier to parse than
>> >eBPF which is targeted at execution on a CPU and not intended for pattern
>> >matching. Just looking at how llvm creates different patterns and reorders
>> >instructions, I'm not seeing how eBPF can serve as a general purpose IR
>> >if the objective is to allow fairly flexible generation of the bytecode.
>> >Hence the alternative IR serving as additional metadata complementing the
>> >eBPF program.  
>> 
>> Agreed.
>
>Just to clarify my intention here was not to suggest the use of eBPF as
>the IR.  I was merely cautioning against bundling the new API with P4,
>for multiple reasons.  As John mentioned P4 spec was evolving in the
>past.  The spec is designed for HW more capable than the switch ASICs we
>have today.  As vendors move to provide more configurability we may need
>to extend the API beyond P4.  We may want to extend this API for SW
>hand-offs (as suggested by Thomas) which are not part of P4 spec.  Also
>John showed examples of matchd software which already uses P4 at the
>frontend today and translates it to different targets (eBPF, u32, HW).
>It may just be about the naming but I feel like calling the new API
>more generically, switch AST or some such may help to avoid unnecessary
>ties and confusion.

Well, that basically means creating "something" that p4 source could be
translated to. Not sure what exactly this "something" should
look like and how different it would be from p4. I thought it might
be good to benefit from the p4 definition and use it directly. Not sure.


>
>> >I understand what you mean with two APIs now. You want a single IR
>> >block and divide the SW/HW part in the kernel rather than let llvm or
>> >something else do it.  
>> 
>> Exactly. The following drawing shows the p4 pipeline setup for SW and HW:
>> 
>>                                  |
>>                                  |               +--> ebpf engine
>>                                  |               |
>>                                  |               |
>>                                  |           compilerB
>>                                  |               ^
>>                                  |               |
>> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>>                                  |
>>                        userspace | kernel
>>                                  |
>>
>> Now please consider runtime API for rule insertion/removal/stats/etc.
>> Also, the single API is cls_p4 here:
>> 
>>                         |
>>                         |            
>>                         |            
>>                         |               
>>                         |            ebpf map fillup
>>                         |               ^
>>                         |               |
>>              p4 rule --TCNL--> cls_p4 --+-> driver -> HW table fillup
>>                         |
>>               userspace | kernel
>>                         
>
>My understanding was that the main purpose of SW eBPF translation would
>be to piggy back on eBPF userspace map API.  This seems not to be the
>case here?  Is "P4 rule" being added via some new API?  From performance

cls_p4 TC classifier.


>perspective the SW AST implementation would probably not be any slower
>than u32, so I don't think we need eBPF for performance.  I must be
>misreading this, if we want eBPF fallback we must extend eBPF with all
>the map types anyway... so we could just use eBPF map API?  I believe
>John has already done some work in this space (see his GitHub :))

I don't think you can use the existing BPF maps kernel API. You would still
have to have another API just for the offloaded datapath. And that is
a bypass. I strongly believe we need a single kernel API for both
SW and HW datapath setup and runtime configuration.
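
A minimal sketch of what such a single entry point for runtime rule
insertion could look like. All names here (struct p4_rule, struct
p4_backend, cls_p4_rule_insert) are hypothetical and do not exist in the
kernel. Inserting into SW first and rolling back if HW refuses is only one
possible policy, picked for illustration and not something agreed on in
this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical rule representation; a real one would carry match keys
 * and action parameters derived from the p4 program. */
struct p4_rule {
	unsigned int table_id;
	unsigned int key;
	unsigned int action;
};

/* Hypothetical backend: either the SW datapath (e.g. eBPF map fillup)
 * or a driver doing HW table fillup. */
struct p4_backend {
	const char *name;
	int (*insert)(struct p4_backend *be, const struct p4_rule *rule);
	int (*remove)(struct p4_backend *be, const struct p4_rule *rule);
	unsigned int nrules;   /* rules currently installed */
	unsigned int capacity; /* maximum number of rules */
};

/* Toy insert: only counts rules and enforces the capacity limit. */
static int table_insert(struct p4_backend *be, const struct p4_rule *rule)
{
	(void)rule;
	if (be->nrules >= be->capacity)
		return -ENOSPC;
	be->nrules++;
	return 0;
}

/* Toy remove: only decrements the rule count. */
static int table_remove(struct p4_backend *be, const struct p4_rule *rule)
{
	(void)rule;
	if (be->nrules == 0)
		return -ENOENT;
	be->nrules--;
	return 0;
}

/* Single API for both datapaths: the rule must be accepted by SW before
 * it is offered to HW. If HW rejects it, the SW insertion is undone so
 * both datapaths stay in sync (illustrative policy, see above). */
int cls_p4_rule_insert(struct p4_backend *sw, struct p4_backend *hw,
		       const struct p4_rule *rule)
{
	int err;

	assert(sw != NULL && rule != NULL);

	err = sw->insert(sw, rule);
	if (err)
		return err;
	if (hw == NULL)
		return 0; /* no offload device: SW only */

	err = hw->insert(hw, rule);
	if (err)
		sw->remove(sw, rule);
	return err;
}
```

The point of the sketch is that userspace only ever talks to
cls_p4_rule_insert(); there is no separate path that reaches the HW
backend without going through the SW one.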


>
>As for AST -> eBPF translator in the kernel, IMHO it could be very
>useful.  Since all the drivers will have to implement translators
>anyway, the eBPF translator may help to build a good shared
>infrastructure.  I mean - it could be a starting place for sharing code
>between drivers if done properly.

Agreed.


>
>> >> Well for hw offload, every driver has to parse the IR (whatever it will
>> >> be in) and program HW accordingly. Similar parsing and translation would
>> >> be needed for SW path, to translate into eBPF. I don't think it would be
>> >> more complex than in the drivers. Should be fine.  
>> >
>> >I'm not sure I see why anyone would ever want to use an IR for SW
>> >purposes which is restricted to the lowest common denominator of HW.
>> >A good example here is OpenFlow and how some of its SW consumers
>> >have evolved with extensions which cannot be mapped to HW easily.
>> >The same seems to happen with P4 as it introduces the concept of
>> >state and other concepts which are hard to map for dumb HW. P4 doesn't
>> >magically solve this problem, the fundamental difference in
>> >capabilities between HW and SW remains.
>> >  
>>  [...]  
>>  [...]  
>>  [...]  
>> >> 
>> >> Yeah, I was also thinking about something similar to your Flow-API,
>> >> but we need something more generic I believe.
>> >>   
>>  [...]  
>> >> 
>> >> Btw, Flow-API was rejected because it was a clean kernel-bypass. In case
>> >> of p4, if we do what Thomas is suggesting, having x.bpf for SW and
>> >> x.p4ast for HW, that would be the very same kernel-bypass. Therefore I
>> >> strongly believe there should be a single kernel API for p4 SW+HW - for
>> >> both p4 program insertion and runtime configuration.  
>> >
>> >I think you misunderstand me. This is not what I'm proposing at all.
>> >In either model, the kernel receives the same IR and can reject.
>> >
>> >The rule is very clear: we can't allow programming anything that the
>> >kernel is not capable of doing in SW, right? That was the key take
>> >away from that discussion.  
>> 
>> 
>> ***
>> Exactly. But if you treat p4ast as a "metadata" of ebpf program destined
>> solely to setup HW, that in my opinion is a bypass. Because the ebpf part
>> and p4ast part could have no relationship with each other. So I see it as
>> 2 independent APIs. One for SW, one for HW. And having this kind of API
>> for hw only is a bypass.
>
>+1
>Adding metadata to eBPF programs usually fails because the verification
>that the metadata is correct in the kernel is usually not much easier
>than generating it in the first place.  And not verifying it opens up a
>way of kernel bypass.