netdev - Re: [PATCH RFC net-next 0/2] net:sched: Introduce tc flower2 classifier based on PANDA parser in kernel

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOuuhY_urma1VwZKX12c5bbbj1wwvGLuAjqqBOVpg3g01b0stA@mail.gmail.com>
Date:   Fri, 24 Sep 2021 09:21:52 -0700
From:   Tom Herbert <tom@...anda.io>
To:     John Fastabend <john.fastabend@...il.com>
Cc:     Jamal Hadi Salim <jhs@...atatu.com>,
        Simon Horman <simon.horman@...igine.com>,
        Jiri Pirko <jiri@...nulli.us>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Felipe Magno de Almeida <felipe@...anda.io>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Boris Sukholitko <boris.sukholitko@...adcom.com>,
        Vadym Kochan <vadym.kochan@...ision.eu>,
        Ilya Lifshits <ilya.lifshits@...adcom.com>,
        Vlad Buslov <vladbu@...dia.com>,
        Ido Schimmel <idosch@...sch.org>, paulb@...dia.com,
        Davide Caratti <dcaratti@...hat.com>,
        Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
        Amritha Nambiar <amritha.nambiar@...el.com>,
        "Samudrala, Sridhar" <sridhar.samudrala@...el.com>,
        Pedro Tammela <pctammela@...atatu.com>,
        Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: [PATCH RFC net-next 0/2] net:sched: Introduce tc flower2
 classifier based on PANDA parser in kernel

On Thu, Sep 23, 2021 at 8:55 PM John Fastabend <john.fastabend@...il.com> wrote:
>
> Jamal Hadi Salim wrote:
> > Geez, I missed all the fun ;->
> >
> > On 2021-09-22 11:25 p.m., John Fastabend wrote:
> > > Tom Herbert wrote:
> > >> On Wed, Sep 22, 2021, 6:29 PM John Fastabend <john.fastabend@...il.com>
> > >> wrote:
> >
> > [..]
> >
> > >> John,
> > >>
> > >> Please look at patch log, there are number of problems that have come up
> > >> flow dissector over the years. Most of this is related to inherent
> > >> inflexibility, limitations, missing support for fairly basic protocols, and
> > >> there's a lot of information loss because of the fixed monolithic data
> > >> structures. I've said it many times: skb_flow_dissect is the function we
> > >> love to hate. Maybe it's arguable, bit I claim it's 2000 lines of spaghetti
> > >> code. I don't think there's anyone to blame for that, this was a
> > >> consequence of evolving very useful feature that isn't really amenable to
> > >> being written in sequence of imperative instructions (if you recall it used
> > >> to be even worse with something like 20 goto's scattered about that defied
> > >> any semblance of logical program flow :-) ).
> > >
> > > OK, but if thats the goal then shouldn't this series target replacing the
> > > flow_dissector code directly? I don't see any edits to ./net/core.
> > >
> >
> > Agreed, replacement of flow dissector should be a focus. Jiri's
> > suggestion of a followup patch which shows how the rest of the consumers
> > of flow dissector could be made to use PANDA is a good idea.
>
> I'de almost propose starting with flow_dissector.c first so we see that the
> ./net/core user for the SW only case looks good. Although I like the idea
> of doing it all in BPF directly so could take a crack at that as well. Then
> compare them.

That's an interesting idea and the intent, but note that one of the
reasons we are able to outperform flow dissectors is that
flow_dissector is parameterized to be generalized whereas PANDA Parser
would provide a customized instance for each use case. This is
especially evident in the bits to configure what data fields extracts,
in PANDA the extraction is explicit so a whole bunch of conditionals
in the datapath are eliminated. Replacing flow dissectors might look
more like creating a parser instance for each caller instead of
calling one function that tries to solve all problems.

>
> >
> > IMO (correct me if i am wrong Tom), flower2 was merely intended to
> > illustrate how one would use PANDA i.e there are already two patches
> > of which the first one is essentially PANDA...
> > IOW,  it is just flower but with flow dissector replaced by PANDA.
> >
> > >>
> > >> The equivalent code in PANDA is far simpler, extensible, and maintainable
> > >> and there are opportunities for context aware optimizations that achieve
> > >> higher performance (we'll post performance numbers showing that shortly).
> > >> It's also portable to different environments both SW and HW.
> > >
> > > If so replace flow_dissector then I think and lets debate that.
> > >
> > > My first question as a flow dissector replacement would be the BPF
> > > flow dissector was intended to solve the generic parsing problem.
> >
> > > Why would Panda be better? My assumption here is that BPF should
> > > solve the generic parsing problem, but as we noted isn't very
> > > friendly to HW offload. So we jumped immediately into HW offload
> > > space. If the problem is tc_flower is not flexible enough
> > > couldn't we make tc_flower use the BPF dissector? That should
> > > still allow tc flower to do its offload above the sw BPF dissector
> > > to hardware just fine.
> > >
> > > I guess my first level question is why did BPF flow dissector
> > > program not solve the SW generic parsing problem. I read the commit
> > > messages and didn't find the answer.
> > >
> >
> > Sorry, you cant replace/flowdissector/BPF such that flower can
> > consume it;-> You are going to face a huge path explosion with the
> > verifier due to the required branching and then resort to all
> > kinds of speacial-cased acrobatics.
> > See some samples of XDP code going from trying to parse basic TCP
> > options to resorting to tricking the verifier.
> > For shits and giggles, as they say in Eastern Canada, try to do
> > IPV6 full parsing with BPF (and handle all the variable length
> > fields).
>
> We parse TLVs already and it works just fine. It requires some
> careful consideration and clang does some dumb things here and
> there, but it is doable. Sure verifier could maybe be improved
> around a few cases and C frontend gets in the way sometimes,
> but PANDA or P4 or other DSL could rewrite in LLVM-IR directly
> to get the correct output.
>
Currently the kernel flow dissector doesn't parse TLVs, for instance
Hop-by-Hop, DestOpts, IP options, and TCP options are just skipped.
TLVs are also the bane of router vendors since they despise
implementing protocols that require serialized processing over
combinatorial collection of elements. We need to get past this since
TLVs are a protocol extensibility which means they need to be a first
class citizen in a programmable parser API. (to be clear I'm not
saying anyone should add hardcordes TLV processing to
__skb_flow_dissect, it's already bloated enough!)

> > Generally:
> > BPF is good for specific smaller parsing tasks; the ebpf flow dissector
> > hook should be trivial to add to PANDA. And despite PANDA being able
> > to generate EBPF - I would still say it depends on the depth of the
> > parse tree to be sensible to use eBPF.
>
> Going to disagree. I'm fairly confident we could write a BPF
> program to do the flow disection. Anyways we can always improve
> the verifier as needed and this helps lots of things not
> just this bit. Also flow dissector will be loaded once at early
> boot most likely so we can allow it to take a bit longer or
> pre-verify it. Just ideas.

Yes, we already have a panda-compiler that converts PANDA-C in
well-optimized eBPF code. Per Jamal's point, that did require breaking
up the program into different tails calls. I believe once we hit four
layers of protocols we do a tail call and also do a tail call for each
instance of TLV processing. Note the
tools/testing/selftests/bpf/progs/bpf_flow.c has to deal with this
also and does this by statically making every L3 protocol into a tail
call (result is more tail calls than equivalent code PANDA).

>
> >
> > Earlier in the thread you said a couple of things that caught my
> > attention:
> >
> >  > I don't think P4 or Panda should be in-kernel. The kernel has a BPF
> >  > parser that can do arbitrary protocol parsing today. I don't see
> >  > a reason to add another thing on the chance a hardware offload
> >  > might come around. Anyways P4/Panda can compile to the BPF parser
> >  > or flower if they want and do their DSL magic on top. And sure
> >  > we might want to improve the clang backends, the existing flower
> >  > classifier, and BPF verifier.
> >  >
> >  >
> >  > Vendors have the ability to code up arbitrary hints today. They just
> >  > haven't open sourced it or made it widely available. I don't see how
> >  > a 'tc' interface would help with this. I suspect most hardware could
> >  > prepend hints or put other arbitrary data in the descriptor or elsewhere.
> >  > The compelling reason to open source it is missing.
> >
> > Please, please _lets not_ encourage vendors to continue
> > keep things proprietary!
>
> Fair enough. Some frustration leaking in from my side knowing
> the hardware has been around for years and we've seen multiple
> proposals but only limited hardware backing. Tom mentioned
> he was working on the hardware angle so perhaps its close.
>
I share that frustration!

> > Statements like "I don't think P4 or Panda should be in-kernel..."
> > are just too strong.
>
> Where I wanted to go with this is P4 and Panda are DSLs in my
> mind. I think we should keep the kernel non-specific to any
> one DSL. We should have a low level generic way to add them
> to the kernel, I think this is BPF. Then we let users pick
> whatever DSL they like and/or make up their own DSL.
>
> Is the counter-argument that Panda is not a DSL, but rather
> a low-level parser builder pattern.

Yes, PANDA is _not_ a DSL in the sense that it uses a new compiler,
tool chain, or skill sets to write a program all of which are required
for using P4. PANDA-C (just to make it look like the analogous CUDA-C
:-) ) is inherently C code that has a program structure for the
"low-level parser builder pattern".

>
> > Instead lets focus on how we can make P4 and other hardware offloads
> > work in conjunction with the kernel (instead of totally bypassing
> > it which is what vendors are doing enmasse already). There are
> > billions of $ invested in these ASICs and lets welcome them into
> > our world. It serves and helps grow the Linux community better.
> > The efforts of switchdev and tc offloading have proven it is possible.
> > Vendors (and i am going to call out Broadcom on the switching side here)
> > are not partaking because they see it as an economical advantage not to
> > partake.
> >
> > We have learnt a lot technically since switchdev/tc offloads happened.
> > So it is doable.
> > The first rule is: In order to get h/w offload to work lets also have
> > digitally equivalent implementation in s/w.
>
> But there is a cost to this its yet another bit of software to
> maintain and review and so on. I'm arguing we already have a generic
> way to implement h/w equivalence and its BPF. So instead of inventing
> another method to do software we can improve the BPF layer. If we need
> to build semantics over it that look consumable to hw so be it.
>
> Also the other core question I still don't understand is how a
> piece of hardware could consume a parse graph piece-meal through
> an interface like proposed in flower2 and generate an arbitrary
> parse graph? On the fly none the less. That feels like some very
> powerful firmware to me.
> And I would prefer open source userspace code (non-kernel) to
> deep magic in firmware. At least then I can see it, patch it,
> fix it, etc.

An instance of a parser is inherently a parse graph, so it follows
that the best representation for a parser is a declarative
representation of the parse graph. There was a watershed paper on this
by Nick McKeown and others in "Design principles for packet parsers".
A common declarative representation is then amenable to instantiation
in a hardware engine which is designed to consume that representation
(i.e. hardware parsers are basically programmable FSMs), and it's
equally straightforward to elicit a parser from a declarative
representation into imperative code for running in CPU. In this
regard, the only salient difference between P4 and PANDA-C is that the
declarative representation can be coded in PANDA as a graph data
structure in standard C, and in P4 the encoding is in an explicit
language construct.

>
> Last thing, I'll point out I got back deep into the hardware debate.
> I'm still not convinced its the right thing to rip out the flow
> dissector piece and replace it with Panda.

Hardware acceleration is one goal, but there are merits in just
considering the software especially in regards to performance and
usability. We'll highlight those benefits in future patch series.

Tom

>
> Thanks!
> John