Message-ID: <4f99e2b6-0f09-9d2c-6300-dfc884d501a8@mellanox.com>
Date: Tue, 24 Sep 2019 11:48:53 +0000
From: Paul Blakey <paulb@...lanox.com>
To: Edward Cree <ecree@...arflare.com>,
Jakub Kicinski <jakub.kicinski@...ronome.com>
CC: Pravin Shelar <pshelar@....org>,
Daniel Borkmann <daniel@...earbox.net>,
Vlad Buslov <vladbu@...lanox.com>,
David Miller <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Jiri Pirko <jiri@...nulli.us>,
Cong Wang <xiyou.wangcong@...il.com>,
Jamal Hadi Salim <jhs@...atatu.com>,
Simon Horman <simon.horman@...ronome.com>,
Or Gerlitz <gerlitz.or@...il.com>
Subject: Re: CONFIG_NET_TC_SKB_EXT
On 9/23/2019 8:17 PM, Edward Cree wrote:
> On 23/09/2019 17:56, Paul Blakey wrote:
>> Even following this approach in tc only is challenging for some
>> scenarios, consider the following tc rules:
>>
>> tc filter add dev1 ... chain 0 flower <match1> action goto chain 1
>> tc filter add dev1 ... chain 0 flower <match2> action goto chain 1
>> tc filter add dev1 ... chain 0 flower <match3> action goto chain 1
>> ..
>> ..
>> tc filter add dev1 ... chain 0 flower ip_dst <match1000> action goto chain 1
>>
>> tc filter add dev1 ... chain 1 flower ip_proto tcp action tunnel_key set dst_ip 8.8.8.8 action mirred egress redirect dev2
> This one is easy, if a packet gets to the end of a chain without matching any rules then it just gets delivered to the uplink (PF), and software TC starts over from the beginning of chain 0. AFAICT this is the natural hardware behaviour for any design of offload, and it's inherently all-or-nothing.
>> You'd also have to keep track of what fields were originally on the packet, and not a match resulting from a modifications,
>> See the following set of rules:
>> tc filter add dev1 ... chain 0 prio 1 flower src_mac <mac1> action pedit munge ex set dst mac <mac2> pipe action goto chain 1
>> tc filter add dev1 ... chain 0 prio 2 flower dst_mac <mac2> action goto chain 1
>> tc filter add dev1 ... chain 1 prio 1 flower dst_mac <mac3> action <action3>
>> tc filter add dev1 ... chain 1 prio 2 flower src_mac <mac1> action <action4>
> This one is slightly harder in that you have to either essentially 'gather up' actions as you go and only 'commit' them at the end of your processing pipeline (in which case you have to deal with the data hazard you cleverly included in your example), or keep a copy of the original packet around. But I don't see the data-hazard case as having any realistic use-cases (does anyone actually put actions other than ct before a goto chain?) and it's easy enough to detect it in SW at rule-insertion time.
The 'miss' for all-or-nothing is easy; the hard part is combining
all the paths a packet can take in software into a single 'all or nothing'
rule in hardware.
And once you do that, any change to any part of a path requires
updating all the resulting rules. In the above case, an update to the
route for 8.8.8.8 affects all 1000 rules before it.
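To make that concrete, a merged all-or-nothing offload of the example above
would look roughly like this cross-product (just a sketch, not real syntax):

<match1>    + ip_proto tcp -> tunnel_key set dst_ip 8.8.8.8, redirect dev2
<match2>    + ip_proto tcp -> tunnel_key set dst_ip 8.8.8.8, redirect dev2
...
<match1000> + ip_proto tcp -> tunnel_key set dst_ip 8.8.8.8, redirect dev2

If the route (and so the encap headers) for 8.8.8.8 changes, every one of
these merged rules has to be rewritten.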
I can see two ways to do the all-or-nothing approach. One is to
statically analyze the software policy (tc rules in this case) and
carefully offload all possible paths. For that, you (who? the driver? tc?
OvS?) would need to construct a graph, and even if you could do that,
there isn't any infrastructure for it, and it would only be valid while
the configuration stays the same, since any change to an edge of the graph
requires re-analyzing it and possibly changing tens of thousands of rules.
With connection tracking, maybe hundreds of thousands...
The second approach, which avoids the expensive analysis and the unused
hardware rules above, is to trace the path the packet actually takes
through processing (gather and commit), save it somewhere on the SKB or
per CPU, and create new rules from it - basically a hardware cache.
But again, any change to the policy invalidates large chunks of this cache.
Regarding the use-case: for OvS, as far as I know, it uses
recirculation for the ct and dp_hash actions, since those need kernel
help.
This extension was first introduced for ovs->tc offloads of connection
tracking, but as we said, we are going to re-use it for tc->hardware
offloads in the same manner, and that covers many use-cases.
The above is a valid use-case if the controller isn't OvS:
the controller learns flows and inserts the first 1000 rules matching on
some 5-tuple in chain 0, and then, when you want to encap the packets,
you jump to the encap chain.
This lets you update the encap action for all rules in one place, and
segment the logic of your policy.
tc chains are commonly used exactly this way, to segment policy logic.
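For illustration, a rough sketch of such a layout (the addresses and
tunnel id here are made up), with one chain 0 rule per learned 5-tuple
and a single encap rule in chain 1:

tc filter add dev1 ... chain 0 flower ip_proto tcp dst_ip 10.0.0.1 dst_port 80 action goto chain 1
tc filter add dev1 ... chain 0 flower ip_proto tcp dst_ip 10.0.0.2 dst_port 80 action goto chain 1
...
tc filter add dev1 ... chain 1 flower action tunnel_key set dst_ip 8.8.8.8 id 42 action mirred egress redirect dev2

Replacing just the chain 1 rule updates the encap for every flow above it.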
> FWIW I think the only way this infra will ever be used is: chain 0 rules match <blah> action ct zone <n> goto chain <m>; chain <m> rules match ct_state +xyz <blah> action <actually do stuff to packet>. This can be supported with a hardware pipeline with three tables in series in a fairly natural way — and since all the actions that modify the packet (rather than just directing subsequent matching) are at the end, the natural miss behaviour is deliver-unmodified.
>
> I tried to explain this back on your [v3] net: openvswitch: Set OvS recirc_id from tc chain index, but you seemed set on your approach so I didn't persist. Now it looks like maybe I should have...
What you describe with the three tables (chain 0, chain X, and a ct table
per ct zone) follows the software model, requires continuing from some
miss point, and is what we plan on doing.
We offload something like:

chain 0: dst_mac aa:bb:cc:dd:ee:ff ct_state -trk action ct (goto the special CT table), action goto chain X
chain X: ct_state +trk+est action forward to port

And a table that replicates the ct zone table, which has something like:

<match on tuple 1....> set metadata ct_state = +est, action continue
<match on tuple 2....> set metadata ct_state = +est, action continue
...
Lots of tuples...
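(For reference, the software tc rules this layout mirrors would be roughly
the following - just a sketch, the mac, zone and device names are
placeholders:

tc filter add dev1 ... chain 0 flower dst_mac aa:bb:cc:dd:ee:ff ct_state -trk action ct zone 1 pipe action goto chain 1
tc filter add dev1 ... chain 1 flower ct_state +trk+est action mirred egress redirect dev2)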
What if you 'miss' on the match for the tuple? You have already done some
processing in hardware, so either you revert it, or you continue in
software where you left off (at the ct action).
We want to preserve the current software model: just as tc can do part
of the processing and then continue to the rest of the pipeline, be
it OvS, a bridge, or loopback, hardware would do the same.
The all-or-nothing approach would require changing the software model to
allow merging the ct zone table matches into the hardware rules,
offloading something like:

dst_mac aa:bb:cc:dd:ee:ff <match on tuple 1> fwd port
dst_mac aa:bb:cc:dd:ee:ff <match on tuple 2> fwd port
...
Lots of 'merged' rules.

Delete the rule with the ct action, and you have to delete all these
merged rules instead of just one.
Either tracing the packet or merging rules would require new
infrastructure to support it.