netdev - Re: CONFIG_NET_TC_SKB

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <68c6668c-f316-2ceb-31b0-8197d22990ae@mellanox.com>
Date:   Mon, 23 Sep 2019 16:56:40 +0000
From:   Paul Blakey <paulb@...lanox.com>
To:     Jakub Kicinski <jakub.kicinski@...ronome.com>
CC:     Pravin Shelar <pshelar@....org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Vlad Buslov <vladbu@...lanox.com>,
        David Miller <davem@...emloft.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Jiri Pirko <jiri@...nulli.us>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Simon Horman <simon.horman@...ronome.com>,
        Or Gerlitz <gerlitz.or@...il.com>
Subject: Re: CONFIG_NET_TC_SKB_EXT


On 9/23/2019 12:47 AM, Jakub Kicinski wrote:
> On Sun, 22 Sep 2019 14:51:44 +0300, Paul Blakey wrote:
>> The skb extension is currently used for miss path of software offloading OvS rules with recirculation to tc.
>> However, we are also preparing patches to support the hardware side of things.
>>
>> After the userspace OvS patches to support connection tracking, we'll have two users for
>> tc multi chain rules, one from those actually using tc and those translated from OvS rules.
>>
>> With both of these use cases, there is similar issue of 'miss', from hardware to tc and tc to OvS and the skb
>> extension will serve to recover from both.
>>
>> Consider, for example, the following multi chain tc rules:
>>
>> tc filter add dev1 ... chain 0 flower dst_mac aa:bb:cc:dd:ee:ff action pedit munge ex set mac dst 02:a0:98:4e:8f:d1 pipe action goto chain 1
>> tc filter add dev1 ... chain 1 flower ip_dst 1.1.1.1 action mirred egress redirect dev2
>> tc filter add dev1 ... chain 1 flower ip_dst 2.2.2.2 action mirred egress redirect dev2
>>
>> It's possible we offload the first two rules, but fail to offload the third rule, because of some hardware failure (e.g unsupported match or action).
>> If a packet with (dst_mac=aa:bb:cc:dd:ee:ff) and (dst ip=2.2.2.2) arrives,
>> we execute the goto chain 1 in hardware (and the pedit), and continue in chain 1, where we miss.
>>
>> Currently we re-start the processing in tc from chain 0, even though we already did part of the processing in hardware.
>> The match on the dst_mac will fail as we already modified it, and we won't execute the third rule action.
>> In addition, if we did manage to execute any software tc rules, the packet will be counted twice.
>>
>> We'll re-use this extension to solve this issue that currently exists by letting drivers tell tc on which chain to start the classification.
>>
>> Regarding the config, we suggest changing the default to N and letting users decide to enable it, see attached patch.
> Partial offloads are very hard. Could we possibly take a page out of
> routing offload's book and do a all or nothing offload in presence of
> multiple chains?

Hi,

Routing is a narrow scenario which this approach might work, but the 
other way of offloading everything or nothing has it's disadvantages too.

Even following this approach in tc only is challenging for some 
scenarios, consider the following tc rules:

tc filter add dev1 ... chain 0 flower <match1> action goto chain 1
tc filter add dev1 ... chain 0 flower <match2> action goto chain 1
tc filter add dev1 ... chain 0 flower <match3> action goto chain 1
..
..
tc filter add dev1 ... chain 0 flower ip_dst <match1000> action goto chain 1

tc filter add dev1 ... chain 1 flower ip_proto tcp action tunnel_key set dst_ip 8.8.8.8 action mirred egress redirect dev2


Now, you could offload 1000 'merged' rules that do the equivalent of:
tc filter add dev1 ... chain 0 flower <match1> ip_proto tcp action tunnel_key set dst_ip 8.8.8.8 action mirred egress redirect dev2
tc filter add dev1 ... chain 0 flower <match2> ip_proto tcp action tunnel_key set dst_ip 8.8.8.8 action mirred egress redirect dev2
...
tc filter add dev1 ... chain 0 flower <match1000> ip_proto tcp action tunnel_key set dst_ip 8.8.8.8 action mirred egress redirect dev2

And each of this hardware rule will update the correct software action counter, so the first hardware rule updates both the chain 1 rule counter and
chain 0 match1 rule counter.
This will work, but now you take dependencies on each of the children rules, so if you delete the chain 1 rule, you need
to invalidate all the 1000 hardware rules. And what about route updates or neigh updates for the encap header dst_ip 8.8.8.8?
Any change to that, and you update all the 1000 rules with the new mac address (hardware encapsulation needs to set outer mac address not supplied but tunnel key).
What if we had a depth of 10 chains?


You'd also have to keep track of what fields were originally on the packet, and not a match resulting from a modifications,
See the following set of rules:
tc filter add dev1 ... chain 0 prio 1 flower src_mac <mac1> action pedit munge ex set dst mac <mac2> pipe action goto chain 1
tc filter add dev1 ... chain 0 prio 2 flower dst_mac <mac2> action goto chain 1
tc filter add dev1 ... chain 1 prio 1 flower dst_mac <mac3> action <action3>
tc filter add dev1 ... chain 1 prio 2 flower src_mac <mac1> action <action4>


How would you offload that at the time of adding this rules, and not tracing the actual packet route?
can you tell that the third rule won't be hit? Will you flatten all possible routes and offload that?

Now take what other datapath (ovs, bridges, netfilter) blocks besides tc can do, or actions like ct that need software backing (unless you merge the ct table) and add them to the mix.
A single point of unsupported match/action between those, and we won't be able to offload the entire path, we might
even spend a lot of resources (trying to merge) before we even know it can't be offloaded, instead of doing what we can support.


>> ------------------------------------------------------------
>>
>> Subject: [PATCH net-next] net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
>>
>> This a new feature, it is prefered that it defaults to N.
>>
>> Fixes: 95a7233c452a ('net: openvswitch: Set OvS recirc_id from tc chain index')
>> Signed-off-by: Paul Blakey<paulb@...lanox.com>
>> ---
>>   net/sched/Kconfig | 1 -
>>   1 file changed, 1 deletion(-)
>>
>> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
>> index b3faafe..4bb10b7 100644
>> --- a/net/sched/Kconfig
>> +++ b/net/sched/Kconfig
>> @@ -966,7 +966,6 @@ config NET_IFE_SKBTCINDEX
>>   config NET_TC_SKB_EXT
>>   	bool "TC recirculation support"
>>   	depends on NET_CLS_ACT
>> -	default y if NET_CLS_ACT
>>   	select SKB_EXTENSIONS
>>   
>>   	help
>   - Linus suggested we hide this option from user and autoselect if
>     possible.
>   - As Daniel said all distros will enable this.
>   - If correctness is really our concern, giving users an option to
>     select whether they want something to behave correctly seems like
>     a hack of lowest kind. Perhaps it's time to add a CONFIG for TC
>     offload in general? That's a meaningful set of functionality.

As I understand, Linus prefers this new feature to be default N and 
that's the what the attached patch.

The changes to the OVS datapath (static key and DP_SET command) with the 
userspace probe will ensure the correctness when the extension is disabled,

patch that I've yet to send is here:

https://www.spinics.net/lists/netdev/msg594175.html

This was going to be a part OvS userspace connection tracking offload 
patches to OvS userspace.

Paul.