netdev - Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ac24fe71-aca6-ea6c-bbbb-a54a58a52bd1@ovn.org>
Date:   Thu, 9 Feb 2023 13:07:30 +0100
From:   Ilya Maximets <i.maximets@....org>
To:     Marcelo Leitner <mleitner@...hat.com>
Cc:     i.maximets@....org, Paul Blakey <paulb@...dia.com>,
        netdev@...r.kernel.org, Saeed Mahameed <saeedm@...dia.com>,
        Paolo Abeni <pabeni@...hat.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        "David S. Miller" <davem@...emloft.net>,
        Oz Shlomo <ozsh@...dia.com>, Jiri Pirko <jiri@...dia.com>,
        Roi Dayan <roid@...dia.com>, Vlad Buslov <vladbu@...dia.com>
Subject: Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss
 to tc action

On 2/9/23 02:09, Marcelo Leitner wrote:
> On Thu, Feb 09, 2023 at 01:09:21AM +0100, Ilya Maximets wrote:
>> On 2/8/23 19:01, Marcelo Leitner wrote:
>>> On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
>>>>
>>>>
>>>> On 07/02/2023 07:03, Marcelo Leitner wrote:
>>>>> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
>>>>>> On 2/6/23 18:14, Paul Blakey wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 06/02/2023 14:34, Ilya Maximets wrote:
>>>>>>>> On 2/5/23 16:49, Paul Blakey wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> This series adds support for hardware miss to instruct tc to continue execution
>>>>>>>>> in a specific tc action instance on a filter's action list. The mlx5 driver patch
>>>>>>>>> (besides the refactors) shows its usage instead of using just chain restore.
>>>>>>>>>
>>>>>>>>> Currently a filter's action list must be executed all together or
>>>>>>>>> not at all as driver are only able to tell tc to continue executing from a
>>>>>>>>> specific tc chain, and not a specific filter/action.
>>>>>>>>>
>>>>>>>>> This is troublesome with regards to action CT, where new connections should
>>>>>>>>> be sent to software (via tc chain restore), and established connections can
>>>>>>>>> be handled in hardware.
>>>>>>>>>
>>>>>>>>> Checking for new connections is done when executing the ct action in hardware
>>>>>>>>> (by checking the packet's tuple against known established tuples).
>>>>>>>>> But if there is a packet modification (pedit) action before action CT and the
>>>>>>>>> checked tuple is a new connection, hardware will need to revert the previous
>>>>>>>>> packet modifications before sending it back to software so it can
>>>>>>>>> re-match the same tc filter in software and re-execute its CT action.
>>>>>>>>>
>>>>>>>>> The following is an example configuration of stateless nat
>>>>>>>>> on mlx5 driver that isn't supported before this patchet:
>>>>>>>>>
>>>>>>>>>    #Setup corrosponding mlx5 VFs in namespaces
>>>>>>>>>    $ ip netns add ns0
>>>>>>>>>    $ ip netns add ns1
>>>>>>>>>    $ ip link set dev enp8s0f0v0 netns ns0
>>>>>>>>>    $ ip netns exec ns0 ifconfig enp8s0f0v0 1.1.1.1/24 up
>>>>>>>>>    $ ip link set dev enp8s0f0v1 netns ns1
>>>>>>>>>    $ ip netns exec ns1 ifconfig enp8s0f0v1 1.1.1.2/24 up
>>>>>>>>>
>>>>>>>>>    #Setup tc arp and ct rules on mxl5 VF representors
>>>>>>>>>    $ tc qdisc add dev enp8s0f0_0 ingress
>>>>>>>>>    $ tc qdisc add dev enp8s0f0_1 ingress
>>>>>>>>>    $ ifconfig enp8s0f0_0 up
>>>>>>>>>    $ ifconfig enp8s0f0_1 up
>>>>>>>>>
>>>>>>>>>    #Original side
>>>>>>>>>    $ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
>>>>>>>>>       ct_state -trk ip_proto tcp dst_port 8888 \
>>>>>>>>>         action pedit ex munge tcp dport set 5001 pipe \
>>>>>>>>>         action csum ip tcp pipe \
>>>>>>>>>         action ct pipe \
>>>>>>>>>         action goto chain 1
>>>>>>>>>    $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
>>>>>>>>>       ct_state +trk+est \
>>>>>>>>>         action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>    $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
>>>>>>>>>       ct_state +trk+new \
>>>>>>>>>         action ct commit pipe \
>>>>>>>>>         action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>    $ tc filter add dev enp8s0f0_0 ingress chain 0 proto arp flower \
>>>>>>>>>         action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>
>>>>>>>>>    #Reply side
>>>>>>>>>    $ tc filter add dev enp8s0f0_1 ingress chain 0 proto arp flower \
>>>>>>>>>         action mirred egress redirect dev enp8s0f0_0
>>>>>>>>>    $ tc filter add dev enp8s0f0_1 ingress chain 0 proto ip flower \
>>>>>>>>>       ct_state -trk ip_proto tcp \
>>>>>>>>>         action ct pipe \
>>>>>>>>>         action pedit ex munge tcp sport set 8888 pipe \
>>>>>>>>>         action csum ip tcp pipe \
>>>>>>>>>         action mirred egress redirect dev enp8s0f0_0
>>>>>>>>>
>>>>>>>>>    #Run traffic
>>>>>>>>>    $ ip netns exec ns1 iperf -s -p 5001&
>>>>>>>>>    $ sleep 2 #wait for iperf to fully open
>>>>>>>>>    $ ip netns exec ns0 iperf -c 1.1.1.2 -p 8888
>>>>>>>>>
>>>>>>>>>    #dump tc filter stats on enp8s0f0_0 chain 0 rule and see hardware packets:
>>>>>>>>>    $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep "hardware.*pkt"
>>>>>>>>>           Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>           Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>           Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>
>>>>>>>>> A new connection executing the first filter in hardware will first rewrite
>>>>>>>>> the dst port to the new port, and then the ct action is executed,
>>>>>>>>> because this is a new connection, hardware will need to be send this back
>>>>>>>>> to software, on chain 0, to execute the first filter again in software.
>>>>>>>>> The dst port needs to be reverted otherwise it won't re-match the old
>>>>>>>>> dst port in the first filter. Because of that, currently mlx5 driver will
>>>>>>>>> reject offloading the above action ct rule.
>>>>>>>>>
>>>>>>>>> This series adds supports partial offload of a filter's action list,
>>>>>>>>> and letting tc software continue processing in the specific action instance
>>>>>>>>> where hardware left off (in the above case after the "action pedit ex munge tcp
>>>>>>>>> dport... of the first rule") allowing support for scenarios such as the above.
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi, Paul.  Not sure if this was discussed before, but don't we also need
>>>>>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>>>>>>>
>>>>>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>>>>>>>> flags are not per-action.  This may cause confusion among users, if flows
>>>>>>>> are reported as in_hw, while they are actually partially or even mostly
>>>>>>>> processed in SW.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Best regards, Ilya Maximets.
>>>>>>>
>>>>>>> I think its a good idea, and I'm fine with proposing something like this in a
>>>>>>> different series, as this isn't a new problem from this series and existed before
>>>>>>> it, at least with CT rules.
>>>>>>
>>>>>> Hmm, I didn't realize the issue already exists.
>>>>>
>>>>> Maintainers: please give me up to Friday to review this patchset.
>>>>>
>>>>> Disclaimer: I had missed this patchset, and I didn't even read it yet.
>>>>>
>>>>> I don't follow. Can someone please rephase the issue please?
>>>>> AFAICT, it is not that the NIC is offloading half of the action list
>>>>> and never executing a part of it. Instead, for established connections
>>>>> the rule will work fully offloaded. While for misses in the CT action,
>>>>> it will simply trigger a miss, like it already does today.
>>>>
>>>> You got it right, and like you said it was like this before so its not
>>>> strictly related by this series and could be in a different patchset. And I
>>>> thought that (extra) flag would mean that it can miss, compared to other
>>>> rules/actions combination that will never miss because they
>>>> don't need sw support.
>>>
>>> This is different from what I understood from Ilya's comment. Maybe I
>>> got his comment wrong, but I have the impression that he meant it in
>>> the sense of having some actions offloaded and some not.
>>> Which I thinkit is not the goal here.
>>
>> I don't really know the code around this patch set well enough, so my
>> thoughts might be a bit irrelevant.  But after reading the cover letter
>> and commit messages in this patch set I imagined that if we have some
>> kind of miss on the N-th action in a list in HW, we could go to software
>> tc, find that action and continue execution from it.  In this case some
>> actions are executed in HW and some are in SW.
> 
> Precisely. :)
> 
>>
>> From the user's perspective, if such tc filter reports an 'in_hw' flag,
>> that would be a bit misleading, IMO.
> 
> I may be tainted or perhaps even biased here, but I don't see how it
> can be misleading. Since we came up with skip_hw/sw I think it is
> expected that packets can be handled in both datapaths. The flag is
> just saying that hw has this flow. (btw, in_sw is simplified, as sw
> always accepts the flow if skip_sw is not used)
> 
>>
>> If that is not what is happening here, then please ignore my comments,
>> as I'm not sure what this code is about then. :)
>>
>>>
>>> But anyway, flows can have some packets matching in sw while also
>>> being in hw. That's expected. For example, in more complex flow sets,
>>> if a packet hit a flow with ct action and triggered a miss, all
>>> subsequent flows will handle this packet in sw. Or if we have queued
>>> packets in rx ring already and ovs just updated the datapath, these
>>> will match in tc sw instead of going to upcall. The latter will have
>>> only a few hits, yes, but the former will be increasing over time.
>>> I'm not sure how a new flag, which is probably more informative than
>>> an actual state indication, would help here.
>>
>> These cases are related to just one or a very few packets, so for them
>> it's generally fine to report 'in_hw', I think.  The vast majority of
>> traffic will be handled in HW.
>>
>> My thoughts were about a case where we have a lot of traffic handled
>> partially in HW and in SW.  Let's say we have N actions and HW doesn't
>> support action M.  In this case, driver may offload actions [0, M - 1]
>> inserting some kind of forced "HW miss" at the end, so actions [M, N]
>> can be executed in TC software.
> 
> Right. Please lets consider this other scenario then. Consider that we
> have these flows:
> chain 0,ip,match ip X actions=ct,goto chain 1
> chain 1,proto_Y_specific_match actions=ct(nat),goto chain 2
> chain 2 actions=output:3
> 
> The idea here is that on chain 1, the HW doesn't support that particular
> match on proto Y. That flow will never be in_hw, and that's okay. But
> the flow on chain 2, though, will be tagged as in_hw, and for packets
> following these specific sequence, they will get handled in sw on
> chain 2.
> 
> But if we have another flow there:
> chain 1,proto tcp actions=ct(nat),set_ttl,goto chain 2
> which is supported by the hw, such packets would be handled by hw in
> chain 2.
> 
> The flow on chain 2 has no idea on what was done before it. It can't
> be tagged with _PARTIAL as the actions in there are not expected to
> trigger misses, yet, with this flow set, it is expected to handle
> packets in both datapaths, despite being 'in_hw'.
> 
> I guess what I'm trying so say is that it is not because a flow is
> tagged with in_hw that sw processing is unexpected straight away.
> 
> Hopefully this makes sense?

Yep.  I see your point.  In this case I agree that we can't really tell
if the traffic will be handled in HW or SW and the chain 2 will be
always handled in both.  So, the fact that it is 'in_hw' only means that
the chain is actually in HW as that HW actually has it.

Summarizing: having something doesn't mean using it. :)  So, thinking
that in_hw flows are actually fully processed in HW is a user's fault. :/

However, going back to my example where HW supports half of actions
in the chain ([0, M - 1]) and doesn't support the other half ([M, N])...
If the actions M to N are actually not installed into HW, marking the
chain as in_hw is a bit misleading, because unlike your example, not all
the actions are actually in HW and driver knows that.  For that case,
something like _PARTIAL suffix might still be useful.

> 
>>
>> But now I'm not sure if that is possible with the current implementation.
> 
> AFAICT you got all right. It is me that had misunderstood you. :)
> 
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> So how about I'll propose it in a different series and we continue with this first?
>>>>>
>>>>> So I'm not sure either on what's the idea here.
>>>>>
>>>>> Thanks,
>>>>> Marcelo
>>>>>
>>>>>>
>>>>>> Sounds fine to me.  Thanks!
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>
>>>>
>>>
>>
>