netdev - Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Thu, 9 Feb 2023 14:40:32 +0200
From:   Paul Blakey <paulb@...dia.com>
To:     Ilya Maximets <i.maximets@....org>,
        Marcelo Leitner <mleitner@...hat.com>
Cc:     netdev@...r.kernel.org, Saeed Mahameed <saeedm@...dia.com>,
        Paolo Abeni <pabeni@...hat.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        "David S. Miller" <davem@...emloft.net>,
        Oz Shlomo <ozsh@...dia.com>, Jiri Pirko <jiri@...dia.com>,
        Roi Dayan <roid@...dia.com>, Vlad Buslov <vladbu@...dia.com>
Subject: Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss
 to tc action



On 09/02/2023 14:07, Ilya Maximets wrote:
> On 2/9/23 02:09, Marcelo Leitner wrote:
>> On Thu, Feb 09, 2023 at 01:09:21AM +0100, Ilya Maximets wrote:
>>> On 2/8/23 19:01, Marcelo Leitner wrote:
>>>> On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
>>>>>
>>>>>
>>>>> On 07/02/2023 07:03, Marcelo Leitner wrote:
>>>>>> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
>>>>>>> On 2/6/23 18:14, Paul Blakey wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 06/02/2023 14:34, Ilya Maximets wrote:
>>>>>>>>> On 2/5/23 16:49, Paul Blakey wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This series adds support for hardware miss to instruct tc to continue execution
>>>>>>>>>> in a specific tc action instance on a filter's action list. The mlx5 driver patch
>>>>>>>>>> (besides the refactors) shows its usage instead of using just chain restore.
>>>>>>>>>>
>>>>>>>>>> Currently a filter's action list must be executed all together or
>>>>>>>>>> not at all as driver are only able to tell tc to continue executing from a
>>>>>>>>>> specific tc chain, and not a specific filter/action.
>>>>>>>>>>
>>>>>>>>>> This is troublesome with regards to action CT, where new connections should
>>>>>>>>>> be sent to software (via tc chain restore), and established connections can
>>>>>>>>>> be handled in hardware.
>>>>>>>>>>
>>>>>>>>>> Checking for new connections is done when executing the ct action in hardware
>>>>>>>>>> (by checking the packet's tuple against known established tuples).
>>>>>>>>>> But if there is a packet modification (pedit) action before action CT and the
>>>>>>>>>> checked tuple is a new connection, hardware will need to revert the previous
>>>>>>>>>> packet modifications before sending it back to software so it can
>>>>>>>>>> re-match the same tc filter in software and re-execute its CT action.
>>>>>>>>>>
>>>>>>>>>> The following is an example configuration of stateless nat
>>>>>>>>>> on mlx5 driver that isn't supported before this patchet:
>>>>>>>>>>
>>>>>>>>>>     #Setup corrosponding mlx5 VFs in namespaces
>>>>>>>>>>     $ ip netns add ns0
>>>>>>>>>>     $ ip netns add ns1
>>>>>>>>>>     $ ip link set dev enp8s0f0v0 netns ns0
>>>>>>>>>>     $ ip netns exec ns0 ifconfig enp8s0f0v0 1.1.1.1/24 up
>>>>>>>>>>     $ ip link set dev enp8s0f0v1 netns ns1
>>>>>>>>>>     $ ip netns exec ns1 ifconfig enp8s0f0v1 1.1.1.2/24 up
>>>>>>>>>>
>>>>>>>>>>     #Setup tc arp and ct rules on mxl5 VF representors
>>>>>>>>>>     $ tc qdisc add dev enp8s0f0_0 ingress
>>>>>>>>>>     $ tc qdisc add dev enp8s0f0_1 ingress
>>>>>>>>>>     $ ifconfig enp8s0f0_0 up
>>>>>>>>>>     $ ifconfig enp8s0f0_1 up
>>>>>>>>>>
>>>>>>>>>>     #Original side
>>>>>>>>>>     $ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
>>>>>>>>>>        ct_state -trk ip_proto tcp dst_port 8888 \
>>>>>>>>>>          action pedit ex munge tcp dport set 5001 pipe \
>>>>>>>>>>          action csum ip tcp pipe \
>>>>>>>>>>          action ct pipe \
>>>>>>>>>>          action goto chain 1
>>>>>>>>>>     $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
>>>>>>>>>>        ct_state +trk+est \
>>>>>>>>>>          action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>>     $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
>>>>>>>>>>        ct_state +trk+new \
>>>>>>>>>>          action ct commit pipe \
>>>>>>>>>>          action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>>     $ tc filter add dev enp8s0f0_0 ingress chain 0 proto arp flower \
>>>>>>>>>>          action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>>
>>>>>>>>>>     #Reply side
>>>>>>>>>>     $ tc filter add dev enp8s0f0_1 ingress chain 0 proto arp flower \
>>>>>>>>>>          action mirred egress redirect dev enp8s0f0_0
>>>>>>>>>>     $ tc filter add dev enp8s0f0_1 ingress chain 0 proto ip flower \
>>>>>>>>>>        ct_state -trk ip_proto tcp \
>>>>>>>>>>          action ct pipe \
>>>>>>>>>>          action pedit ex munge tcp sport set 8888 pipe \
>>>>>>>>>>          action csum ip tcp pipe \
>>>>>>>>>>          action mirred egress redirect dev enp8s0f0_0
>>>>>>>>>>
>>>>>>>>>>     #Run traffic
>>>>>>>>>>     $ ip netns exec ns1 iperf -s -p 5001&
>>>>>>>>>>     $ sleep 2 #wait for iperf to fully open
>>>>>>>>>>     $ ip netns exec ns0 iperf -c 1.1.1.2 -p 8888
>>>>>>>>>>
>>>>>>>>>>     #dump tc filter stats on enp8s0f0_0 chain 0 rule and see hardware packets:
>>>>>>>>>>     $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep "hardware.*pkt"
>>>>>>>>>>            Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>>            Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>>            Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>>
>>>>>>>>>> A new connection executing the first filter in hardware will first rewrite
>>>>>>>>>> the dst port to the new port, and then the ct action is executed,
>>>>>>>>>> because this is a new connection, hardware will need to be send this back
>>>>>>>>>> to software, on chain 0, to execute the first filter again in software.
>>>>>>>>>> The dst port needs to be reverted otherwise it won't re-match the old
>>>>>>>>>> dst port in the first filter. Because of that, currently mlx5 driver will
>>>>>>>>>> reject offloading the above action ct rule.
>>>>>>>>>>
>>>>>>>>>> This series adds supports partial offload of a filter's action list,
>>>>>>>>>> and letting tc software continue processing in the specific action instance
>>>>>>>>>> where hardware left off (in the above case after the "action pedit ex munge tcp
>>>>>>>>>> dport... of the first rule") allowing support for scenarios such as the above.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi, Paul.  Not sure if this was discussed before, but don't we also need
>>>>>>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>>>>>>>>
>>>>>>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>>>>>>>>> flags are not per-action.  This may cause confusion among users, if flows
>>>>>>>>> are reported as in_hw, while they are actually partially or even mostly
>>>>>>>>> processed in SW.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> Best regards, Ilya Maximets.
>>>>>>>>
>>>>>>>> I think its a good idea, and I'm fine with proposing something like this in a
>>>>>>>> different series, as this isn't a new problem from this series and existed before
>>>>>>>> it, at least with CT rules.
>>>>>>>
>>>>>>> Hmm, I didn't realize the issue already exists.
>>>>>>
>>>>>> Maintainers: please give me up to Friday to review this patchset.
>>>>>>
>>>>>> Disclaimer: I had missed this patchset, and I didn't even read it yet.
>>>>>>
>>>>>> I don't follow. Can someone please rephase the issue please?
>>>>>> AFAICT, it is not that the NIC is offloading half of the action list
>>>>>> and never executing a part of it. Instead, for established connections
>>>>>> the rule will work fully offloaded. While for misses in the CT action,
>>>>>> it will simply trigger a miss, like it already does today.
>>>>>
>>>>> You got it right, and like you said it was like this before so its not
>>>>> strictly related by this series and could be in a different patchset. And I
>>>>> thought that (extra) flag would mean that it can miss, compared to other
>>>>> rules/actions combination that will never miss because they
>>>>> don't need sw support.
>>>>
>>>> This is different from what I understood from Ilya's comment. Maybe I
>>>> got his comment wrong, but I have the impression that he meant it in
>>>> the sense of having some actions offloaded and some not.
>>>> Which I thinkit is not the goal here.
>>>
>>> I don't really know the code around this patch set well enough, so my
>>> thoughts might be a bit irrelevant.  But after reading the cover letter
>>> and commit messages in this patch set I imagined that if we have some
>>> kind of miss on the N-th action in a list in HW, we could go to software
>>> tc, find that action and continue execution from it.  In this case some
>>> actions are executed in HW and some are in SW.
>>
>> Precisely. :)
>>
>>>
>>>  From the user's perspective, if such tc filter reports an 'in_hw' flag,
>>> that would be a bit misleading, IMO.
>>
>> I may be tainted or perhaps even biased here, but I don't see how it
>> can be misleading. Since we came up with skip_hw/sw I think it is
>> expected that packets can be handled in both datapaths. The flag is
>> just saying that hw has this flow. (btw, in_sw is simplified, as sw
>> always accepts the flow if skip_sw is not used)
>>
>>>
>>> If that is not what is happening here, then please ignore my comments,
>>> as I'm not sure what this code is about then. :)
>>>
>>>>
>>>> But anyway, flows can have some packets matching in sw while also
>>>> being in hw. That's expected. For example, in more complex flow sets,
>>>> if a packet hit a flow with ct action and triggered a miss, all
>>>> subsequent flows will handle this packet in sw. Or if we have queued
>>>> packets in rx ring already and ovs just updated the datapath, these
>>>> will match in tc sw instead of going to upcall. The latter will have
>>>> only a few hits, yes, but the former will be increasing over time.
>>>> I'm not sure how a new flag, which is probably more informative than
>>>> an actual state indication, would help here.
>>>
>>> These cases are related to just one or a very few packets, so for them
>>> it's generally fine to report 'in_hw', I think.  The vast majority of
>>> traffic will be handled in HW.
>>>
>>> My thoughts were about a case where we have a lot of traffic handled
>>> partially in HW and in SW.  Let's say we have N actions and HW doesn't
>>> support action M.  In this case, driver may offload actions [0, M - 1]
>>> inserting some kind of forced "HW miss" at the end, so actions [M, N]
>>> can be executed in TC software.
>>
>> Right. Please lets consider this other scenario then. Consider that we
>> have these flows:
>> chain 0,ip,match ip X actions=ct,goto chain 1
>> chain 1,proto_Y_specific_match actions=ct(nat),goto chain 2
>> chain 2 actions=output:3
>>
>> The idea here is that on chain 1, the HW doesn't support that particular
>> match on proto Y. That flow will never be in_hw, and that's okay. But
>> the flow on chain 2, though, will be tagged as in_hw, and for packets
>> following these specific sequence, they will get handled in sw on
>> chain 2.
>>
>> But if we have another flow there:
>> chain 1,proto tcp actions=ct(nat),set_ttl,goto chain 2
>> which is supported by the hw, such packets would be handled by hw in
>> chain 2.
>>
>> The flow on chain 2 has no idea on what was done before it. It can't
>> be tagged with _PARTIAL as the actions in there are not expected to
>> trigger misses, yet, with this flow set, it is expected to handle
>> packets in both datapaths, despite being 'in_hw'.
>>
>> I guess what I'm trying so say is that it is not because a flow is
>> tagged with in_hw that sw processing is unexpected straight away.
>>
>> Hopefully this makes sense?
> 
> Yep.  I see your point.  In this case I agree that we can't really tell
> if the traffic will be handled in HW or SW and the chain 2 will be
> always handled in both.  So, the fact that it is 'in_hw' only means that
> the chain is actually in HW as that HW actually has it.
> 
> Summarizing: having something doesn't mean using it. :)  So, thinking
> that in_hw flows are actually fully processed in HW is a user's fault. :/
> 
> However, going back to my example where HW supports half of actions
> in the chain ([0, M - 1]) and doesn't support the other half ([M, N])...
> If the actions M to N are actually not installed into HW, marking the
> chain as in_hw is a bit misleading, because unlike your example, not all
> the actions are actually in HW and driver knows that.  For that case,
> something like _PARTIAL suffix might still be useful.


Right, and for now at least, no one does that :)

> 
>>
>>>
>>> But now I'm not sure if that is possible with the current implementation.
>>
>> AFAICT you got all right. It is me that had misunderstood you. :)
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> So how about I'll propose it in a different series and we continue with this first?
>>>>>>
>>>>>> So I'm not sure either on what's the idea here.
>>>>>>
>>>>>> Thanks,
>>>>>> Marcelo
>>>>>>
>>>>>>>
>>>>>>> Sounds fine to me.  Thanks!
>>>>>>>
>>>>>>> Best regards, Ilya Maximets.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>