linux-kernel - Re: [PATCH] rps: process the skb directly if rps cpu not changed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <af8e5c83-cab9-1feb-121a-fb77a0b31215@huawei.com>
Date:   Thu, 23 Mar 2023 18:47:18 +0800
From:   Yunsheng Lin <linyunsheng@...wei.com>
To:     xu xin <xu.xin.sc@...il.com>, <kuba@...nel.org>
CC:     <davem@...emloft.net>, <edumazet@...gle.com>,
        <jiang.xuexin@....com.cn>, <linux-kernel@...r.kernel.org>,
        <netdev@...r.kernel.org>, <xu.xin16@....com.cn>,
        <yang.yang29@....com.cn>, <zhang.yunkai@....com.cn>
Subject: Re: [PATCH] rps: process the skb directly if rps cpu not changed

On 2023/3/23 18:04, xu xin wrote:
>> On 2023/3/22 15:24, xu xin wrote:
>>> [So sorry, I made a mistake in the reply title]
>>>
>>> On 2023/3/21 20:12, yang.yang29@....com.cn wrote:
>>>>> From: xu xin <xu.xin16@....com.cn>
>>>>>
>>>>> In the RPS procedure of NAPI receiving, regardless of whether the
>>>>> rps-calculated CPU of the skb equals to the currently processing CPU, RPS
>>>>> will always use enqueue_to_backlog to enqueue the skb to per-cpu backlog,
>>>>> which will trigger a new NET_RX softirq.
>>>>
>>>> Does bypassing the backlog cause out of order problem for packet handling?
>>>> It seems currently the RPS/RFS will ensure order delivery,such as:
>>>> https://elixir.bootlin.com/linux/v6.3-rc3/source/net/core/dev.c#L4485
>>>>
>>>> Also, this is an optimization, it should target the net-next branch:
>>>> [PATCH net-next] rps: process the skb directly if rps cpu not changed
>>>>
>>>
>>> Well, I thought the patch would't break the effort RFS tried to avoid "Out of
>>> Order" packets. But thanks for your reminder, I rethink it again, bypassing the
>>> backlog from "netif_receive_skb_list" will mislead RFS's judging if all
>>> previous packets for the flow have been dequeued, where RFS thought all packets
>>> have been dealed with, but actually they are still in skb lists. Fortunately,
>>> bypassing the backlog from "netif_receive_skb" for a single skb is okay and won't
>>> cause OOO packets because every skb is processed serially by RPS and sent to the
>>> protocol stack as soon as possible.
>>
>> Suppose a lot of skbs have been queued to the backlog waiting to
>> processed and passed to the stack when current_cpu is not the same
>> as the target cpu,
> 
> Well.  I'm afraid that what we mean by current_cpu may be different. The
> "current_cpu" in my patch refers to the cpu NAPI poll is running on (Or
> the cpu that the skb origins from).

That's what I meant too. current_cpu refers to the cpu on which the driver's
NAPI poll is running.

> 
>> then current_cpu is changed to be the same as the
>> target cpu, with your patch, new skb will be processed and passed to
>> the stack immediately, which may bypass the old skb in the backlog.
>>
> I think Nop, RFS procedure won't let target cpu switch into a new cpu
> if there are still old skbs in the backlog of last recorded cpu. So the
> target cpu of the new skb will never equal to current_cpu if old skb in the
> backlog.
> ==========================================================================
> Let me draw the situation you described: At the time of T1, the app runs
> on cpu-0, so there are many packets queueing into the rxqueue-0 by RFS from
> CPU-1(suppose NAPI poll processing on cpu-1). Then, suddenly at the time of
> T2, the app tranfers to cpu-1, RFS know there are still old skb in rxqueue-0,
> so get_rps_cpu will not return a value of cpu-1, but cpu-0 instead.
> 
> ========================================================
> When T1, app runs on cpu-0:
>   APP
> -----------------------------
> |      |        |      |
> |cpu-0 |        |cpu-1 |
> |stack |        |stack |
> |      |        |      |
>    ^
>   |=|
>   |=|             | |
>   |=|             | |
> (rxqueue-0)      (rxqueue-1,empty)
>    ^<--
>        <--
>          <--
> 	     <-- packet(poll on cpu1)
> 			
> ===========================================================
> When T2, app tranfer to cpu-1, target cpu is still on cpu-0:
>                   APP
> ----------------------------
> |      |        |      |
> |cpu-0 |        |cpu-1 |
> |stack |        |stack |
> |      |        |      |
>    ^
>    |
>   |=|             | |
>   |=|             | |
> (rxqueue-0)      (rxqueue-2,empty)
>    ^<--
>        <--
>          <--
>             <-- packet(poll on cpu1)

what about the NAPI poll changing between cpu0 and cpu1 while
APP stays on the same cpu?

Note that backlog queue is per cpu.

> 
> ===================================
> 
> Thanks for your reply.
> 
>>>
>>> If I'm correct, the code as follws can fix this.
>>>
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
>>>         if (static_branch_unlikely(&rps_needed)) {
>>>                 struct rps_dev_flow voidflow, *rflow = &voidflow;
>>>                 int cpu = get_rps_cpu(skb->dev, skb, &rflow);
>>> +               int current_cpu = smp_processor_id();
>>>  
>>> -               if (cpu >= 0) {
>>> +               if (cpu >= 0 && cpu != current_cpu) {
>>>                         ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>>                         rcu_read_unlock();
>>>                         return ret;
>>> @@ -5699,11 +5700,15 @@ void netif_receive_skb_list_internal(struct list_head *head)
>>>                 list_for_each_entry_safe(skb, next, head, list) {
>>>                         struct rps_dev_flow voidflow, *rflow = &voidflow;
>>>                         int cpu = get_rps_cpu(skb->dev, skb, &rflow);
>>> +                       int current_cpu = smp_processor_id();
>>>  
>>>                         if (cpu >= 0) {
>>>                                 /* Will be handled, remove from list */
>>>                                 skb_list_del_init(skb);
>>> -                               enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>> +                               if (cpu != current_cpu)
>>> +                                       enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>> +                               else
>>> +                                       __netif_receive_skb(skb);
>>>                         }
>>>                 }
>>>
>>>
>>> Thanks.
>>>
>>>>>
>>>>> Actually, it's not necessary to enqueue it to backlog when rps-calculated
>>>>> CPU id equals to the current processing CPU, and we can call
>>>>> __netif_receive_skb or __netif_receive_skb_list to process the skb directly.
>>>>> The benefit is that it can reduce the number of softirqs of NET_RX and reduce
>>>>> the processing delay of skb.
>>>>>
>>>>> The measured result shows the patch brings 50% reduction of NET_RX softirqs.
>>>>> The test was done on the QEMU environment with two-core CPU by iperf3.
>>>>> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>>>> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>>>>
>>>>> Previous RPS:
>>>>> 		    	CPU0       CPU1
>>>>> NET_RX:         45          0    (before iperf3 testing)
>>>>> NET_RX:        1095         241   (after iperf3 testing)
>>>>>
>>>>> Patched RPS:
>>>>>                 CPU0       CPU1
>>>>> NET_RX:         28          4    (before iperf3 testing)
>>>>> NET_RX:         573         32   (after iperf3 testing)
>>>>
>>>> Sincerely.
>>>> Xu Xin
>>> .
>>>
> .
>