netdev - Re:Re: Re: Re: [PATCH net] net: Correct wrong skb_flow

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Mon, 14 May 2018 09:26:21 +0800 (CST)
From:   "Gao Feng" <gfree.wind@....163.com>
To:     "Willem de Bruijn" <willemdebruijn.kernel@...il.com>
Cc:     "davem@...emloft.net" <davem@...emloft.net>,
        "daniel@...earbox.net" <daniel@...earbox.net>,
        "jakub.kicinski@...ronome.com" <jakub.kicinski@...ronome.com>,
        "David Ahern" <dsahern@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re:Re: Re: Re: [PATCH net] net: Correct wrong skb_flow_limit check
 when enable RPS

At 2018-05-11 22:56:04, "Willem de Bruijn" <willemdebruijn.kernel@...il.com> wrote:
>On Fri, May 11, 2018 at 10:44 AM, Gao Feng <gfree.wind@....163.com> wrote:
>> At 2018-05-11 21:23:55, "Willem de Bruijn" <willemdebruijn.kernel@...il.com> wrote:
>>>On Fri, May 11, 2018 at 2:20 AM, Gao Feng <gfree.wind@....163.com> wrote:
>>>> At 2018-05-11 11:54:55, "Willem de Bruijn" <willemdebruijn.kernel@...il.com> wrote:
>>>>>On Thu, May 10, 2018 at 4:28 AM,  <gfree.wind@....163.com> wrote:
>>>>>> From: Gao Feng <gfree.wind@....163.com>
>>>>>>
>>>>>> The skb flow limit is implemented for each CPU independently. In the
>>>>>> current codes, the function skb_flow_limit gets the softnet_data by
>>>>>> this_cpu_ptr. But the target cpu of enqueue_to_backlog would be not
>>>>>> the current cpu when enable RPS. As the result, the skb_flow_limit checks
>>>>>> the stats of current CPU, while the skb is going to append the queue of
>>>>>> another CPU. It isn't the expected behavior.
>>>>>>
>>>>>> Now pass the softnet_data as a param to softnet_data to make consistent.
>>>>>
>>>>>The local cpu softnet_data is used on purpose. The operations in
>>>>>skb_flow_limit() on sd fields could race if not executed on the local cpu.
>>>>
>>>> I think the race doesn't exist because of the rps_lock.
>>>> The enqueue_to_backlog has hold the rps_lock before skb_flow_limit.
>>>
>>>Indeed, I overlooked that. There still is the matter of cache contention.
>>
>> The cache contention is really important in this case?
>> I don't think so, because the enqueue_to_backlog have touched and modified the softnet_stat
>> of target cpu.
>>
>>>
>>>>>Flow limit tries to detect large ("elephant") DoS flows with a fixed four-tuple.
>>>>>These would always hit the same RPS cpu, so that cpu being backlogged
>>>>
>>>> They may hit the different target CPU when enable RFS. Because the app could be scheduled
>>>> to another CPU, then RFS tries to deliver the skb to latest core which has hot cache.
>>>
>>>This even more suggest using the initial (or IRQ) cpu to track state, instead
>>>of the destination (RPS/RFS) cpu.
>>
>> I couldn't understand why it is better to track state on initial cpu, not the target cpu.
>> The latter one could get more accurate result.
>
>For a single DoS flow with normal cpu pinned IRQs, the results will be equally
>good when tracked on the initial IRQ cpu..
>
>>
>>>
>>>>>may be an indication that such a flow is active. But the flow will also always
>>>>>arrive on the same initial cpu courtesy of RSS. So storing the lookup table
>>>>
>>>> The RSS couldn't make sure the irq is handled by same cpu. It would be balanced between
>>>> the cpus.
>>>
>>>IRQs are usually pinned to cores. Unless using something like irqbalance,
>>>but that operates at too coarse a timescale to do anything useful at Mpps
>>>packet rates.
>>
>> There are some motherboard which couldn't make sure the irq is pinned.
>> The flow_limit wouldn't work as well as expected.
>
>.. this seems to be the crux of the argument. I am not aware of any network
>interrupts that do not adhere to the cpu pinning configuration in
>
>   /proc/irq/$IRQ/smp_affinity(_list)
>

When smp_affinity is configured 0xff, I met some hardwares they deliver most of
irqs to a specific cpu, and left irqs are spread to the other cpus. And it couldn't 
make sure the irq of one flow is delivered to a fixed cpu.

I don't know if it is caused by apic or motherboard.
And what information you need to confirm it.

>What kind of hardware ignores this setting and sprays interrupts? I agree
>that in that case flow_limit as is may be ineffective (if migration happens
>at rates comparable to packet rates). But this should not happen?
>
>>
>>>
>>>>>on the initial CPU is also fine. There may be false positives on other CPUs
>>>>>with the same RPS destination, but that is unlikely with a highly concurrent
>>>>>traffic server mix ("mice").
>>>>
>>>> If my comment is right, the flow couldn't always arrive one the same initial cpu,  although
>>>> it may be sent to one same target cpu.
>>>>
>>>>>
>>>>>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature
>>>>>for the cpus on which traffic initially lands, not the RPS destination cpus.
>>>>>See also Documentation/networking/scaling.txt
>>>>>
>>>>>That said, I had to reread the code, as it does seem sensible that the
>>>>>same softnet_data is intended to be used both when testing qlen and
>>>>>flow_limit.
>>>>
>>>> In most cases, user configures the same RPS map with flow_limit like 0xff.
>>>> Because user couldn't predict which core the evil flow would arrive on.
>>>>
>>>> Take an example, there are 2 cores, cpu0 and cpu1.
>>>> One flow is the an evil flow, but the irq is sent to cpu0. After RPS/RFS, the target cpu is cpu1.
>>>> Now cpu0 invokes enqueue_to_backlog, then the skb_flow_limit checkes the queue length
>>>> of cpu0. Certainly it could pass the check of skb_flow_limit because there is no any evil flow on cpu0.
>>>
>>>No, enqueue_to_backlog passes qlen to skb_flow_limit, so that does
>>>check the queue length of the RPS cpu.
>>
>> Sorry, I overlooked the qlen is the length of the rps cpu.
>> Then it's ok unless the stats may be not accurate when irq isn't pinned.
>>
>> But I still doubt that is it really important to track state on initial cpu, not target cpu?
>> Because the enqueue_to_backlog have touched the softnet_data of target cpu.
>
>I think the merit of both IRQ and RPS cpu can be argued for attaching the
>flow_limit state.
>
>Either way, the current behavior is not a bug, so I don't think that this is a
>candidate for net.
>
>The cost of moving from IRQ to RPS cpu will be the cacheline contention
>on a system with multiple IRQ cpus that all try to update the sd->flow_data
>of the same RPS cpus. Which is particularly likely with RFS. I suspect that
>this cost is non-trivial and not worth the benefit of handling hardware with
>unpinned IRQs.

Ok, I agree with you.
Thanks your detail discussion with me.

Best Regards
Feng