netdev - Re: Re: [PATCH net] net: Correct wrong skb_flow

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAF=yD-JuAivOyrimVSxB_TtaiyiT498Z6fpe+XpF_oNqCMdUPA@mail.gmail.com>
Date:   Fri, 11 May 2018 09:23:55 -0400
From:   Willem de Bruijn <willemdebruijn.kernel@...il.com>
To:     Gao Feng <gfree.wind@....163.com>
Cc:     "davem@...emloft.net" <davem@...emloft.net>,
        "daniel@...earbox.net" <daniel@...earbox.net>,
        "jakub.kicinski@...ronome.com" <jakub.kicinski@...ronome.com>,
        David Ahern <dsahern@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Re: [PATCH net] net: Correct wrong skb_flow_limit check when
 enable RPS

On Fri, May 11, 2018 at 2:20 AM, Gao Feng <gfree.wind@....163.com> wrote:
> At 2018-05-11 11:54:55, "Willem de Bruijn" <willemdebruijn.kernel@...il.com> wrote:
>>On Thu, May 10, 2018 at 4:28 AM,  <gfree.wind@....163.com> wrote:
>>> From: Gao Feng <gfree.wind@....163.com>
>>>
>>> The skb flow limit is implemented for each CPU independently. In the
>>> current codes, the function skb_flow_limit gets the softnet_data by
>>> this_cpu_ptr. But the target cpu of enqueue_to_backlog would be not
>>> the current cpu when enable RPS. As the result, the skb_flow_limit checks
>>> the stats of current CPU, while the skb is going to append the queue of
>>> another CPU. It isn't the expected behavior.
>>>
>>> Now pass the softnet_data as a param to softnet_data to make consistent.
>>
>>The local cpu softnet_data is used on purpose. The operations in
>>skb_flow_limit() on sd fields could race if not executed on the local cpu.
>
> I think the race doesn't exist because of the rps_lock.
> The enqueue_to_backlog has hold the rps_lock before skb_flow_limit.

Indeed, I overlooked that. There still is the matter of cache contention.

>>Flow limit tries to detect large ("elephant") DoS flows with a fixed four-tuple.
>>These would always hit the same RPS cpu, so that cpu being backlogged
>
> They may hit the different target CPU when enable RFS. Because the app could be scheduled
> to another CPU, then RFS tries to deliver the skb to latest core which has hot cache.

This even more suggest using the initial (or IRQ) cpu to track state, instead
of the destination (RPS/RFS) cpu.

>>may be an indication that such a flow is active. But the flow will also always
>>arrive on the same initial cpu courtesy of RSS. So storing the lookup table
>
> The RSS couldn't make sure the irq is handled by same cpu. It would be balanced between
> the cpus.

IRQs are usually pinned to cores. Unless using something like irqbalance,
but that operates at too coarse a timescale to do anything useful at Mpps
packet rates.

>>on the initial CPU is also fine. There may be false positives on other CPUs
>>with the same RPS destination, but that is unlikely with a highly concurrent
>>traffic server mix ("mice").
>
> If my comment is right, the flow couldn't always arrive one the same initial cpu,  although
> it may be sent to one same target cpu.
>
>>
>>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature
>>for the cpus on which traffic initially lands, not the RPS destination cpus.
>>See also Documentation/networking/scaling.txt
>>
>>That said, I had to reread the code, as it does seem sensible that the
>>same softnet_data is intended to be used both when testing qlen and
>>flow_limit.
>
> In most cases, user configures the same RPS map with flow_limit like 0xff.
> Because user couldn't predict which core the evil flow would arrive on.
>
> Take an example, there are 2 cores, cpu0 and cpu1.
> One flow is the an evil flow, but the irq is sent to cpu0. After RPS/RFS, the target cpu is cpu1.
> Now cpu0 invokes enqueue_to_backlog, then the skb_flow_limit checkes the queue length
> of cpu0. Certainly it could pass the check of skb_flow_limit because there is no any evil flow on cpu0.

No, enqueue_to_backlog passes qlen to skb_flow_limit, so that does
check the queue length of the RPS cpu.

> Then the cpu0 inserts the skb into the queue of cpu1.
> As a result, the skb_flow_limit doesn't work as expected.
>
> BTW, I have already sent the v2 patch which only adds the "Fixes: tag".

The change also makes the code inconsistent with
Documentation/networking/scaling.txt

"In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity)."