netdev - Re: [net PATCH v2] net: sched, fix OOO packets with pfifo

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <36a89ed1-d6ff-ddad-c736-4e68909d61c4@gmail.com>
Date:   Wed, 18 Apr 2018 09:44:28 -0700
From:   John Fastabend <john.fastabend@...il.com>
To:     Paolo Abeni <pabeni@...hat.com>,
        Cong Wang <xiyou.wangcong@...il.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        Jiri Pirko <jiri@...nulli.us>,
        David Miller <davem@...emloft.net>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

On 04/18/2018 12:28 AM, Paolo Abeni wrote:
> Hi,
> 
> let me revive this old thread...
> 
> On Mon, 2018-03-26 at 11:16 -0700, John Fastabend wrote:
>> On 03/26/2018 10:30 AM, Cong Wang wrote:
>>> On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
>>> <john.fastabend@...il.com> wrote:
>>>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>>>> enqueue threads and dequeue threads to run in parallel. On the
>>>> enqueue side the skb bit ooo_okay is used to ensure all related
>>>> skbs are enqueued in-order. On the dequeue side though there is
>>>> no similar logic. What we observe is with fewer queues than CPUs
>>>> it is possible to re-order packets when two instances of
>>>> __qdisc_run() are running in parallel. Each thread will dequeue
>>>> a skb and then whichever thread calls the ndo op first will
>>>> be sent on the wire. This doesn't typically happen because
>>>> qdisc_run() is usually triggered by the same core that did the
>>>> enqueue. However, drivers will trigger __netif_schedule()
>>>> when queues are transitioning from stopped to awake using the
>>>> netif_tx_wake_* APIs. When this happens netif_schedule() calls
>>>> qdisc_run() on the same CPU that did the netif_tx_wake_* which
>>>> is usually done in the interrupt completion context. This CPU
>>>> is selected with the irq affinity which is unrelated to the
>>>> enqueue operations.
>>>
>>> Interesting. Why this is unique to pfifo_fast? For me it could
>>> happen to other qdisc's too, when we release the qdisc root
>>> lock in sch_direct_xmit(), another CPU could dequeue from
>>> the same qdisc and transmit the skb in parallel too?
>>>
>>
>> Agreed, my guess is it never happens because the timing is
>> tighter in the lock case. Or if it is happening its infrequent
>> enough that no one noticed the OOO packets.
> 
> I think the above could not happend due to the qdisc seqlock - which is
> not acquired by NOLOCK qdiscs.
> 

Yep, seems to be the case.

>> For net-next we probably could clean this up. I was just
>> going for something simple in net that didn't penalize all
>> qdiscs as Eric noted. This patch doesn't make it any worse
>> at least. And we have been living with the above race for
>> years.
> 
> I've benchmarked this patch is some different scenario, and in my
> testing it introduces a measurable regression in uncontended/lightly
> contended scenarios. The measured peak negative delta is with a pktgen
> thread using "xmit_mode queue_xmit":
> 
> before: 27674032 pps
> after: 23809052 pps

Yeah more atomic ops :/

> 
> I spend some time searching a way to improve this, without success.
> 
> John, did you had any chance to look at this again?
> 

If we have a multiple cores pulling from the same skb list and
feeding the same txq this happens. One problem is even if the
normal dev_queue_xmit path is aligned drivers call netif_schedule
from interrupt context and that happens on an arbitrary a cpu. When
the arbitrary cpu runs the netif_schedule logic it will dequeue
from the skb list using the cpu it was scheduled on.

The lockless case is not _really_ lockless after this patch we
have managed to pull apart the enqueue and dequeue serialization
though.

Thanks for bringing this up. I'll think about it for a bit maybe
there is something we can do here. There is a set of conditions
that if met we can run without the lock. Possibly ONETXQUEUE and
aligned cpu_map is sufficient. We could detect this case and drop
the locking. For existing systems and high Gbps NICs I think (feel
free to correct me) assuming a core per cpu is OK. At some point
though we probably need to revisit this assumption.

.John

> Thanks,
> 
> Paolo
>