Open Source and information security mailing list archives
 
Message-ID: <32acd1d0-48b0-eaff-c971-9febdd33fcbd@gmail.com>
Date:   Sat, 24 Mar 2018 15:10:29 -0700
From:   John Fastabend <john.fastabend@...il.com>
To:     Eric Dumazet <eric.dumazet@...il.com>, xiyou.wangcong@...il.com,
        jiri@...nulli.us, davem@...emloft.net
Cc:     netdev@...r.kernel.org
Subject: Re: [net PATCH] net: sched, fix OOO packets with pfifo_fast

On 03/24/2018 02:15 PM, Eric Dumazet wrote:
> 
> 
> On 03/24/2018 01:13 PM, John Fastabend wrote:
>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>> enqueue threads and dequeue threads to run in parallel. On the
>> enqueue side the skb bit ooo_okay is used to ensure all related
>> skbs are enqueued in-order. On the dequeue side though there is
>> no similar logic. What we observe is with fewer queues than CPUs
>> it is possible to re-order packets when two instances of
>> __qdisc_run() are running in parallel. Each thread will dequeue
>> an skb and then whichever thread calls the ndo op first will have
>> its skb sent on the wire first. This doesn't typically happen because
>> qdisc_run() is usually triggered by the same core that did the
>> enqueue. However, drivers will trigger __netif_schedule()
>> when queues are transitioning from stopped to awake using the
>> netif_tx_wake_* APIs. When this happens netif_schedule() calls
>> qdisc_run() on the same CPU that did the netif_tx_wake_* which
>> is usually done in the interrupt completion context. This CPU
>> is selected with the irq affinity which is unrelated to the
>> enqueue operations.
>>
>> To resolve this we add a RUNNING bit to the qdisc to ensure
>> only a single dequeue per qdisc is running. Enqueue and dequeue
>> operations can still run in parallel and also on multi queue
>> NICs we can still have a dequeue in-flight per qdisc, which
>> is typically per CPU.
>>
>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>> Reported-by: Jakob Unterwurzacher <jakob.unterwurzacher@...obroma-systems.com>
>> Signed-off-by: John Fastabend <john.fastabend@...il.com>
>> ---
>>  include/net/sch_generic.h |    1 +
>>  net/sched/sch_generic.c   |   13 ++++++++++---
>>  2 files changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 2092d33..8da3267 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -30,6 +30,7 @@ struct qdisc_rate_table {
>>  enum qdisc_state_t {
>>  	__QDISC_STATE_SCHED,
>>  	__QDISC_STATE_DEACTIVATED,
>> +	__QDISC_STATE_RUNNING,
>>  };
>>  
>>  struct qdisc_size_table {
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 7e3fbe9..29a1b47 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -377,12 +377,17 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>>  	struct netdev_queue *txq;
>>  	struct net_device *dev;
>>  	struct sk_buff *skb;
>> -	bool validate;
>> +	bool more, validate;
>>  
>>  	/* Dequeue packet */
>> +	if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
>> +		return false;
>> +

[...]

> 
> This adds a pair of atomic operations in the fast path, only for pfifo_fast's sake.
> 

Yeah, we can wrap these in an `if (q->flags & TCQ_F_NOLOCK)` check to
avoid the cost in cases where it's not needed. Alternatively, for net we
could turn off NOLOCK in pfifo_fast and fix it in net-next with something
more complete.

> qdisc_restart() is a misleading name, since it is actually called from __qdisc_run()
> 
> 

I'll change it in net-next.

.John


