netdev - Re: [Patch net-next] net_sched: remove the unsafe __skb_array

Date:   Sat, 23 Dec 2017 22:57:49 -0800
From:   John Fastabend <john.fastabend@...il.com>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Jakub Kicinski <jakub.kicinski@...ronome.com>
Subject: Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()

On 12/22/2017 12:31 PM, Cong Wang wrote:
> On Thu, Dec 21, 2017 at 7:06 PM, John Fastabend
> <john.fastabend@...il.com> wrote:
>> On 12/21/2017 04:03 PM, Cong Wang wrote:
>>> __skb_array_empty() is only safe if array is never resized.
>>> pfifo_fast_dequeue() is called in TX BH context and without
>>> qdisc lock, so even after we disable BH on ->reset() path
>>> we can still race with other CPU's.
>>>
>>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>>> Reported-by: Jakub Kicinski <jakub.kicinski@...ronome.com>
>>> Cc: John Fastabend <john.fastabend@...il.com>
>>> Signed-off-by: Cong Wang <xiyou.wangcong@...il.com>
>>> ---
>>>  net/sched/sch_generic.c | 3 ---
>>>  1 file changed, 3 deletions(-)
>>>
>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>> index 00ddb5f8f430..9279258ce060 100644
>>> --- a/net/sched/sch_generic.c
>>> +++ b/net/sched/sch_generic.c
>>> @@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>       for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>               struct skb_array *q = band2list(priv, band);
>>>
>>> -             if (__skb_array_empty(q))
>>> -                     continue;
>>> -
>>>               skb = skb_array_consume_bh(q);
>>>       }
>>>       if (likely(skb)) {
>>>
>>
>>
>> So this is a performance thing: we don't want to grab the consumer lock on
>> empty bands, which can be fairly common depending on traffic patterns.
> 
> 
> I understand why you had it, but it is just not safe. You don't want
> to achieve a performance gain by crashing the system, right?

Huh? My point is that the patch you submitted here is not a
real fix but a workaround. Peeking at the head of a consumer/producer ring
without a lock _should_ be fine. This _should_ also work with
consumer or producer operations happening at the same time. After some
digging, the issue is in the ptr_ring code.

The peek code (which the empty check calls) is the following:

static inline void *__ptr_ring_peek(struct ptr_ring *r)
{
        if (likely(r->size))
                return r->queue[r->consumer_head];
        return NULL;
}

So what the splat is detecting is the consumer head being 'out of bounds'.
This happens because ptr_ring_discard_one increments consumer_head
and only then checks whether it overran the array size. If the above peek
happens after the increment but before the size check, we get the
splat. There are two ways, as far as I can see, to fix this. The first is
to do the check before incrementing the consumer head. The second, easier
fix is:

--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,

 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
-       return kcalloc(size, sizeof(void *), gfp);
+       return kcalloc(size + 1, sizeof(void *), gfp);
 }

With Jakub's help (Thanks!) I was able to reproduce the original splat
and also verify the above removes it.
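
To make the window concrete, here is a rough userspace model of both fixes.
It is only a sketch: struct model_ring and every model_* function are
made-up names for illustration, not the kernel's ptr_ring code, and the
increment and the wrap are split into two functions purely so the
intermediate state can be observed from a single thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical single-threaded model of a pointer ring (not struct ptr_ring). */
struct model_ring {
	int size;           /* number of usable slots */
	int consumer_head;  /* index of the next slot to consume */
	void **queue;
};

/* Allocate size + 1 zeroed slots, as in the "easier fix": the spare slot
 * keeps a read of queue[size] in bounds. */
static int model_ring_init(struct model_ring *r, int size)
{
	assert(size > 0);
	r->queue = calloc((size_t)size + 1, sizeof(void *));
	if (!r->queue)
		return -1;
	r->size = size;
	r->consumer_head = 0;
	return 0;
}

/* Same shape as the lockless peek quoted above. */
static void *model_ring_peek(const struct model_ring *r)
{
	if (r->size)
		return r->queue[r->consumer_head];
	return NULL;
}

/* Increment first ... */
static void model_discard_increment(struct model_ring *r)
{
	r->consumer_head++;
}

/* ... and wrap afterwards. Between the two steps consumer_head may equal
 * size, which is the window a concurrent peek can hit. */
static void model_discard_wrap(struct model_ring *r)
{
	if (r->consumer_head >= r->size)
		r->consumer_head = 0;
}

/* The other fix: compute the wrapped index before storing it, so the
 * stored consumer_head is always < size. */
static void model_discard_check_first(struct model_ring *r)
{
	int next = r->consumer_head + 1;

	if (next >= r->size)
		next = 0;
	r->consumer_head = next;
}
```

In this model, a peek issued between model_discard_increment() and
model_discard_wrap() reads queue[size]. With the size + 1 allocation that
read stays inside the array and, since the model never writes the spare
slot, returns NULL. With model_discard_check_first() the out-of-range index
is never stored at all. A real lockless version would also have to care
about how the store to consumer_head is published, which this sketch
ignores.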

To be clear, "resizing" an skb_array only refers to changing the actual
array size, not adding/removing elements.

> 
>>
>> Although it's not logical IMO to have both reset and dequeue running at
>> the same time. Some skbs would get through, others would get sent, sort
>> of a mess. I don't see how it can be an issue. The "never resized" bit
>> in the documentation is referring to resizing the ring size, _not_ popping
>> elements off the ring. array_empty just reads the consumer head.
>> The only ring resizing in pfifo_fast should be at init and destroy, where
>> enqueue/dequeue should be disconnected by then. Although based on the
>> trace I missed a case.
> 
> 
> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
> skb_array_consume_bh(), so there is no difference w.r.t. resizing.
> 

Sorry, not following.

> And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
> concurrently.

Yes, and this _should_ be perfectly fine for pfifo_fast. I'm wondering,
though, if this API can be cleaned up. What are the paths that do a reset
without a destroy? Do we really need this pattern where reset
is called and then later destroy? It seems destroy could do the entire
cleanup, which would simplify things. None of this has to do with the splat,
though.

> 
> 
>>
>> I think the right fix is to only call reset/destroy patterns after
>> waiting a grace period and for all tx_action calls in-flight to
>> complete. This is also better going forward for more complex qdiscs.
> 
> But we don't even have rcu read lock in TX BH, do we?
> 
> Also, people certainly don't like yet another synchronize_net()...
> 

This needs a fix and is a _real_ bug, but removing __skb_array_empty()
doesn't help solve it at all. I will work on a fix after the holiday
break. The fix here is to ensure the destroy cannot happen
while tx_action is in flight. That can be done with qdisc_run and by
checking the correct bits in the lockless case.
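
As a very rough illustration of the idea (not the eventual patch), a
"running" bit could gate both the dequeue pass and the destroy. Everything
below is hypothetical: struct model_qdisc and the model_* helpers are
invented for this sketch and use C11 atomics in userspace rather than the
kernel's qdisc_run helpers or state bits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lockless qdisc; not struct Qdisc. */
struct model_qdisc {
	atomic_flag running;  /* set while a dequeue pass is in flight */
	bool destroyed;
};

static void model_qdisc_init(struct model_qdisc *q)
{
	atomic_flag_clear(&q->running);
	q->destroyed = false;
}

/* Returns true if the caller now owns the running bit. */
static bool model_run_begin(struct model_qdisc *q)
{
	/* test_and_set returns the previous value: true means already held. */
	return !atomic_flag_test_and_set_explicit(&q->running,
						  memory_order_acquire);
}

static void model_run_end(struct model_qdisc *q)
{
	atomic_flag_clear_explicit(&q->running, memory_order_release);
}

/* Destroy proceeds only if it can claim the running bit itself, i.e. no
 * dequeue pass is in flight. On success the bit is deliberately left set
 * so no later pass can start on a destroyed qdisc. */
static bool model_try_destroy(struct model_qdisc *q)
{
	if (!model_run_begin(q))
		return false;  /* a pass is in flight; caller retries later */
	assert(!q->destroyed);
	q->destroyed = true;
	return true;
}
```

The point of the sketch is only the ordering: the destroy path fails while a
pass holds the bit and succeeds once the pass has called model_run_end().
How the caller waits or retries, and which existing bits would be reused,
is left open here.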

.John