Message-ID: <72849568-7336-6be7-538c-534146e69def@gmail.com>
Date: Wed, 27 Dec 2017 15:53:36 -0800
From: John Fastabend <john.fastabend@...il.com>
To: Cong Wang <xiyou.wangcong@...il.com>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>,
Jakub Kicinski <jakub.kicinski@...ronome.com>
Subject: Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
On 12/27/2017 10:29 AM, Cong Wang wrote:
> On Sat, Dec 23, 2017 at 10:57 PM, John Fastabend
> <john.fastabend@...il.com> wrote:
>> On 12/22/2017 12:31 PM, Cong Wang wrote:
>>> I understand why you had it, but it is just not safe. You don't want
>>> to achieve a performance gain by crashing the system, right?
>>
>> Huh? My point is that the patch you submitted here is not a real fix
>> but a workaround. Peeking at the head of a consumer/producer ring
>> without a lock _should_ be fine, even with consumer or producer
>> operations happening at the same time. After some digging, the issue
>> is in the ptr_ring code.
>
>
> The comments disagree with you:
>
> /* Might be slightly faster than skb_array_empty below, but only safe if the
> * array is never resized. Also, callers invoking this in a loop must take care
> * to use a compiler barrier, for example cpu_relax().
> */
>
> If the comments are right, you are missing a barrier here too.
>
The comment talks about resizing the array, not concurrent consumer/producer
operations, so I think it is OK. We also don't call __skb_array_empty() on
the same skb_array in a loop here; each call checks a different skb_array.
So we probably don't need a barrier.
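
For reference, quoting from memory of include/linux/skb_array.h (so
treat this as a sketch rather than gospel), the helper in question is
just a thin unlocked wrapper around the ring peek:

static inline bool __skb_array_empty(struct skb_array *a)
{
        /* reads the ring without taking the consumer lock */
        return !__ptr_ring_peek(&a->ring);
}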
>
>>
>> The peek code (what empty check calls) is the following,
>>
>> static inline void *__ptr_ring_peek(struct ptr_ring *r)
>> {
>>         if (likely(r->size))
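>>                 /* lockless read: consumer_head may be mid-update */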
>>                 return r->queue[r->consumer_head];
>>         return NULL;
>> }
>>
>> So what the splat is detecting is consumer_head being out of bounds.
>> This happens because __ptr_ring_discard_one increments consumer_head
>> and only then checks whether it overran the array size. If the above
>> peek happens after the increment but before the size check, we get the
>> splat. There are two ways, as far as I can see, to fix this: either do
>> the check before incrementing the consumer head, or apply the easier
>> fix,
>>
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>>
>>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
>>  {
>> -	return kcalloc(size, sizeof(void *), gfp);
>> +	return kcalloc(size + 1, sizeof(void *), gfp);
>>  }
>>
>> With Jakub's help (Thanks!) I was able to reproduce the original splat
>> and also verify the above removes it.
>>
>> To be clear, "resizing" an skb_array only refers to changing the actual
>> array size, not adding/removing elements.
>
> I never looked into the implementation, I just simply trusted the comments.
>
> At least the comments above __skb_array_empty() need to be improved.
>>
>>
>>>
>>>>
>>>> Although it's not logical IMO to have both reset and dequeue running at
>>>> the same time. Some skbs would get dropped while others get sent, sort
>>>> of a mess. I don't see how it can be an issue. The "never resized" bit
>>>> in the documentation refers to changing the ring size, _not_ popping
>>>> elements off the ring. array_empty just reads the consumer head.
>>>> The only ring resizing in pfifo_fast should be at init and destroy, by
>>>> which point enqueue/dequeue should be disconnected. Although based on
>>>> the trace I missed a case.
>>>
>>>
>>> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
>>> skb_array_consume_bh(), so there is no difference w.r.t. resizing.
>>>
>>
>> Sorry, not following.
>>
>>> And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast:
>>> htb_graft() calls qdisc_replace(), which calls qdisc_reset() on pfifo_fast,
>>> so clearly pfifo_fast_reset() can run concurrently with
>>> pfifo_fast_dequeue().
>>
>> Yes, and this _should_ be perfectly fine for pfifo_fast. I'm wondering,
>> though, if this API can be cleaned up. What are the paths that do a
>> reset without a destroy? Do we really need this pattern where reset is
>> called and then destroy later? It seems destroy could do the entire
>> cleanup, which would simplify things. None of this has to do with the
>> splat, though.
>
> I don't follow your point any more.
>
> We are talking about ->reset() racing with ->dequeue(), which is the
> cause of the bug, right?
>
Yes. But it should be OK for qdiscs to have reset and dequeue running
in parallel, so the bug is in ptr_ring.
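
To illustrate, here is a simplified sketch of the discard path as I
read it (details trimmed, not verbatim kernel code):

static inline void __ptr_ring_discard_one(struct ptr_ring *r)
{
        /* ... zeroing of consumed entries elided ... */

        r->consumer_head++;             /* (1) head may now equal r->size */

        if (unlikely(r->consumer_head >= r->size)) {
                r->consumer_head = 0;   /* (2) wrap back to slot 0 */
                r->consumer_tail = 0;
        }
}

If a lockless __ptr_ring_peek() runs between (1) and (2), it reads
r->queue[r->size], one slot past the allocation, which is exactly what
the splat reports. With the kcalloc(size + 1, ...) change above, that
transient read lands on a valid, always-NULL slot and is harmless.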
> If you expect ->reset() to run in parallel with ->dequeue() for pfifo_fast,
> why did you even mention synchronize_net() in the beginning???
A mistake, probably; I am tracking two separate issues: first, this
skb_array out-of-bounds access, and second, tx_action being called
while destroy is in progress. When I first looked at this I thought
the splat came from the tx_action issue, so I started talking about
how to sync the destroy and tx_action paths. Either way it was a
mistake to mention it in this thread.
> Also, you changed the code too, to adjust the RCU grace period.
>
>
>>
>>>
>>>
>>>>
>>>> I think the right fix is to only run the reset/destroy paths after
>>>> waiting a grace period and for all in-flight tx_action calls to
>>>> complete. This is also better going forward for more complex qdiscs.
>>>
>>> But we don't even have the rcu read lock in TX BH, do we?
>>>
>>> Also, people certainly don't like yet another synchronize_net()...
>>>
>>
>> This needs a fix and is a _real_ bug, but removing __skb_array_empty()
>> doesn't help solve it at all. I will work on a fix after the holiday
>> break. The fix here is to ensure destroy cannot happen while tx_action
>> is in flight. It can be done with qdisc_run() and checking the correct
>> state bits in the lockless case.
>
> Sounds like you missed a lot of things with your "lockless" patches...
> First the qdisc rcu callback, second the rcu read lock in TX BH...
A couple of things, yes: TX BH, the ptr_ring bug, and the skb list
free, fixed in another patch. I assume the qdisc rcu callbacks you
mention are the TX BH case, or is it something else?
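
For the destroy vs. tx_action race, the direction I have in mind is
roughly this (a hypothetical sketch, not a tested patch; it leans on
the existing __QDISC_STATE_DEACTIVATED bit):

        /* lockless dequeue/tx_action path: bail out once teardown
         * has started
         */
        if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
                return NULL;

The destroy side would set the bit first and then wait out a grace
period (dev_deactivate() already provides one, if I remember right)
before freeing anything.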
The other possible bug, which still needs analysis, is whether we need
to add back the qdisc destroy rcu callback, or whether there is a grace
period in all cases. My guess is the current code paths in pfifo_fast
don't have an issue, but future work will.
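
If we do need it back, it would look roughly like the old callback we
removed (sketch from memory, details such as the per-cpu stats freeing
trimmed):

static void qdisc_rcu_free(struct rcu_head *head)
{
        struct Qdisc *qdisc = container_of(head, struct Qdisc, rcu_head);

        /* runs after a grace period, so no lockless dequeue can
         * still be touching this qdisc
         */
        kfree((char *)qdisc - qdisc->padded);
}

with qdisc_destroy() calling call_rcu(&qdisc->rcu_head, qdisc_rcu_free)
instead of freeing directly.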
>
> My quick one-line fix is meant to address this bug before you go deeper
> into this rabbit hole.
>
Nope, it's not a good fix, and we are in the net-next timeframe here,
so let's fix the root cause in ptr_ring. I will post a patch in the
next couple of days, after PTO. It's a minor issue and can wait.
.John