Date:   Tue, 2 Jan 2018 16:25:03 -0800
From:   John Fastabend <john.fastabend@...il.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     David Miller <davem@...emloft.net>, jakub.kicinski@...ronome.com,
        xiyou.wangcong@...il.com, jiri@...nulli.us, netdev@...r.kernel.org
Subject: Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can
 overrun array bounds

On 01/02/2018 03:12 PM, Michael S. Tsirkin wrote:
> On Tue, Jan 02, 2018 at 01:27:23PM -0800, John Fastabend wrote:
>> On 01/02/2018 09:17 AM, Michael S. Tsirkin wrote:
>>> On Tue, Jan 02, 2018 at 07:01:33PM +0200, Michael S. Tsirkin wrote:
>>>> On Tue, Jan 02, 2018 at 11:52:19AM -0500, David Miller wrote:
>>>>> From: John Fastabend <john.fastabend@...il.com>
>>>>> Date: Wed, 27 Dec 2017 19:50:25 -0800
>>>>>
>>>>>> When running consumer and/or producer operations and empty checks in
>>>>>> parallel it's possible for the empty check to run past the end of the
>>>>>> array. The scenario occurs when an empty check runs while
>>>>>> __ptr_ring_discard_one() is in progress, specifically after the
>>>>>> consumer_head is incremented but before the (consumer_head >= ring_size)
>>>>>> check is made and the consumer head is zeroed.
>>>>>>
>>>>>> To resolve this, without having to rework how consumer/producer ops
>>>>>> work on the array, simply add an extra dummy slot to the end of the
>>>>>> array. Even if we did a rework to avoid the extra slot it looks
>>>>>> like the normal case checks would suffer some so best to just
>>>>>> allocate an extra pointer.
>>>>>>
>>>>>> Reported-by: Jakub Kicinski <jakub.kicinski@...ronome.com>
>>>>>> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
>>>>>> Signed-off-by: John Fastabend <john.fastabend@...il.com>
>>>>>
>>>>> Applied, thanks John.
>>>>
>>>> I think that patch is wrong. I'd rather it got reverted.
>>>
>>> And replaced with something like the following - still untested, but
>>> apparently there's some urgency to fix it, so posting for review ASAP.
>>>
>>
>> So the above ptr_ring patch is meant for the dequeue() case not
>> the peek case. Dequeue case shown here,
>>
>> static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>> {
>>         struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>         struct sk_buff *skb = NULL;
>>         int band;
>>
>>         for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>                 struct skb_array *q = band2list(priv, band);
>>
>>                 if (__skb_array_empty(q))
>>                         continue;
>>
>>                 skb = skb_array_consume_bh(q);
>>         }
>>         if (likely(skb)) {
>>                 qdisc_qstats_cpu_backlog_dec(qdisc, skb);
>>                 qdisc_bstats_cpu_update(qdisc, skb);
>>                 qdisc_qstats_cpu_qlen_dec(qdisc);
>>         }
>>
>>         return skb;
>> }
>>
>> In the dequeue case we use it purely as a hint and then do a proper
>> consume (with locks) if needed. A false negative in this case means
>> a consume is happening on another cpu due to a reset op or a
>> dequeue op. (As an aside, I'm evaluating whether we should only allow
>> a single dequeueing cpu at a time; more below.) If it's a reset op
>> that caused the false negative it is OK because we are flushing the
>> array anyway. If it is a dequeue op it is also OK because this core
>> will abort but the running core will continue to dequeue, avoiding a
>> stall. So I believe false negatives are OK in the above function.
> 
> I'm not 100% sure I understand. We don't always dequeue until empty.
> E.g. it seems that another CPU could dequeue an skb and then stop
> because e.g. it reached a byte queue limit, then this CPU gets a false
> negative and stops because of that.
> 

If we don't dequeue until empty because byte or skb limits are reached,
then netif_reschedule is called and we reschedule another qdisc_run()
in tx_action.

> 
> More generally, what makes this usage safe?
> Is there a way to formalize it at the API level?
> 

Right, I think these are good questions. I think the ptr_ring API should
allow a peek operation to be used without a lock. The user has to ensure
they only use it as a hint, and if it is dereferenced the user needs to
ensure the object is not freed by some other codepath while it is
being dereferenced. The existing API seems to match this.

This is how I used it in pfifo_fast, expecting the above to be true. The
API allows for false negatives, which _should_ be OK if the user is
expecting this. Alternatively, we could make it give false positives if
you want, and that would also work for me considering this case is hit
very rarely.

>>> John, others, could you pls confirm it's not too bad performance-wise?
>>> I'll then split it up properly and re-post.
>>>
>>
>> I haven't benchmarked it, but in the dequeue case taking a lock for
>> every priority even when it's empty seems unneeded.
> 
> Well it does seem to make the code more comprehensible and more robust.
> 

It's a trade-off between performance and robustness.

> But OK -  I was looking at fixing the unlocked empty API to make sure it
> actually does what it's supposed to. I posted a draft earlier in this
> thread, it needs to be looked at in depth to figure out whether it can
> ever give false negatives or positives, and document the results.
> 
> 

I'll look at it. But I still think keeping a lockless version makes sense
for many use cases.

> 
>>> -->
>>>
>>> net: don't misuse ptr_ring_peek
>>>
>>> ptr_ring_peek only claims to be safe if the result is never
>>> dereferenced, which isn't the case for its use in sch_generic.
>>> Add locked API variants and use the bh one here.
>>>
>>> Signed-off-by: Michael S. Tsirkin <mst@...hat.com>
>>>
>>> ---
>>>
>>
>> [...]
>>
>>> --- a/net/sched/sch_generic.c
>>> +++ b/net/sched/sch_generic.c
>>> @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>  		struct skb_array *q = band2list(priv, band);
>>>  
>>> -		skb = __skb_array_peek(q);
>>> +		skb = skb_array_peek_bh(q);
>>
>> Ah, I should have added a comment here. For now peek() is only used from
>> locking qdiscs, so peek and consume/produce operations will never happen
>> in parallel. In this case we should never hit the false negative case with
>> my patch, or the out-of-bounds reference without my patch.
>>
>> Doing a peek() op without the qdisc lock is a bit problematic anyway. With
>> the current code another cpu could consume the skb and free it. Either we
>> can ensure a single consumer runs at a time on an array (not the same as
>> a qdisc maybe) or just avoid peek operations in this case. My current plan
>> was to avoid peek() ops altogether; they seem unnecessary with the
>> types of qdiscs I want to build.
>>
>> Thanks,
>> John
> 
> For the lockless qdisc in net-next, do we need the patch above until you fix that, though?
> 

No, I think after this patch (net: ptr_ring: otherwise safe empty checks...) is
applied we do not need any additional fixes in net-next. Future work will
require the above patch (the one you provided) though, so it's useful work.

I'll do another review of the false positive case though to be sure the
current code is OK wrt handling false positives and any potential stalls.

>>>  	}
>>>  
>>>  	return skb;
>>>
