netdev - Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180103004619-mutt-send-email-mst@kernel.org>
Date:   Wed, 3 Jan 2018 01:12:55 +0200
From:   "Michael S. Tsirkin" <mst@...hat.com>
To:     John Fastabend <john.fastabend@...il.com>
Cc:     David Miller <davem@...emloft.net>, jakub.kicinski@...ronome.com,
        xiyou.wangcong@...il.com, jiri@...nulli.us, netdev@...r.kernel.org
Subject: Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can
 overrun array bounds

On Tue, Jan 02, 2018 at 01:27:23PM -0800, John Fastabend wrote:
> On 01/02/2018 09:17 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 02, 2018 at 07:01:33PM +0200, Michael S. Tsirkin wrote:
> >> On Tue, Jan 02, 2018 at 11:52:19AM -0500, David Miller wrote:
> >>> From: John Fastabend <john.fastabend@...il.com>
> >>> Date: Wed, 27 Dec 2017 19:50:25 -0800
> >>>
> >>>> When running consumer and/or producer operations and empty checks in
> >>>> parallel its possible to have the empty check run past the end of the
> >>>> array. The scenario occurs when an empty check is run while
> >>>> __ptr_ring_discard_one() is in progress. Specifically after the
> >>>> consumer_head is incremented but before (consumer_head >= ring_size)
> >>>> check is made and the consumer head is zeroe'd.
> >>>>
> >>>> To resolve this, without having to rework how consumer/producer ops
> >>>> work on the array, simply add an extra dummy slot to the end of the
> >>>> array. Even if we did a rework to avoid the extra slot it looks
> >>>> like the normal case checks would suffer some so best to just
> >>>> allocate an extra pointer.
> >>>>
> >>>> Reported-by: Jakub Kicinski <jakub.kicinski@...ronome.com>
> >>>> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
> >>>> Signed-off-by: John Fastabend <john.fastabend@...il.com>
> >>>
> >>> Applied, thanks John.
> >>
> >> I think that patch is wrong. I'd rather it got reverted.
> > 
> > And replaced with something like the following - stil untested, but
> > apparently there's some urgency to fix it so posting for review ASAP.
> > 
> 
> So the above ptr_ring patch is meant for the dequeue() case not
> the peek case. Dequeue case shown here,
> 
> static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> {
>         struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>         struct sk_buff *skb = NULL;
>         int band;
> 
>         for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>                 struct skb_array *q = band2list(priv, band);
> 
>                 if (__skb_array_empty(q))
>                         continue;
> 
>                 skb = skb_array_consume_bh(q);
>         }
>         if (likely(skb)) {
>                 qdisc_qstats_cpu_backlog_dec(qdisc, skb);
>                 qdisc_bstats_cpu_update(qdisc, skb);
>                 qdisc_qstats_cpu_qlen_dec(qdisc);
>         }
> 
>         return skb;
> }
> 
> In the dequeue case we use it purely as a hint and then do a proper
> consume (with locks) if needed.  A false negative in this case means
> a consume is happening on another cpu due to a reset op or a
> dequeue op. (an aside but I'm evaluating if we should only allow a
> single dequeue'ing cpu at a time more below?) If its a reset op
> that caused the false negative it is OK because we are flushing the
> array anyways. If it is a dequeue op it is also OK because this core
> will abort but the running core will continue to dequeue avoiding a
> stall. So I believe false negatives are OK in the above function.

I'm not 100% sure I understand. We don't always dequeue until empty.
E.g. it seems that another CPU could dequeue an skb and then stop
because e.g. it reached a byte queue limit, then this CPU gets a false
negativesand stops because of that.


More generally, what makes this usage safe?
Is there a way to formalize it at the API level?

> > John, others, could you pls confirm it's not too bad performance-wise?
> > I'll then split it up properly and re-post.
> > 
> 
> I haven't benchmarked it but in the dequeue case taking a lock for
> every priority even when its empty seems unneeded.

Well it does seem to make the code more comprehensible and more robust.

But OK -  I was looking at fixing the unlocked empty API to make sure it
actually does what it's supposed to. I posted a draft earlier in this
thread, it needs to be looked at in depth to figure out whether it can
ever give false negatives or positives, and document the results.



> > -->
> > 
> > net: don't misuse ptr_ring_peek
> > 
> > ptr_ring_peek only claims to be safe if the result is never
> > dereferenced, which isn't the case for its use in sch_generic.
> > Add locked API variants and use the bh one here.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@...hat.com>
> > 
> > ---
> > 
> 
> [...]
> 
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
> >  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> >  		struct skb_array *q = band2list(priv, band);
> >  
> > -		skb = __skb_array_peek(q);
> > +		skb = skb_array_peek_bh(q);
> 
> Ah I should have added a comment here. For now peek() is only used from
> locking qdiscs. So peek and consume/produce operations will never happen
> in parallel. In this case we should never hit the false negative case with
> my patch or the out of bounds reference without my patch.
> 
> Doing a peek() op without qdisc lock is a bit problematic anyways. With
> current code another cpu could consume the skb and free it. Either we
> can ensure a single consumer runs at a time on an array (not the same as
> qdisc maybe) or just avoid peek operations in this case. My current plan
> was to just avoid peek() ops altogether, they seem unnecessary with the
> types of qdiscs I want to be build.
> 
> Thanks,
> John

For the lockless qdisc, for net-next, do we need the patch above until you fix that though?

> >  	}
> >  
> >  	return skb;
> >