[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180103004619-mutt-send-email-mst@kernel.org>
Date: Wed, 3 Jan 2018 01:12:55 +0200
From: "Michael S. Tsirkin" <mst@...hat.com>
To: John Fastabend <john.fastabend@...il.com>
Cc: David Miller <davem@...emloft.net>, jakub.kicinski@...ronome.com,
xiyou.wangcong@...il.com, jiri@...nulli.us, netdev@...r.kernel.org
Subject: Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can
overrun array bounds
On Tue, Jan 02, 2018 at 01:27:23PM -0800, John Fastabend wrote:
> On 01/02/2018 09:17 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 02, 2018 at 07:01:33PM +0200, Michael S. Tsirkin wrote:
> >> On Tue, Jan 02, 2018 at 11:52:19AM -0500, David Miller wrote:
> >>> From: John Fastabend <john.fastabend@...il.com>
> >>> Date: Wed, 27 Dec 2017 19:50:25 -0800
> >>>
> >>>> When running consumer and/or producer operations and empty checks in
> >>>> parallel its possible to have the empty check run past the end of the
> >>>> array. The scenario occurs when an empty check is run while
> >>>> __ptr_ring_discard_one() is in progress. Specifically after the
> >>>> consumer_head is incremented but before (consumer_head >= ring_size)
> >>>> check is made and the consumer head is zeroe'd.
> >>>>
> >>>> To resolve this, without having to rework how consumer/producer ops
> >>>> work on the array, simply add an extra dummy slot to the end of the
> >>>> array. Even if we did a rework to avoid the extra slot it looks
> >>>> like the normal case checks would suffer some so best to just
> >>>> allocate an extra pointer.
> >>>>
> >>>> Reported-by: Jakub Kicinski <jakub.kicinski@...ronome.com>
> >>>> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
> >>>> Signed-off-by: John Fastabend <john.fastabend@...il.com>
> >>>
> >>> Applied, thanks John.
> >>
> >> I think that patch is wrong. I'd rather it got reverted.
> >
> > And replaced with something like the following - stil untested, but
> > apparently there's some urgency to fix it so posting for review ASAP.
> >
>
> So the above ptr_ring patch is meant for the dequeue() case not
> the peek case. Dequeue case shown here,
>
> static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> {
> struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> struct sk_buff *skb = NULL;
> int band;
>
> for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> struct skb_array *q = band2list(priv, band);
>
> if (__skb_array_empty(q))
> continue;
>
> skb = skb_array_consume_bh(q);
> }
> if (likely(skb)) {
> qdisc_qstats_cpu_backlog_dec(qdisc, skb);
> qdisc_bstats_cpu_update(qdisc, skb);
> qdisc_qstats_cpu_qlen_dec(qdisc);
> }
>
> return skb;
> }
>
> In the dequeue case we use it purely as a hint and then do a proper
> consume (with locks) if needed. A false negative in this case means
> a consume is happening on another cpu due to a reset op or a
> dequeue op. (an aside but I'm evaluating if we should only allow a
> single dequeue'ing cpu at a time more below?) If its a reset op
> that caused the false negative it is OK because we are flushing the
> array anyways. If it is a dequeue op it is also OK because this core
> will abort but the running core will continue to dequeue avoiding a
> stall. So I believe false negatives are OK in the above function.
I'm not 100% sure I understand. We don't always dequeue until empty.
E.g. it seems that another CPU could dequeue an skb and then stop
because e.g. it reached a byte queue limit, then this CPU gets a false
negativesand stops because of that.
More generally, what makes this usage safe?
Is there a way to formalize it at the API level?
> > John, others, could you pls confirm it's not too bad performance-wise?
> > I'll then split it up properly and re-post.
> >
>
> I haven't benchmarked it but in the dequeue case taking a lock for
> every priority even when its empty seems unneeded.
Well it does seem to make the code more comprehensible and more robust.
But OK - I was looking at fixing the unlocked empty API to make sure it
actually does what it's supposed to. I posted a draft earlier in this
thread, it needs to be looked at in depth to figure out whether it can
ever give false negatives or positives, and document the results.
> > -->
> >
> > net: don't misuse ptr_ring_peek
> >
> > ptr_ring_peek only claims to be safe if the result is never
> > dereferenced, which isn't the case for its use in sch_generic.
> > Add locked API variants and use the bh one here.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@...hat.com>
> >
> > ---
> >
>
> [...]
>
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
> > for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> > struct skb_array *q = band2list(priv, band);
> >
> > - skb = __skb_array_peek(q);
> > + skb = skb_array_peek_bh(q);
>
> Ah I should have added a comment here. For now peek() is only used from
> locking qdiscs. So peek and consume/produce operations will never happen
> in parallel. In this case we should never hit the false negative case with
> my patch or the out of bounds reference without my patch.
>
> Doing a peek() op without qdisc lock is a bit problematic anyways. With
> current code another cpu could consume the skb and free it. Either we
> can ensure a single consumer runs at a time on an array (not the same as
> qdisc maybe) or just avoid peek operations in this case. My current plan
> was to just avoid peek() ops altogether, they seem unnecessary with the
> types of qdiscs I want to be build.
>
> Thanks,
> John
For the lockless qdisc, for net-next, do we need the patch above until you fix that though?
> > }
> >
> > return skb;
> >
Powered by blists - more mailing lists