Date:	Tue, 28 May 2013 16:47:28 -0700
From:	Kent Overstreet <koverstreet@...gle.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-aio@...ck.org, akpm@...ux-foundation.org,
	Zach Brown <zab@...hat.com>, Felipe Balbi <balbi@...com>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Mark Fasheh <mfasheh@...e.com>,
	Joel Becker <jlbec@...lplan.org>,
	Rusty Russell <rusty@...tcorp.com.au>,
	Jens Axboe <axboe@...nel.dk>,
	Asai Thambi S P <asamymuthupa@...ron.com>,
	Selvan Mani <smani@...ron.com>,
	Sam Bradshaw <sbradshaw@...ron.com>,
	Jeff Moyer <jmoyer@...hat.com>,
	Al Viro <viro@...iv.linux.org.uk>,
	Benjamin LaHaise <bcrl@...ck.org>,
	Oleg Nesterov <oleg@...hat.com>,
	Christoph Lameter <cl@...ux-foundation.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Wed, May 15, 2013 at 10:37:20AM -0700, Tejun Heo wrote:
> Ooh, I was referring to percpu_ref_dead() not percpu_ref_kill().
> percpu_ref_dead() reminds me of some of the work state query functions
> in workqueue which ended up being misused in ways that were subtly
> racy, so I'm curious why it's necessary and how it's supposed to be
> used.

With the other changes we talked about, I ended up killing
percpu_ref_dead().

> >  * Initially, a percpu refcount is just a set of percpu counters. Initially, we
> >  * don't try to detect the ref hitting 0 - which means that get/put can just
> >  * increment or decrement the local counter. Note that the counter on a
> >  * particular cpu can (and will) wrap - this is fine, when we go to shutdown the
> >  * percpu counters will all sum to the correct value (because modular arithmetic
> >  * is commutative).
> 
> Can you please expand it on a bit and, more importantly, describe in
> what limits, it's safe?  This should be safe as long as the actual sum
> of refcnts given out doesn't overflow the original type, right? 

Precisely.

> It'd be great if that is explained clearly in more intuitive way.  The
> only actual explanation above is "modular arithmetic is commutative"
> which is a very compact way to put it and I really think it deserves
> an easier explanation.

I'm not sure I know of any good way of explaining it intuitively, but
here's this at least...

 * (More precisely: because modular arithmetic is commutative, the sum of all the
 * pcpu_count vars will be equal to what it would have been if all the gets and
 * puts were done to a single integer, even if some of the percpu integers
 * overflow or underflow).

> > > Are we sure this is enough?  1<<31 is a fairly large number but it's
> > > just easy enough to breach from time to time and it's gonna be hellish
> > > to reproduce / debug when it actually overflows.  Maybe we want
> > > atomic64_t w/ 1LLU << 63 bias?  Or is there something else which
> > > guarantees that the bias can't over/underflow?
> > 
> > Well, it has the effect of halving the usable range of the refcount,
> > which I think is probably ok - the thing is, the range of an atomic_t
> > doesn't really correspond to anything useful on 64 bit machines so if
> > you're concerned about overflow you probably need to be using an
> > atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
> > too.
> 
> I'm not worrying about the total refcnt overflowing 31 bits, that's
> fine.  What I'm worried about is the percpu refs having systematic
> drift (got on certain cpus and put on others), and the total counter
> being overflowed while percpu draining is in progress.  To me, the
> problem is that the bias which tags that draining in progress can be
> overflown by percpu refs.  The summing can be the same but the tagging
> should be put where summing can't overflow it.  It'd be great if you
> can explain in the comment in what range it's safe and why, because
> that'd make the limits clear to both you and other people reading the
> code and would help a lot in deciding whether it's safe enough.

(This is why I initially didn't (and still don't) like the bias method:
it makes things harder to reason about.)

The fact that the counter is percpu is irrelevant w.r.t. the bias; we
sum all the percpu counters up before adding them to the atomic counter
and subtracting the bias, so when we go to add the percpu counters it's
no different from if the percpu counter was a single integer all along.

So there are only two counters we're adding together: the percpu
counter (just think of it as a single integer) that we started out
using, and the atomic counter that all the gets and puts switch over to
at some point in time.

Note that there's no systematic drift here; at time t all the gets and
puts were happening to one counter, and then at time t+1 they switch to
a different counter.

We know the sum of the counters will be positive (again, because modular
arithmetic is still commutative; when we sum the counters it's as if
there was a single counter all along) but that doesn't mean either of
the individual counters can't be negative.

(Actually, unless I'm mistaken, in this version the percpu counter can
never go negative - it definitely could with dynamic percpu allocation,
as you need an atomic_t -> percpu transition when the atomic_t was > 0
for the percpu counter to go negative; but in this version we start out
using the percpu counters and the atomic_t at 0 (ignoring for the moment
the bias and the initial ref).)

So, the sum must be positive but the atomic_t could be negative. How
negative?

We can't do a get() on the percpu counters after we've seen that the ref
is no longer in percpu mode - so once we've done one put to the
atomic_t we can do more puts (or gets) to the atomic_t, but we can't do
a get to the percpu counter.

And we can't do more puts than there have been gets - because the sum
can't be negative. So the most puts() we can do at any given time is the
real count, or sum of the percpu ref and atomic_t.

Therefore, the amount the atomic_t can go negative is bounded by the
maximum value of the refcount.

So if we say arbitrarily that the maximum legal value of the refcount is
- say - 1U << 31, then the atomic_t will always be greater than
-((int) (1U << 31)).

So as long as the total number of outstanding refs never exceeds the
bias we're fine.

QED.

> I probably should have made it clearer.  Sorry about that.  tryget()
> is fine.  I was curious about count() as it's always a bit dangerous a
> query interface which is racy and can return something unexpected like
> false zero or underflowed refcnt.

Yeah, it is; it was intended just for the module code, where it's only
used for the value lsmod shows.

> Let's just have percpu_ref_kill(ref, release) which puts the base ref
> and invokes release whenever it's done.

Release has to be stored in struct percpu_ref so it can be invoked
after a call_rcu() (percpu_ref_kill() -> call_rcu() ->
percpu_ref_kill_rcu() -> percpu_ref_put()), so I'm passing it to
percpu_ref_init(), but yeah.