[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130927204116.GJ15690@laptop.programming.kicks-ass.net>
Date: Fri, 27 Sep 2013 22:41:16 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Oleg Nesterov <oleg@...hat.com>
Cc: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
Ingo Molnar <mingo@...nel.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
Linux-MM <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >
> > But if the readers does see BLOCK it will not be an active reader no
> > more; and thus the writer doesn't need to observe and wait for it.
>
> I meant they both can block, but please ignore. Today I simply can't
> understand what I was thinking about yesterday.
I think we all know that state all too well ;-)
> I tried hard to find any hole in this version but failed, I believe it
> is correct.
Yay!
> But, could you help me to understand some details?
I'll try, but I'm not too bright atm myself :-)
> > +void __get_online_cpus(void)
> > +{
> > +again:
> > + /* See __srcu_read_lock() */
> > + __this_cpu_inc(__cpuhp_refcount);
> > + smp_mb(); /* A matches B, E */
> > + __this_cpu_inc(cpuhp_seq);
> > +
> > + if (unlikely(__cpuhp_state == readers_block)) {
>
> Note that there is no barrier() after inc(seq) and __cpuhp_state
> check, this inc() can be "postponed" till ...
>
> > +void __put_online_cpus(void)
> > {
> > + /* See __srcu_read_unlock() */
> > + smp_mb(); /* C matches D */
>
> ... this mb() in __put_online_cpus().
>
> And this is fine! The qustion is, perhaps it would be more "natural"
> and understandable to shift this_cpu_inc(cpuhp_seq) into
> __put_online_cpus().
Possibly; I never got further than that the required order is:
ref++
MB
seq++
MB
ref--
It doesn't matter if the seq++ is in the lock or unlock primitive. I
never considered one place more natural than the other.
> We need to ensure 2 things:
>
> 1. The reader should notic state = BLOCK or the writer should see
> inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
> __get_online_cpus() and in cpu_hotplug_begin().
>
> We do not care if the writer misses some inc(__cpuhp_refcount)
> in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
> state = readers_block (and inc(cpuhp_seq) can't help anyway).
Agreed.
> 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
> from __put_online_cpus() (note that the writer can miss the
> corresponding inc() if it was done on another CPU, so this dec()
> can lead to sum() == 0), it should also notice the change in cpuhp_seq.
>
> Fortunately, this can only happen if the reader migrates, in
> this case schedule() provides a barrier, the writer can't miss
> the change in cpuhp_seq.
Again, agreed; this is also the message of the second comment in
cpuhp_readers_active_check() by Paul.
> IOW. Unless I missed something, cpuhp_seq is actually needed to
> serialize __put_online_cpus()->this_cpu_dec(__cpuhp_refcount) and
> and /* D matches C */ in cpuhp_readers_active_check(), and this
> is not immediately clear if you look at __get_online_cpus().
>
> I do not suggest to change this code, but please tell me if my
> understanding is not correct.
I think you're entirely right.
> > +static bool cpuhp_readers_active_check(void)
> > {
> > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > +
> > + smp_mb(); /* B matches A */
> > +
> > + /*
> > + * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > + * we are guaranteed to also see its __cpuhp_refcount increment.
> > + */
> >
> > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > + return false;
> >
> > + smp_mb(); /* D matches C */
>
> It seems that both barries could be smp_rmb() ? I am not sure the comments
> from srcu_readers_active_idx_check() can explain mb(), note that
> __srcu_read_lock() always succeeds unlike get_cpus_online().
I see what you mean; cpuhp_readers_active_check() is all purely reads;
there are no writes to order.
Paul; is there any argument for the MB here as opposed to RMB; and if
not should we change both these and SRCU?
> > void cpu_hotplug_done(void)
> > {
> > + /* Signal the writer is done, no fast path yet. */
> > + __cpuhp_state = readers_slow;
> > + wake_up_all(&cpuhp_readers);
> > +
> > + /*
> > + * The wait_event()/wake_up_all() prevents the race where the readers
> > + * are delayed between fetching __cpuhp_state and blocking.
> > + */
> > +
> > + /* See percpu_up_write(); readers will no longer attempt to block. */
> > + synchronize_sched();
> > +
> > + /* Let 'em rip */
> > + __cpuhp_state = readers_fast;
> > + current->cpuhp_ref--;
> > +
> > + /*
> > + * Wait for any pending readers to be running. This ensures readers
> > + * after writer and avoids writers starving readers.
> > + */
> > + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > }
>
> OK, to some degree I can understand "avoids writers starving readers"
> part (although the next writer should do synchronize_sched() first),
> but could you explain "ensures readers after writer" ?
Suppose reader A sees state == BLOCK and goes to sleep; our writer B
does cpu_hotplug_done() and wakes all pending readers. If for some
reason A doesn't schedule to inc ref until B again executes
cpu_hotplug_begin() and state is once again BLOCK, A will not have made
any progress.
The waitcount increment before __put_online_cpus() ensures
cpu_hotplug_done() sees the !0 waitcount and waits until out reader runs
far enough to at least pass the dec_and_test().
And once past the dec_and_test() preemption is disabled and the
sched_sync() in a new cpu_hotplug_begin() will suffice to guarantee
we'll have acquired a reference and are an active reader.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists