[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070928144714.GA397@tv-sign.ru>
Date: Fri, 28 Sep 2007 18:47:14 +0400
From: Oleg Nesterov <oleg@...sign.ru>
To: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc: linux-kernel@...r.kernel.org, linux-rt-users@...r.kernel.org,
mingo@...e.hu, akpm@...ux-foundation.org, dipankar@...ibm.com,
josht@...ux.vnet.ibm.com, tytso@...ibm.com, dvhltc@...ibm.com,
tglx@...utronix.de, a.p.zijlstra@...llo.nl, bunk@...nel.org,
ego@...ibm.com, srostedt@...hat.com
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU
On 09/27, Paul E. McKenney wrote:
>
> On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
> >
> > Yes, yes, I see now. We really need this barriers, except I think
> > rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,
> >
> > // rcu_try_flip_waitzero()
> > if (A == 0) {
> > mb();
> > B == 0;
> > }
> >
> > Do we really need the mb() in this case? How it is possible that STORE
> > goes before LOAD? "Obviously", the LOAD should be completed first, no?
>
> Suppose that A was most recently stored by a CPU that shares a store
> buffer with this CPU. Then it is possible that some other CPU sees
> the store to B as happening before the store that "A==0" above is
> loading from. This other CPU would naturally conclude that the store
> to B must have happened before the load from A.
Ah, I was confused by the comment,
smp_mb(); /* Don't call for memory barriers before we see zero. */
^^^^^^^^^^^^^^^^^^
So, in fact, we need this barrier to make sure that _other_ CPUs see these
changes in order, thanks. Of course, _we_ already saw zero.
But in that particular case this doesn't matter, rcu_try_flip_waitzero()
is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
it doesn't really need the barrier? (besides, it is always called under
fliplock).
> In more detail, suppose that CPU 0 and 1 share a store buffer, and
> that CPU 2 and 3 share a second store buffer. This happens naturally
> if CPUs 0 and 1 are really just different hardware threads within a
> single core.
>
> So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
> and that the cacheline for B is initially owned by CPUs 0 and 1.
> Then consider the following sequence of events:
>
> o CPU 0 stores zero to A. This is a cache miss, so the new value
> for A is placed in CPU 0's and 1's store buffer.
>
> o CPU 1 executes the above code, first loading A. It sees
> the value of A==0 in the store buffer, and therefore
> stores zero to B, which hits in the cache. (I am assuming
> that we left out the mb() above).
>
> o CPU 2 loads from B, which misses the cache, and gets the
> value that CPU 1 stored. Suppose it checks the value,
> and based on this check, loads A. The old value of A might
> still be in cache, which would lead CPU 2 to conclude that
> the store to B by CPU 1 must have happened before the store
> to A by CPU 0.
>
> Memory barriers would prevent this confusion.
Thanks a lot!!! This fills another gap in my understanding.
OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
store buffer, the situation is simple, we can replace "CPU 0 stores" with
"CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?
Suppose that CPU 1 does
wmb();
B = 0
Can we assume that CPU 2 doing
if (B == 0) {
rmb();
must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?
> > > > If this is possible, can't we move the code doing "s/rcu_flipped/rcu_flip_seen/"
> > > > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?
> > >
> > > I believe that we cannot safely do this. The rcu_flipped-to-rcu_flip_seen
> > > transition has to be synchronized to the moving of the callbacks --
> > > either that or we need more GP_STAGES.
> >
> > Hmm. Still can't understand.
>
> Callbacks would be able to be injected into a grace period after it
> started.
Yes, but this is _exactly_ what the current code does in the scenario below,
> Or are you arguing that as long as interrupts remain disabled between
> the two events, no harm done?
no,
> > Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
> > rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
> > (say, another CPU does rcu_try_flip_idle() in between).
> >
> > We ack the flip, call_rcu() enables irqs, the timer interrupt calls
> > __rcu_advance_callbacks() again and moves the callbacks.
> >
> > So, it is still possible that "move callbacks" and "ack the flip" happen
> > out of order. But why this is bad?
Look, what happens is
// call_rcu()
rcu_flip_flag = rcu_flipped
insert the new callback
// timer irq
move the callbacks (the new one goes to wait[0])
But I still can't understand why this is bad,
> > This can't "speedup" the moving of our callbacks from next to done lists.
> > Yes, RCU state machine can switch to the next state/stage, but this looks
> > safe to me.
Before this callback will be flushed, we need 2 rdp->completed != rcu_ctrlblk.completed
further events, we can't miss rcu_read_lock() section, no?
> > Help!
Please :)
> > if (rcu_ctrlblk.completed == rdp->completed)
> > rcu_try_flip();
> >
> > Could you clarify the check above? Afaics this is just optimization,
> > technically it is correct to rcu_try_flip() at any time, even if ->completed
> > are not synchronized. Most probably in that case rcu_try_flip_waitack() will
> > fail, but nothing bad can happen, yes?
>
> >From a conceptual viewpoint, if this CPU hasn't caught up with the
> last grace-period stage, it has no business trying to push forward to
> the next stage. So this might (or might not) happen to work with this
> particular implementation, it needs to stay as is. We need this code
> to be robust enough to optimize the grace-period latencies, right?
Yes, yes. I just wanted to be sure I didn't miss some other subtle reason.
> > void synchronize_sched(void)
> > {
> > struct migration_req req;
> >
> > req->task = NULL;
> > init_completion(&req.done);
> >
> > for_each_online_cpu(cpu) {
> > struct rq *rq = cpu_rq(cpu);
> > int online;
> >
> > spin_lock_irq(&rq->lock);
> > online = cpu_online(cpu); // HOTPLUG_CPU
> > if (online) {
> > list_add(&req->list, &rq->migration_queue);
> > req.done.done = 0;
> > }
> > spin_unlock_irq(&rq->lock);
> >
> > if (online) {
> > wake_up_process(rq->migration_thread);
> > wait_for_completion(&req.done);
> > }
> > }
> > }
> >
>
> I need to think about your approach above. It looks like you are
> leveraging the migration tasks, but I am concerned about concurrent
> hotplug events.
I hope this is OK, note that migration_call(CPU_DEAD) flushes ->migration_queue,
if we take rq->lock after that we must see !cpu_online(cpu). CPU_UP event is not
interesting for us, we can miss it.
Hmm... but wake_up_process() should be moved under spin_lock().
Oleg.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists