[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20080408003449.873574d7.akpm@linux-foundation.org>
Date: Tue, 8 Apr 2008 00:34:49 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: paulmck@...ux.vnet.ibm.com
Cc: linux-kernel@...r.kernel.org, mathieu.desnoyers@...ymtl.ca,
mingo@...e.hu, hch@...radead.org, mmlnx@...ibm.com,
dipankar@...ibm.com, dsmith@...hat.com, rostedt@...dmis.org,
adrian.bunk@...ial.fi, a.p.zijlstra@...llo.nl, ego@...ibm.com,
niv@...ibm.com, dvhltc@...ibm.com, rusty@....ibm.com,
jkenisto@...ux.vnet.ibm.com, oleg@...sign.ru
Subject: Re: [PATCH,RFC] Add call_rcu_sched()
On Sun, 6 Apr 2008 14:37:19 -0700 "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com> wrote:
> Hello!
>
> Third cut of patch to provide the call_rcu_sched(). This is again to
> synchronize_sched() as call_rcu() is to synchronize_rcu().
>
> Should be fine for experimental use, but not ready for inclusion.
Let me know when to come out of hiding ;)
> Passes multi-hour rcutorture sessions with concurrent CPU hotplugging.
>
> Fixes since the first version include a bug that could result in
> indefinite blocking (spotted by Gautham Shenoy), better resiliency
> against CPU-hotplug operations, and other minor fixes.
>
> Fixes since the second version include reworking grace-period detection
> to avoid deadlocks that could happen when running concurrently with
> CPU hotplug, adding Mathieu's fix to avoid the softlockup messages,
> as well as Mathieu's fix to allow use earlier in boot.
>
> Known/suspected shortcomings:
>
> o Only moderately tested on x86-64 and POWER -- a few hours of
> rcutorture with concurrent CPU hotplugging. In particular, I
> still do not trust the sleep/wakeup logic between call_rcu_sched()
> and rcu_sched_grace_period().
>
> o Need to add call_rcu_sched() testing to rcutorture.
>
> o Still needs rcu_barrier_sched() -- intending to incorporate
> the version Mathieu provided.
>
> This patch also fixes a long-standing bug in the earlier preemptable-RCU
> implementation of synchronize_rcu() that could result in loss of
> concurrent external changes to a task's CPU affinity mask. I still cannot
> remember who reported this...
>
> ...
>
> +#define call_rcu_sched(head, func) call_rcu(head, func)
> +
> extern void __rcu_init(void);
> +#define rcu_init_sched() do { } while (0)
There are lots of creepy macros-which-probably-dont-need-to-be-macros in
here.
> +
> +static inline int
> +rcu_qsctr_inc_needed_dyntick(int cpu)
Unneeded newline.
> +{
> + long curr;
> + long snap;
> + struct rcu_dyntick_sched *rdssp = &per_cpu(rcu_dyntick_sched, cpu);
> +
> + curr = rdssp->dynticks;
> + snap = rdssp->sched_dynticks_snap;
> + smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
> +
> + /*
> + * If the CPU remained in dynticks mode for the entire time
> + * and didn't take any interrupts, NMIs, SMIs, or whatever,
> + * then it cannot be in the middle of an rcu_read_lock(), so
> + * the next rcu_read_lock() it executes must use the new value
> + * of the counter. Therefore, this CPU has been in a quiescent
> + * state the entire time, and we don't need to wait for it.
> + */
> +
> + if ((curr == snap) && ((curr & 0x1) == 0))
> + return 0;
> +
> + /*
> + * If the CPU passed through or entered a dynticks idle phase with
> + * no active irq handlers, then, as above, this CPU has already
> + * passed through a quiescent state.
> + */
> +
> + if ((curr - snap) > 2 || (snap & 0x1) == 0)
> + return 0;
> +
> + /* We need this CPU to go through a quiescent state. */
> +
> + return 1;
> +}
That's a pretty big inline. It only has a single callsite so the compiler
should inline it for us. And if it grows a second callsite, the inlining
is probably wrong.
> +static inline int
> +rcu_qsctr_inc_needed(int cpu)
Unneeded newline.
> /*
> * Get here when RCU is idle. Decide whether we need to
> * move out of idle state, and return non-zero if so.
> @@ -821,6 +924,13 @@ void rcu_check_callbacks(int cpu, int us
> unsigned long flags;
> struct rcu_data *rdp = RCU_DATA_CPU(cpu);
>
> + if (user ||
> + (idle_cpu(cpu) && !in_softirq() &&
> + hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
I think this test could do with a bigfatcomment explaining what it is doing.
> + smp_mb(); /* Guard against aggressive schedule(). */
> + rcu_qsctr_inc(cpu);
> + }
> +
> rcu_check_mb(cpu);
> if (rcu_ctrlblk.completed == rdp->completed)
> rcu_try_flip();
>
> ...
>
> +
> + /*
> + * The rcu_sched grace-period processing might have bypassed
> + * this CPU, given that it was not in the rcu_cpu_online_map
> + * when the grace-period scan started. This means that the
> + * grace-period task might sleep. So make sure that if this
> + * should happen, the first callback posted to this CPU will
> + * wake up the grace-period task if need be.
> + */
> +
> + local_irq_save(flags);
> + rdp = RCU_DATA_ME();
> + spin_lock(&rdp->lock);
I assume that splitting the irq-disable from the spin_lock is a little
latency optimisation?
> + rdp->rcu_sched_sleeping = 1;
> + spin_unlock_irqrestore(&rdp->lock, flags);
> }
>
> #else /* #ifdef CONFIG_HOTPLUG_CPU */
> @@ -993,26 +1129,194 @@ void call_rcu(struct rcu_head *head, voi
> }
> EXPORT_SYMBOL_GPL(call_rcu);
>
> +void call_rcu_sched(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
> +{
> + unsigned long flags;
> + struct rcu_data *rdp;
> + int wake_gp = 0;
> +
> + head->func = func;
> + head->next = NULL;
> + local_irq_save(flags);
> + rdp = RCU_DATA_ME();
> + spin_lock(&rdp->lock);
> + *rdp->nextschedtail = head;
> + rdp->nextschedtail = &head->next;
> + if (rdp->rcu_sched_sleeping) {
> +
> + /* Grace-period processing might be sleeping... */
> +
> + rdp->rcu_sched_sleeping = 0;
> + wake_gp = 1;
> + }
> + spin_unlock(&rdp->lock);
> + local_irq_restore(flags);
spin_unlock_irqrestore() here would be consistent with the above.
> + if (wake_gp) {
> +
> + /* Wake up grace-period processing, unless someone beat us. */
> +
> + spin_lock_irqsave(&rcu_ctrlblk.schedlock, flags);
If wake_gp!=0 is common then we could microoptimise straight-line
performance here by retaining the irq-offness from above.
> + if (rcu_ctrlblk.sched_sleep != rcu_sched_sleeping)
> + wake_gp = 0;
> + rcu_ctrlblk.sched_sleep = rcu_sched_not_sleeping;
> + spin_unlock_irqrestore(&rcu_ctrlblk.schedlock, flags);
> + if (wake_gp)
> + wake_up_interruptible(&rcu_ctrlblk.sched_wq);
> + }
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_sched);
>
> ...
>
> +static int
> +rcu_sched_grace_period(void *arg)
Unneeded newline.
> {
> - cpumask_t oldmask;
> + int couldsleep; /* might sleep after current pass. */
> + int couldsleepnext = 0; /* might sleep after next pass. */
> int cpu;
> + unsigned long flags;
> + struct rcu_data *rdp;
> + int ret;
>
> - if (sched_getaffinity(0, &oldmask) < 0)
> - oldmask = cpu_possible_map;
> - for_each_online_cpu(cpu) {
> - sched_setaffinity(0, cpumask_of_cpu(cpu));
> - schedule();
> - }
> - sched_setaffinity(0, oldmask);
> + /*
> + * Each pass through the following loop handles one
> + * rcu_sched grace period cycle.
> + */
> +
> + do {
> +
> + /* Save each CPU's current state. */
> +
> + for_each_online_cpu(cpu) {
Numerous unneeded newline ;)
> + dyntick_save_progress_counter_sched(cpu);
> + save_qsctr_sched(cpu);
> + }
> +
> + /*
> + * Sleep for about an RCU grace-period's worth to
> + * allow better batching and to consume less CPU.
> + */
> +
> + schedule_timeout_interruptible(HZ / 20);
eek, a magic number.
> + /*
> + * If there was nothing to do last time, prepare to
> + * sleep at the end of the current grace period cycle.
> + */
> +
> + couldsleep = couldsleepnext;
> + couldsleepnext = 1;
> + if (couldsleep) {
> + spin_lock_irqsave(&rcu_ctrlblk.schedlock, flags);
> + rcu_ctrlblk.sched_sleep = rcu_sched_sleep_prep;
> + spin_unlock_irqrestore(&rcu_ctrlblk.schedlock, flags);
> + }
If the above locking actually correct and needed? The write to
rcu_ctrlblk.sched_sleep is a single word...
> + /*
> + * Wait on each CPU in turn to have either visited
> + * a quiescent state or been in dynticks-idle mode.
> + */
> +
> + for_each_online_cpu(cpu) {
> + while (rcu_qsctr_inc_needed(cpu) &&
> + rcu_qsctr_inc_needed_dyntick(cpu)) {
> + /* resched_cpu(cpu); */
> + schedule_timeout_interruptible(1);
> + }
> + }
> +
> + /*
> + * Advance callbacks for each CPU.
> + */
> +
> + for_each_online_cpu(cpu) {
It's more conventional to omit the blank line after the above form of
comment block.
> + rdp = RCU_DATA_CPU(cpu);
> + spin_lock_irqsave(&rdp->lock, flags);
> +
> + /*
> + * We are running on this CPU irq-disabled, so no
> + * CPU can go offline until we re-enable irqs.
but, but, but. The cpu at `cpu' could have gone offline just before we
disabled local interrupts.
> + * Advance the callbacks! We share normal RCU's
> + * donelist, since callbacks are invoked the
> + * same way in either case.
> + */
> +
> + if (rdp->waitschedlist != NULL) {
> + *rdp->donetail = rdp->waitschedlist;
> + rdp->donetail = rdp->waitschedtail;
> +
> + /*
> + * Next rcu_check_callbacks() will
> + * do the required raise_softirq().
> + */
> + }
> + if (rdp->nextschedlist != NULL) {
> + rdp->waitschedlist = rdp->nextschedlist;
> + rdp->waitschedtail = rdp->nextschedtail;
> + couldsleep = 0;
> + couldsleepnext = 0;
> + } else {
> + rdp->waitschedlist = NULL;
> + rdp->waitschedtail = &rdp->waitschedlist;
> + }
> + rdp->nextschedlist = NULL;
> + rdp->nextschedtail = &rdp->nextschedlist;
> +
> + /* Mark sleep intention. */
> +
> + rdp->rcu_sched_sleeping = couldsleep;
> +
> + spin_unlock_irqrestore(&rdp->lock, flags);
> + }
> +
> + /* If we saw callbacks on the last scan, go deal with them. */
> +
> + if (!couldsleep)
> + continue;
> +
> + /* Attempt to block... */
> +
> + spin_lock_irqsave(&rcu_ctrlblk.schedlock, flags);
> + if (rcu_ctrlblk.sched_sleep != rcu_sched_sleep_prep) {
> +
> + /*
> + * Someone posted a callback after we scanned.
> + * Go take care of it.
> + */
> +
> + spin_unlock_irqrestore(&rcu_ctrlblk.schedlock, flags);
> + couldsleepnext = 0;
> + continue;
> + }
> +
> + /* Block until the next person posts a callback. */
> +
> + rcu_ctrlblk.sched_sleep = rcu_sched_sleeping;
> + spin_unlock_irqrestore(&rcu_ctrlblk.schedlock, flags);
> + ret = 0;
> + __wait_event_interruptible(rcu_ctrlblk.sched_wq,
> + rcu_ctrlblk.sched_sleep != rcu_sched_sleeping,
> + ret);
> + if (ret)
> + flush_signals(current);
That flush_signals() was a surprise. A desurprising comment would be nice.
> + couldsleepnext = 0;
> +
> + } while (!kthread_should_stop());
> +
> + return (0);
> }
> -EXPORT_SYMBOL_GPL(__synchronize_sched);
>
> /*
> * Check to see if any future RCU-related work will need to be done
> @@ -1029,7 +1333,9 @@ int rcu_needs_cpu(int cpu)
>
> return (rdp->donelist != NULL ||
> !!rdp->waitlistcount ||
> - rdp->nextlist != NULL);
> + rdp->nextlist != NULL ||
> + rdp->nextschedlist != NULL ||
> + rdp->waitschedlist != NULL);
> }
>
> int rcu_pending(int cpu)
> @@ -1040,7 +1346,9 @@ int rcu_pending(int cpu)
>
> if (rdp->donelist != NULL ||
> !!rdp->waitlistcount ||
> - rdp->nextlist != NULL)
> + rdp->nextlist != NULL ||
> + rdp->nextschedlist != NULL ||
> + rdp->waitschedlist != NULL)
> return 1;
>
> /* The RCU core needs an acknowledgement from this CPU. */
> @@ -1107,6 +1415,11 @@ void __init __rcu_init(void)
> rdp->donetail = &rdp->donelist;
> rdp->rcu_flipctr[0] = 0;
> rdp->rcu_flipctr[1] = 0;
> + rdp->nextschedlist = NULL;
> + rdp->nextschedtail = &rdp->nextschedlist;
> + rdp->waitschedlist = NULL;
> + rdp->waitschedtail = &rdp->waitschedlist;
> + rdp->rcu_sched_sleeping = 0;
> }
> register_cpu_notifier(&rcu_nb);
>
> @@ -1129,6 +1442,18 @@ void __init __rcu_init(void)
> }
>
> /*
> + * Late-boot-time RCU initialization that must wait until after scheduler
> + * has been initialized.
> + */
> +void __init rcu_init_sched(void)
> +{
> + rcu_sched_grace_period_task = kthread_run(rcu_sched_grace_period,
> + NULL,
> + "rcu_sched_grace_period");
> + WARN_ON(IS_ERR(rcu_sched_grace_period_task));
> +}
> +
> +/*
> * Deprecated, use synchronize_rcu() or synchronize_sched() instead.
> */
> void synchronize_kernel(void)
I suspect I don't understand any of the RCU code any more.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists