Message-Id: <20080408003449.873574d7.akpm@linux-foundation.org>
Date:	Tue, 8 Apr 2008 00:34:49 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	paulmck@...ux.vnet.ibm.com
Cc:	linux-kernel@...r.kernel.org, mathieu.desnoyers@...ymtl.ca,
	mingo@...e.hu, hch@...radead.org, mmlnx@...ibm.com,
	dipankar@...ibm.com, dsmith@...hat.com, rostedt@...dmis.org,
	adrian.bunk@...ial.fi, a.p.zijlstra@...llo.nl, ego@...ibm.com,
	niv@...ibm.com, dvhltc@...ibm.com, rusty@....ibm.com,
	jkenisto@...ux.vnet.ibm.com, oleg@...sign.ru
Subject: Re: [PATCH,RFC] Add call_rcu_sched()

On Sun, 6 Apr 2008 14:37:19 -0700 "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com> wrote:

> Hello!
> 
> Third cut of the patch to provide call_rcu_sched().  Again, this is to
> synchronize_sched() as call_rcu() is to synchronize_rcu().
> 
> Should be fine for experimental use, but not ready for inclusion.

Let me know when to come out of hiding ;)

> Passes multi-hour rcutorture sessions with concurrent CPU hotplugging.
> 
> Fixes since the first version include a bug that could result in
> indefinite blocking (spotted by Gautham Shenoy), better resiliency
> against CPU-hotplug operations, and other minor fixes.
> 
> Fixes since the second version include reworking grace-period detection
> to avoid deadlocks that could happen when running concurrently with
> CPU hotplug, adding Mathieu's fix to avoid the softlockup messages,
> as well as Mathieu's fix to allow use earlier in boot.
> 
> Known/suspected shortcomings:
> 
> o	Only moderately tested on x86-64 and POWER -- a few hours of
> 	rcutorture with concurrent CPU hotplugging.  In particular, I
> 	still do not trust the sleep/wakeup logic between call_rcu_sched()
> 	and rcu_sched_grace_period().
> 
> o	Need to add call_rcu_sched() testing to rcutorture.
> 
> o	Still needs rcu_barrier_sched() -- intending to incorporate
> 	the version Mathieu provided.
> 
> This patch also fixes a long-standing bug in the earlier preemptable-RCU
> implementation of synchronize_rcu() that could result in loss of
> concurrent external changes to a task's CPU affinity mask.  I still cannot
> remember who reported this...
>
> ...
>
> +#define call_rcu_sched(head, func) call_rcu(head, func)
> +
>  extern void __rcu_init(void);
> +#define rcu_init_sched()	do { } while (0)

There are lots of creepy macros-which-probably-don't-need-to-be-macros in
here.
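
To illustrate, here is a minimal userspace sketch of the two stubs written
as static inline functions instead of macros.  The struct rcu_head and
call_rcu() definitions below are stand-ins invented for the example, not
the kernel's; only the names call_rcu_sched() and rcu_init_sched() come
from the patch.

```c
#include <stddef.h>

/* Stand-in type for illustration only; not the kernel definition. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

/* Hypothetical: records the last callback so the stub is observable. */
static struct rcu_head *last_posted;

/* Hypothetical stub standing in for the real call_rcu(). */
static void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->next = NULL;
	last_posted = head;
}

/* Function form of "#define call_rcu_sched(head, func) call_rcu(head, func)". */
static inline void call_rcu_sched(struct rcu_head *head,
				  void (*func)(struct rcu_head *head))
{
	call_rcu(head, func);
}

/* Function form of "#define rcu_init_sched() do { } while (0)". */
static inline void rcu_init_sched(void)
{
}
```

The function forms follow ordinary C scoping and typing rules, and an
optimising compiler would normally inline them just as it would expand
the macros.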

> +
> +static inline int
> +rcu_qsctr_inc_needed_dyntick(int cpu)

Unneeded newline.

> +{
> +	long curr;
> +	long snap;
> +	struct rcu_dyntick_sched *rdssp = &per_cpu(rcu_dyntick_sched, cpu);
> +
> +	curr = rdssp->dynticks;
> +	snap = rdssp->sched_dynticks_snap;
> +	smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
> +
> +	/*
> +	 * If the CPU remained in dynticks mode for the entire time
> +	 * and didn't take any interrupts, NMIs, SMIs, or whatever,
> +	 * then it cannot be in the middle of an rcu_read_lock(), so
> +	 * the next rcu_read_lock() it executes must use the new value
> +	 * of the counter.  Therefore, this CPU has been in a quiescent
> +	 * state the entire time, and we don't need to wait for it.
> +	 */
> +
> +	if ((curr == snap) && ((curr & 0x1) == 0))
> +		return 0;
> +
> +	/*
> +	 * If the CPU passed through or entered a dynticks idle phase with
> +	 * no active irq handlers, then, as above, this CPU has already
> +	 * passed through a quiescent state.
> +	 */
> +
> +	if ((curr - snap) > 2 || (snap & 0x1) == 0)
> +		return 0;
> +
> +	/* We need this CPU to go through a quiescent state. */
> +
> +	return 1;
> +}

That's a pretty big inline.  It only has a single callsite so the compiler
should inline it for us.  And if it grows a second callsite, the inlining
is probably wrong.
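
For readers following the logic of the quoted function, here is a
userspace model of its two tests, with the per-CPU structure and the
memory barrier stripped out.  It assumes, as the quoted comments suggest,
that the counter is even while the CPU is in dynticks-idle mode and odd
otherwise; qs_needed_dyntick() is a name made up for the sketch.

```c
/*
 * Model of the decision in the quoted rcu_qsctr_inc_needed_dyntick().
 * "snap" is the counter value saved at the start of the grace period,
 * "curr" is the value observed now.  Assumption: even means the CPU
 * is in dynticks-idle mode, odd means it is not.
 */
static int qs_needed_dyntick(long snap, long curr)
{
	/* Idle at the snapshot and no transitions since: nothing to wait for. */
	if (curr == snap && (curr & 0x1) == 0)
		return 0;

	/* Enough transitions since the snapshot, or idle when it was taken. */
	if ((curr - snap) > 2 || (snap & 0x1) == 0)
		return 0;

	/* This CPU still has to pass through a quiescent state. */
	return 1;
}
```

With an odd snapshot, for example snap = 3, the function keeps returning
1 until the counter has advanced by more than two.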

> +static inline int
> +rcu_qsctr_inc_needed(int cpu)

Unneeded newline.

>  /*
>   * Get here when RCU is idle.  Decide whether we need to
>   * move out of idle state, and return non-zero if so.
> @@ -821,6 +924,13 @@ void rcu_check_callbacks(int cpu, int us
>  	unsigned long flags;
>  	struct rcu_data *rdp = RCU_DATA_CPU(cpu);
>  
> +	if (user ||
> +	    (idle_cpu(cpu) && !in_softirq() &&
> +	     hardirq_count() <= (1 << HARDIRQ_SHIFT))) {

I think this test could do with a bigfatcomment explaining what it is doing.
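
As a starting point for such a comment, here is one reading of the test
as a userspace model.  The parameters stand in for the kernel's
idle_cpu(), in_softirq() and hardirq_count(); the HARDIRQ_SHIFT value
and the function name are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical value for illustration; the real one depends on the kernel config. */
#define HARDIRQ_SHIFT 16

/*
 * Returns true when the context interrupted by the scheduling-clock
 * tick can be counted as a quiescent state:
 *  - the tick interrupted user-mode execution, or
 *  - the CPU is idle, in_softirq() is false, and the tick itself is
 *    the only hardirq level active (it did not nest inside another
 *    hardirq handler).
 */
static bool tick_saw_quiescent_state(bool user, bool idle, bool in_softirq,
				     unsigned long hardirq_count)
{
	return user ||
	       (idle && !in_softirq &&
		hardirq_count <= (1UL << HARDIRQ_SHIFT));
}
```

Under this reading, an idle CPU whose tick arrived on top of another
hardirq (a count of two levels) is not treated as quiescent.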

> +		smp_mb();	/* Guard against aggressive schedule(). */
> +	     	rcu_qsctr_inc(cpu);
> +	}
> +
>  	rcu_check_mb(cpu);
>  	if (rcu_ctrlblk.completed == rdp->completed)
>  		rcu_try_flip();
>
> ...
>
> +
> +	/*
> +	 * The rcu_sched grace-period processing might have bypassed
> +	 * this CPU, given that it was not in the rcu_cpu_online_map
> +	 * when the grace-period scan started.  This means that the
> +	 * grace-period task might sleep.  So make sure that if this
> +	 * should happen, the first callback posted to this CPU will
> +	 * wake up the grace-period task if need be.
> +	 */
> +
> +	local_irq_save(flags);
> +	rdp = RCU_DATA_ME();
> +	spin_lock(&rdp->lock);

I assume that splitting the irq-disable from the spin_lock is a little
latency optimisation?

> +	rdp->rcu_sched_sleeping = 1;
> +	spin_unlock_irqrestore(&rdp->lock, flags);
>  }
>  
>  #else /* #ifdef CONFIG_HOTPLUG_CPU */
> @@ -993,26 +1129,194 @@ void call_rcu(struct rcu_head *head, voi
>  }
>  EXPORT_SYMBOL_GPL(call_rcu);
>  
> +void call_rcu_sched(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
> +{
> +	unsigned long flags;
> +	struct rcu_data *rdp;
> +	int wake_gp = 0;
> +
> +	head->func = func;
> +	head->next = NULL;
> +	local_irq_save(flags);
> +	rdp = RCU_DATA_ME();
> +	spin_lock(&rdp->lock);
> +	*rdp->nextschedtail = head;
> +	rdp->nextschedtail = &head->next;
> +	if (rdp->rcu_sched_sleeping) {
> +
> +		/* Grace-period processing might be sleeping... */
> +
> +		rdp->rcu_sched_sleeping = 0;
> +		wake_gp = 1;
> +	}
> +	spin_unlock(&rdp->lock);
> +	local_irq_restore(flags);

spin_unlock_irqrestore() here would be consistent with the above.
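
The enqueue in the quoted hunk relies on a tail pointer that always
points at the location where the next element should be stored.  A
minimal userspace sketch of that idiom follows, with locking and irq
handling left out; struct cb, struct cb_queue and the function names are
stand-ins made up for the example.

```c
#include <stddef.h>

struct cb {			/* stand-in for struct rcu_head */
	struct cb *next;
};

struct cb_queue {		/* stand-in for the nextschedlist/nextschedtail pair */
	struct cb *head;
	struct cb **tail;	/* points at head, or at the last element's next */
};

static void cb_queue_init(struct cb_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* O(1) append with no special case for an empty queue. */
static void cb_enqueue(struct cb_queue *q, struct cb *c)
{
	c->next = NULL;
	*q->tail = c;		/* link after the current last element */
	q->tail = &c->next;	/* the new element is now the last one */
}
```

Because the tail starts out pointing at the head pointer itself, the
first append fills in the head and later appends extend the chain using
the same two stores.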

> +	if (wake_gp) {
> +
> +		/* Wake up grace-period processing, unless someone beat us. */
> +
> +		spin_lock_irqsave(&rcu_ctrlblk.schedlock, flags);

If wake_gp!=0 is common then we could microoptimise straight-line
performance here by retaining the irq-offness from above.

> +		if (rcu_ctrlblk.sched_sleep != rcu_sched_sleeping)
> +			wake_gp = 0;
> +		rcu_ctrlblk.sched_sleep = rcu_sched_not_sleeping;
> +		spin_unlock_irqrestore(&rcu_ctrlblk.schedlock, flags);
> +		if (wake_gp)
> +			wake_up_interruptible(&rcu_ctrlblk.sched_wq);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_sched);
>
> ...
>
> +static int
> +rcu_sched_grace_period(void *arg)

Unneeded newline.

>  {
> -	cpumask_t oldmask;
> +	int couldsleep;		/* might sleep after current pass. */
> +	int couldsleepnext = 0; /* might sleep after next pass. */
>  	int cpu;
> +	unsigned long flags;
> +	struct rcu_data *rdp;
> +	int ret;
>  
> -	if (sched_getaffinity(0, &oldmask) < 0)
> -		oldmask = cpu_possible_map;
> -	for_each_online_cpu(cpu) {
> -		sched_setaffinity(0, cpumask_of_cpu(cpu));
> -		schedule();
> -	}
> -	sched_setaffinity(0, oldmask);
> +	/*
> +	 * Each pass through the following loop handles one
> +	 * rcu_sched grace period cycle.
> +	 */
> +
> +	do {
> +		
> +		/* Save each CPU's current state. */
> +
> +		for_each_online_cpu(cpu) {

Numerous unneeded newlines ;)

> +			dyntick_save_progress_counter_sched(cpu);
> +			save_qsctr_sched(cpu);
> +		}
> +
> +		/*
> +		 * Sleep for about an RCU grace-period's worth to
> +		 * allow better batching and to consume less CPU.
> +		 */
> +
> +		schedule_timeout_interruptible(HZ / 20);

eek, a magic number.
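
One way to address this is to give the delay a name and put the
rationale next to its definition.  In the sketch below, the name
RCU_SCHED_BATCH_TIME and the helper function are hypothetical, and HZ is
fixed at 1000 only so that the example builds outside the kernel.

```c
/* Hypothetical tick rate for the example; in the kernel HZ comes from the config. */
#define HZ 1000

/*
 * Hypothetical name for the batching delay: 1/20 of a second,
 * expressed in jiffies.
 */
#define RCU_SCHED_BATCH_TIME (HZ / 20)

/* Returns the batching delay in jiffies. */
static long rcu_sched_batch_jiffies(void)
{
	return RCU_SCHED_BATCH_TIME;
}
```

The call site would then read
schedule_timeout_interruptible(RCU_SCHED_BATCH_TIME).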

> +		/*
> +		 * If there was nothing to do last time, prepare to
> +		 * sleep at the end of the current grace period cycle.
> +		 */
> +
> +		couldsleep = couldsleepnext;
> +		couldsleepnext = 1;
> +		if (couldsleep) {
> +			spin_lock_irqsave(&rcu_ctrlblk.schedlock, flags);
> +			rcu_ctrlblk.sched_sleep = rcu_sched_sleep_prep;
> +			spin_unlock_irqrestore(&rcu_ctrlblk.schedlock, flags);
> +		}

Is the above locking actually correct and needed?  The write to
rcu_ctrlblk.sched_sleep is a single word...
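
For context, the three-state handshake that this store takes part in can
be modelled in a single thread as below.  The enumerator names come from
the patch; the function names are made up for the sketch, the locks are
omitted, and the per-CPU rcu_sched_sleeping flag is ignored.  The sketch
does not settle whether the plain store needs the lock; it only shows
that the other two sites are check-then-set sequences.

```c
enum rcu_sched_sleep_state {
	rcu_sched_not_sleeping,	/* grace-period task is running */
	rcu_sched_sleep_prep,	/* task intends to sleep after this pass */
	rcu_sched_sleeping,	/* task is blocked on the wait queue */
};

static enum rcu_sched_sleep_state sched_sleep = rcu_sched_not_sleeping;

/* Grace-period task, during a pass: announce the intention to sleep. */
static void gp_prepare_to_sleep(void)
{
	sched_sleep = rcu_sched_sleep_prep;	/* the single-word store */
}

/* Grace-period task, end of a pass: returns 1 if it may block. */
static int gp_try_to_sleep(void)
{
	if (sched_sleep != rcu_sched_sleep_prep)
		return 0;	/* a callback was posted after the scan */
	sched_sleep = rcu_sched_sleeping;
	return 1;
}

/* Callback poster: returns 1 if it is the one that must issue the wakeup. */
static int poster_claim_wakeup(void)
{
	int wake = (sched_sleep == rcu_sched_sleeping);

	sched_sleep = rcu_sched_not_sleeping;
	return wake;
}
```

If a callback is posted between gp_prepare_to_sleep() and
gp_try_to_sleep(), the task sees that the state has changed and goes
around again instead of blocking.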

> +		/*
> +		 * Wait on each CPU in turn to have either visited
> +		 * a quiescent state or been in dynticks-idle mode.
> +		 */
> +
> +		for_each_online_cpu(cpu) {
> +			while (rcu_qsctr_inc_needed(cpu) &&
> +			       rcu_qsctr_inc_needed_dyntick(cpu)) {
> +				/* resched_cpu(cpu); */
> +				schedule_timeout_interruptible(1);
> +			}
> +		}
> +
> +		/*
> +		 * Advance callbacks for each CPU.
> +		 */
> +
> +		for_each_online_cpu(cpu) {

It's more conventional to omit the blank line after the above form of
comment block.

> +			rdp = RCU_DATA_CPU(cpu);
> +			spin_lock_irqsave(&rdp->lock, flags);
> +
> +			/*
> +			 * We are running on this CPU irq-disabled, so no
> +			 * CPU can go offline until we re-enable irqs.

but, but, but.  The cpu at `cpu' could have gone offline just before we
disabled local interrupts.

> +			 * Advance the callbacks!  We share normal RCU's
> +			 * donelist, since callbacks are invoked the
> +			 * same way in either case.
> +			 */
> +
> +			if (rdp->waitschedlist != NULL) {
> +				*rdp->donetail = rdp->waitschedlist;
> +				rdp->donetail = rdp->waitschedtail;
> +
> +				/*
> +				 * Next rcu_check_callbacks() will
> +				 * do the required raise_softirq().
> +				 */
> +			}
> +			if (rdp->nextschedlist != NULL) {
> +				rdp->waitschedlist = rdp->nextschedlist;
> +				rdp->waitschedtail = rdp->nextschedtail;
> +				couldsleep = 0;
> +				couldsleepnext = 0;
> +			} else {
> +				rdp->waitschedlist = NULL;
> +				rdp->waitschedtail = &rdp->waitschedlist;
> +			}
> +			rdp->nextschedlist = NULL;
> +			rdp->nextschedtail = &rdp->nextschedlist;
> +
> +			/* Mark sleep intention. */
> +
> +			rdp->rcu_sched_sleeping = couldsleep;
> +
> +			spin_unlock_irqrestore(&rdp->lock, flags);
> +		}
> +
> +		/* If we saw callbacks on the last scan, go deal with them. */
> +
> +		if (!couldsleep)
> +			continue;
> +
> +		/* Attempt to block... */
> +
> +		spin_lock_irqsave(&rcu_ctrlblk.schedlock, flags);
> +		if (rcu_ctrlblk.sched_sleep != rcu_sched_sleep_prep) {
> +
> +			/*
> +			 * Someone posted a callback after we scanned.
> +			 * Go take care of it.
> +			 */
> +
> +			spin_unlock_irqrestore(&rcu_ctrlblk.schedlock, flags);
> +			couldsleepnext = 0;
> +			continue;
> +		}
> +
> +		/* Block until the next person posts a callback. */
> +
> +		rcu_ctrlblk.sched_sleep = rcu_sched_sleeping;
> +		spin_unlock_irqrestore(&rcu_ctrlblk.schedlock, flags);
> +		ret = 0;
> +		__wait_event_interruptible(rcu_ctrlblk.sched_wq,
> +			rcu_ctrlblk.sched_sleep != rcu_sched_sleeping,
> +			ret);
> +		if (ret)
> +			flush_signals(current);

That flush_signals() was a surprise.  A desurprising comment would be nice.

> +		couldsleepnext = 0;
> +
> +	} while (!kthread_should_stop());
> +
> +	return (0);
>  }
> -EXPORT_SYMBOL_GPL(__synchronize_sched);
>  
>  /*
>   * Check to see if any future RCU-related work will need to be done
> @@ -1029,7 +1333,9 @@ int rcu_needs_cpu(int cpu)
>  
>  	return (rdp->donelist != NULL ||
>  		!!rdp->waitlistcount ||
> -		rdp->nextlist != NULL);
> +		rdp->nextlist != NULL ||
> +		rdp->nextschedlist != NULL ||
> +		rdp->waitschedlist != NULL);
>  }
>  
>  int rcu_pending(int cpu)
> @@ -1040,7 +1346,9 @@ int rcu_pending(int cpu)
>  
>  	if (rdp->donelist != NULL ||
>  	    !!rdp->waitlistcount ||
> -	    rdp->nextlist != NULL)
> +	    rdp->nextlist != NULL ||
> +	    rdp->nextschedlist != NULL ||
> +	    rdp->waitschedlist != NULL)
>  		return 1;
>  
>  	/* The RCU core needs an acknowledgement from this CPU. */
> @@ -1107,6 +1415,11 @@ void __init __rcu_init(void)
>  		rdp->donetail = &rdp->donelist;
>  		rdp->rcu_flipctr[0] = 0;
>  		rdp->rcu_flipctr[1] = 0;
> +		rdp->nextschedlist = NULL;
> +		rdp->nextschedtail = &rdp->nextschedlist;
> +		rdp->waitschedlist = NULL;
> +		rdp->waitschedtail = &rdp->waitschedlist;
> +		rdp->rcu_sched_sleeping = 0;
>  	}
>  	register_cpu_notifier(&rcu_nb);
>  
> @@ -1129,6 +1442,18 @@ void __init __rcu_init(void)
>  }
>  
>  /*
> + * Late-boot-time RCU initialization that must wait until after scheduler
> + * has been initialized.
> + */
> +void __init rcu_init_sched(void)
> +{
> +	rcu_sched_grace_period_task = kthread_run(rcu_sched_grace_period,
> +						  NULL,
> +						  "rcu_sched_grace_period");
> +	WARN_ON(IS_ERR(rcu_sched_grace_period_task));
> +}
> +
> +/*
>   * Deprecated, use synchronize_rcu() or synchronize_sched() instead.
>   */
>  void synchronize_kernel(void)

I suspect I don't understand any of the RCU code any more.
