linux-kernel - Re: [PATCH, RFC] v7 scalable classic RCU implementation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081103203336.GG6792@linux.vnet.ibm.com>
Date:	Mon, 3 Nov 2008 12:33:36 -0800
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Manfred Spraul <manfred@...orfullife.com>
Cc:	linux-kernel@...r.kernel.org, cl@...ux-foundation.org,
	mingo@...e.hu, akpm@...ux-foundation.org, dipankar@...ibm.com,
	josht@...ux.vnet.ibm.com, schamp@....com, niv@...ibm.com,
	dvhltc@...ibm.com, ego@...ibm.com, laijs@...fujitsu.com,
	rostedt@...dmis.org, peterz@...radead.org, penberg@...helsinki.fi,
	andi@...stfloor.org, tglx@...utronix.de
Subject: Re: [PATCH, RFC] v7 scalable classic RCU implementation

On Sun, Nov 02, 2008 at 09:10:55PM +0100, Manfred Spraul wrote:
> Paul E. McKenney wrote:
>> --- /dev/null
>> +++ b/Documentation/RCU/trace.txt
>> @@ -0,0 +1,398 @@
>> +CONFIG_RCU_TRACE debugfs Files and Formats
>> +
>> +
>> +The rcupreempt and rcutree implementations of RCU provide debugfs trace
>> +output that summarizes counters and state.  This information is useful 
>> for
>> +debugging RCU itself, and can sometimes also help to debug abuses of RCU.
>> +Note that the rcuclassic implementation of RCU does not provide debugfs
>> +trace output.
>> +
>>   
>
> What about some generic files, with the same content for all rcu backends?
> The implementation could be in rcupdate.c. At least counting the rcu 
> callbacks could be done from generic code, the grace periods could be 
> queried [Do all backends implement rcu_batches_completed?]

For the grace-period latencies, I use the attached kernel
module.  This module simply times synchronize_rcu() to obtain
the grace-period performance.  I have similar modules for other
performance measures.

(Please note that this code was written for my own use only, others
may find the kernel-module parameter names to be quite obnoxious.
But yes, nreaders=1 will give you one synchronize_rcu() -update-
task.  What can I say? I used this module to collect data for
http://www.research.ibm.com/journal/sj/472/guniguntala.pdf, and was not
considering use by others.)

I would rather avoid decorating the RCU code with grace-period-latency
measurement code, since this can be done independently with a kernel
module, and I suspect that more people read the RCU code than measure
the grace-period latency.

All backends do implement rcu_batches_completed(), but its return
value has different units on different RCU implementations.  :-(

> Attached is a hack that I use right now for myself.
> Btw - on my 4-cpu system, the average latency from call_rcu() to the rcu 
> callback is 4-5 milliseconds, (CONFIG_HZ_1000).

Hmmm...  I would expect that if you have some CPUs in dyntick idle mode.
But if I run treercu on an CONFIG_HZ_250 8-CPU Power box, I see 2.5
jiffies per grace period if CPUs are kept out of dyntick idle mode, and
4 jiffies per grace period if CPUs are allowed to enter dyntick idle mode.

Alternatively, if you were testing with multiple concurrent
synchronize_rcu() invocations, you can also see longer grace-period
latencies due to the fact that a new synchronize_rcu() must wait for an
earlier grace period to complete before starting a new one.

The 2.5 jiffies is expected when RCU is idle: the synchronize_rcu()
starts a new grace period immediately, it takes up to a jiffy for the
other CPUs do their scheduling-clock interrupt (which notices that the
grace period started), another jiffy for all the CPUs to do their
next scheduling-clock interrupt (which notices the quiescent state),
and on average a half jiffy for the initiating CPU to notice that the
grace period has completed.

I considered trying to trim an additional jiffy off of this grace-period
latency by having CPUs keep track of whether they were in a quiescent
state at the time that they noticed the start of a new grace period.
My attempts to do this ran afoul of races due to interrupt handlers
containing call_rcu(), but it might be possible for rcu_check_callbacks()
to record both the quiescent-state information and the grace-period
number.  But this needs more thought, as the race conditions are quite
subtle.

							Thanx, Paul

> --
>    Manfred

> rom bf385caf2484d41157a9bef1550cdbefecdefa1b Mon Sep 17 00:00:00 2001
> rom: Manfred Spraul <manfred@...orfullife.com>
> Date: Sun, 2 Nov 2008 20:38:11 +0100
> Subject: [PATCH] kernel/rcupdate.c: add generic rcu statistics
> 
> The patch adds a generic file to debugfs that contains a statistic about
> the performance of the rcu subsystem.
> The code adds a noticeable overhead, thus do not enable it unless you
> are interested in measuring the rcu performance.
> 
> The patch is a hack, for example it doesn't differentiate between
> call_rcu() and call_rcu_bh()
> 
> Signed-off-by: Manfred Spraul <manfred@...orfullife.com>
> 
> ---
>  include/linux/rcupdate.h |   24 ++++++++
>  init/Kconfig             |    9 +++
>  kernel/rcuclassic.c      |    6 +-
>  kernel/rcupdate.c        |  133 ++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/rcupreempt.c      |    6 +-
>  5 files changed, 172 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 69c81e2..fa23572 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -49,6 +49,9 @@
>   */
>  struct rcu_head {
>  	struct rcu_head *next;
> +#ifdef CONFIG_RCU_BENCHMARK
> +	unsigned long long time;
> +#endif
>  	void (*func)(struct rcu_head *head);
>  };
> 
> @@ -66,6 +69,27 @@ struct rcu_head {
>         (ptr)->next = NULL; (ptr)->func = NULL; \
>  } while (0)
> 
> +#ifdef CONFIG_RCU_BENCHMARK
> +extern void rcu_mark_start(struct rcu_head *head);
> +extern void rcu_mark_completed(struct rcu_head *head);
> +#endif
> +
> +static inline void rcu_inithead(struct rcu_head *head, void (*func)(struct rcu_head *head))
> +{
> +	head->func = func;
> +#ifdef CONFIG_RCU_BENCHMARK
> +	rcu_mark_start(head);
> +#endif
> +}
> +
> +static inline void rcu_callback(struct rcu_head *head)
> +{
> +#ifdef CONFIG_RCU_BENCHMARK
> +	rcu_mark_completed(head);
> +#endif
> +	head->func(head);
> +}
> +
>  /**
>   * rcu_read_lock - mark the beginning of an RCU read-side critical section.
>   *
> diff --git a/init/Kconfig b/init/Kconfig
> index 2227bad..ceeec8c 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -924,6 +924,15 @@ source "block/Kconfig"
>  config PREEMPT_NOTIFIERS
>  	bool
> 
> +config RCU_BENCHMARK
> +	bool
> +	default y
> +	depends on STOP_MACHINE
> +	help
> +	  This option adds per-rcu head statistics about the latency
> +	  of the rcu callbacks.
> +	  If unsure, say N.
> +
>  config STATE_RCU
>  	bool
>  	default y
> diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
> index e14e6b2..6774fcf 100644
> --- a/kernel/rcuclassic.c
> +++ b/kernel/rcuclassic.c
> @@ -261,7 +261,7 @@ void call_rcu(struct rcu_head *head,
>  {
>  	unsigned long flags;
> 
> -	head->func = func;
> +	rcu_inithead(head, func);
>  	local_irq_save(flags);
>  	__call_rcu(head, &rcu_ctrlblk, &__get_cpu_var(rcu_data));
>  	local_irq_restore(flags);
> @@ -289,7 +289,7 @@ void call_rcu_bh(struct rcu_head *head,
>  {
>  	unsigned long flags;
> 
> -	head->func = func;
> +	rcu_inithead(head, func);
>  	local_irq_save(flags);
>  	__call_rcu(head, &rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
>  	local_irq_restore(flags);
> @@ -343,7 +343,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
>  	while (list) {
>  		next = list->next;
>  		prefetch(next);
> -		list->func(list);
> +		rcu_callback(list);
>  		list = next;
>  		if (++count >= rdp->blimit)
>  			break;
> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> index ad63af8..1b4eca5 100644
> --- a/kernel/rcupdate.c
> +++ b/kernel/rcupdate.c
> @@ -44,6 +44,9 @@
>  #include <linux/cpu.h>
>  #include <linux/mutex.h>
>  #include <linux/module.h>
> +#include <linux/fs.h>
> +#include <linux/debugfs.h>
> +#include <linux/stop_machine.h>
> 
>  enum rcu_barrier {
>  	RCU_BARRIER_STD,
> @@ -163,6 +166,136 @@ void rcu_barrier_sched(void)
>  }
>  EXPORT_SYMBOL_GPL(rcu_barrier_sched);
> 
> +#ifdef CONFIG_RCU_BENCHMARK
> +
> +DEFINE_PER_CPU(unsigned long long, rcu_time_entries);
> +DEFINE_PER_CPU(unsigned long long, rcu_time_val);
> +DEFINE_PER_CPU(unsigned long long, rcu_time_square);
> +
> +#define RCU_TIME_SCALING	1000
> +
> +void rcu_mark_start(struct rcu_head *head)
> +{
> +	struct timespec tv;
> +
> +	getnstimeofday(&tv);
> +
> +	head->time = tv.tv_sec;
> +	head->time *= NSEC_PER_SEC;
> +	head->time += tv.tv_nsec;
> +}
> +static DEFINE_RATELIMIT_STATE(rcumark_rs, DEFAULT_RATELIMIT_INTERVAL,
> +				DEFAULT_RATELIMIT_BURST);
> +void rcu_mark_completed(struct rcu_head *head)
> +{
> +	unsigned long flags;
> +	struct timespec tv;
> +	unsigned long long now;
> +
> +	getnstimeofday(&tv);
> +	now = tv.tv_sec;
> +	now *= NSEC_PER_SEC;
> +	now += tv.tv_nsec;
> +	now -= head->time;
> +	/* safety check, against uninitialized rcu heads */
> +	WARN_ON_RATELIMIT(now > 600*NSEC_PER_SEC, &rcumark_rs);
> +
> +	now /= RCU_TIME_SCALING;
> +	local_irq_save(flags);
> +	__get_cpu_var(rcu_time_entries)++;
> +	__get_cpu_var(rcu_time_val) += now;
> +	now = now*now;
> +	__get_cpu_var(rcu_time_square) += now;
> +	local_irq_restore(flags);
> +}
> +
> +#define RCUMARK_BUFSIZE (256*NR_CPUS)
> +
> +static char rcumark_buf[RCUMARK_BUFSIZE];
> +static DEFINE_MUTEX(rcumark_mutex);
> +
> +static ssize_t rcumark_read(struct file *filp, char __user *buffer,
> +				size_t count, loff_t *ppos)
> +{
> +	int i;
> +	ssize_t bcount;
> +	char *buf = rcumark_buf;
> +	char *ebuf = &rcumark_buf[RCUMARK_BUFSIZE];
> +	unsigned long long entries, val, square;
> +
> +	mutex_lock(&rcumark_mutex);
> +
> +	entries = val = square = 0;
> +	for_each_possible_cpu(i) {
> +		buf += snprintf(buf, ebuf - buf,
> +			"rcu: cpu %d completed=%Ld  val=%Ld square %Ld\n",
> +			i, per_cpu(rcu_time_entries, i),
> +			per_cpu(rcu_time_val, i), per_cpu(rcu_time_square, i));
> +		entries += per_cpu(rcu_time_entries, i);
> +		val += per_cpu(rcu_time_val, i);
> +		square += per_cpu(rcu_time_square, i);
> +	}
> +	buf += snprintf(buf, ebuf - buf, "total: completed=%Ld  val=%Ld square %Ld (scale %d)\n",
> +			entries, val, square, RCU_TIME_SCALING);
> +
> +	/* avg */
> +	val = val/entries;
> +	square = int_sqrt(square/entries - val*val);
> +	buf += snprintf(buf, ebuf - buf, "total: avg=%Ld stddev=%Ld steps/sec %ld\n",
> +			val, square, NSEC_PER_SEC/RCU_TIME_SCALING);
> +
> +	bcount = simple_read_from_buffer(buffer, count, ppos,
> +			rcumark_buf, strlen(rcumark_buf));
> +	mutex_unlock(&rcumark_mutex);
> +	return bcount;
> +}
> +
> +static int __rcumark_reset(void *unused)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		per_cpu(rcu_time_entries, i) = 0;
> +		per_cpu(rcu_time_val, i) = 0;
> +		per_cpu(rcu_time_square, i) = 0;
> +	}
> +	return 0;
> +}
> +
> +static ssize_t rcumark_write(struct file *filp, const char __user *buffer,
> +				size_t count, loff_t *ppos)
> +{
> +	mutex_lock(&rcumark_mutex);
> +	stop_machine(__rcumark_reset, NULL, NULL);
> +	mutex_unlock(&rcumark_mutex);
> +
> +	return count;
> +}
> +
> +static struct file_operations rcumark_fops = {
> +	.owner = THIS_MODULE,
> +	.read = rcumark_read,
> +	.write = rcumark_write
> +};
> +
> +static struct dentry *rcudir, *mdata;
> +
> +static int __init rcu_mark_init(void)
> +{
> +	rcudir = debugfs_create_dir("rcumark", NULL);
> +	if (!rcudir)
> +		goto out;
> +	mdata = debugfs_create_file("markdata", 0444, rcudir,
> +						NULL, &rcumark_fops);
> +
> +out:
> +	return 0;
> +}
> +
> +module_init(rcu_mark_init);
> +
> +#endif /* CONFIG_RCU_BENCHMARK */
> +
>  void __init rcu_init(void)
>  {
>  	__rcu_init();
> diff --git a/kernel/rcupreempt.c b/kernel/rcupreempt.c
> index 7a8849b..54326c0 100644
> --- a/kernel/rcupreempt.c
> +++ b/kernel/rcupreempt.c
> @@ -1100,7 +1100,7 @@ static void rcu_process_callbacks(struct softirq_action *unused)
>  	spin_unlock_irqrestore(&rdp->lock, flags);
>  	while (list) {
>  		next = list->next;
> -		list->func(list);
> +		rcu_callback(list);
>  		list = next;
>  		RCU_TRACE_ME(rcupreempt_trace_invoke);
>  	}
> @@ -1111,7 +1111,7 @@ void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
>  	unsigned long flags;
>  	struct rcu_data *rdp;
> 
> -	head->func = func;
> +	rcu_inithead(head, func);
>  	head->next = NULL;
>  	local_irq_save(flags);
>  	rdp = RCU_DATA_ME();
> @@ -1130,7 +1130,7 @@ void call_rcu_sched(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
>  	struct rcu_data *rdp;
>  	int wake_gp = 0;
> 
> -	head->func = func;
> +	rcu_inithead(head, func);
>  	head->next = NULL;
>  	local_irq_save(flags);
>  	rdp = RCU_DATA_ME();
> -- 
> 1.5.6.5
> 


View attachment "rcureadperf.c" of type "text/x-csrc" (8584 bytes)