Message-Id: <1219447772.5197.98.camel@josh-work.beaverton.ibm.com>
Date: Fri, 22 Aug 2008 16:29:32 -0700
From: Josh Triplett <josht@...ux.vnet.ibm.com>
To: paulmck@...ux.vnet.ibm.com
Cc: linux-kernel@...r.kernel.org, cl@...ux-foundation.org,
mingo@...e.hu, akpm@...ux-foundation.org, manfred@...orfullife.com,
dipankar@...ibm.com, schamp@....com, niv@...ibm.com,
dvhltc@...ibm.com, ego@...ibm.com, laijs@...fujitsu.com,
rostedt@...dmis.org
Subject: Re: [PATCH, RFC, tip/core/rcu] scalable classic RCU implementation
On Thu, 2008-08-21 at 16:43 -0700, Paul E. McKenney wrote:
> Hello!
>
> Experimental, not for inclusion.
>
> Attached is a patch to Classic RCU that applies a hierarchy, greatly
> reducing the contention on the top-level lock for large machines.
> This passes mild rcutorture testing on x86 and ppc64, but is most
> definitely not ready for inclusion. It is OK for experimental work
> assuming sufficiently brave experimenters. See also Manfred Spraul's
> patch at http://lkml.org/lkml/2008/8/21/336 (or his earlier work from
> 2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
> We will converge onto a common patch in the fullness of time, but
> are currently exploring different regions of the design space.
>
> This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
> of the RCU hierarchy. Defaults to 32 on 32-bit machines and 64 on
> 64-bit machines. If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
> there is no hierarchy. By default, the RCU initialization code will
> adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
> architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
> this balancing, allowing the hierarchy to be exactly aligned to the
> underlying hardware. Up to two levels of hierarchy are permitted
> (in addition to the root node), allowing up to 16,384 CPUs on 32-bit
> systems and up to 262,144 CPUs on 64-bit systems. I just know that I
> am going to regret saying this, but this seems more than sufficient
> for the foreseeable future. (Some architectures might wish to set
> CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
> If this becomes a real problem, additional levels can be added, but I
> doubt that it will make a significant difference on real hardware.)
>
> In the common case, a given CPU will manipulate its private rcu_data
> structure and the rcu_node structure that it shares with its immediate
> neighbors. This can reduce both lock and memory contention by multiple
> orders of magnitude, which should eliminate the need for the strange
> manipulations that are reported to be required when running Linux on
> very large systems.
>
> Some shortcomings:
>
> o The interface to dynticks is clumsy at best. Improvements
> on the way.
>
> o CPU onlining and offlining is probably broken. Will be tested.
>
> o The check-CPU-stalls code is busted. Will be fixed.
>
> o There are probably hangs, rcutorture failures, &c.
>
> o There is not yet a human-readable design document. Will be fixed.
>
> o The largest machine I can get my hands on at the moment only
> has 8 CPUs, which really doesn't stress this algorithm much.
>
> If you want to use this against a Linus kernel, the following will work:
>
> Start with 2.6.27-rc3.
>
> Apply http://www.rdrop.com/users/paulmck/patches/paulmck-rcu.2008.08.20a.patch
> which catches you up to a recent linux-2.6-tip tip/core/rcu commit.
>
> Apply http://www.rdrop.com/users/paulmck/patches/2.6.27-rc3-hierRCU-6.patch
> which gets you the current hierarchical RCU implementation.
>
> Thoughts?
Looks great. A few comments below.
> Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> ---
>
> include/linux/rcuclassic.h | 154 ++++--
> kernel/Kconfig.preempt | 31 +
> kernel/Makefile | 3
> kernel/rcuclassic.c | 1095 +++++++++++++++++++++++++++++----------------
> kernel/rcuclassic_trace.c | 219 +++++++++
> 5 files changed, 1076 insertions(+), 426 deletions(-)
>
> diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
> index 1658995..97e646a 100644
> --- a/include/linux/rcuclassic.h
> +++ b/include/linux/rcuclassic.h
> @@ -18,6 +18,7 @@
> * Copyright IBM Corporation, 2001
", 2008", here and elsewhere?
> * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> @@ -26,8 +27,10 @@
> * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
> *
> * For detailed explanation of Read-Copy Update mechanism see -
> - * Documentation/RCU
> - *
> + * Documentation/RCU
> + * http://lwn.net/Articles/262464/ (What is RCU, Fundamentally?)
> + * http://lwn.net/Articles/263130/ (What is RCU's Usage?)
> + * http://lwn.net/Articles/264090/ (What is RCU's API? + references)
> */
Why put these references here rather than in Documentation/RCU? It
seems easier to keep documentation up to date in one place. If you
think these represent a good "getting started" set of documents, how
about adding a Documentation/RCU/ReadTheseFirst that links to them, or
linking to them from whatisRCU.txt?
> #ifndef __LINUX_RCUCLASSIC_H
> @@ -40,69 +43,136 @@
> #include <linux/cpumask.h>
> #include <linux/seqlock.h>
>
> +/*
> + * Define the shape of the rcu_node hierarchy based on NR_CPUS and
> + * CONFIG_RCU_FANOUT.
> + */
>
> -/* Global control variables for rcupdate callback mechanism. */
> -struct rcu_ctrlblk {
> - long cur; /* Current batch number. */
> - long completed; /* Number of the last completed batch */
> - long pending; /* Number of the last pending batch */
> -#ifdef CONFIG_DEBUG_RCU_STALL
> - unsigned long gp_check; /* Time grace period should end, in seconds. */
> -#endif /* #ifdef CONFIG_DEBUG_RCU_STALL */
> -
> - int signaled;
> +#define MAX_RCU_LEVELS 3
> +#if NR_CPUS <= CONFIG_RCU_FANOUT
> +#define NUM_RCU_LEVELS 1
> +#define NUM_RCU_LEVEL_1 1
> +#define NUM_RCU_LEVEL_2 NR_CPUS
> +#define NUM_RCU_LEVEL_3 0
> +#define NUM_RCU_LEVEL_4 0
> +#define NUM_RCU_NODES NUM_RCU_LEVEL_1
> +#elif NR_CPUS <= CONFIG_RCU_FANOUT * CONFIG_RCU_FANOUT
> +#define NUM_RCU_LEVELS 2
> +#define NUM_RCU_LEVEL_1 1
> +#define NUM_RCU_LEVEL_2 \
> + (((NR_CPUS) + (CONFIG_RCU_FANOUT) - 1) / (CONFIG_RCU_FANOUT))
> +#define NUM_RCU_LEVEL_3 NR_CPUS
> +#define NUM_RCU_LEVEL_4 0
> +#define NUM_RCU_NODES \
> + ((NUM_RCU_LEVEL_1) + (NUM_RCU_LEVEL_2))
> +#elif NR_CPUS <= CONFIG_RCU_FANOUT * CONFIG_RCU_FANOUT * CONFIG_RCU_FANOUT
> +#define NUM_RCU_LEVELS 3
> +#define RCU_FANOUT_SQ ((CONFIG_RCU_FANOUT) * (CONFIG_RCU_FANOUT))
> +#define NUM_RCU_LEVEL_1 1
> +#define NUM_RCU_LEVEL_2 \
> + (((NR_CPUS) + (RCU_FANOUT_SQ) - 1) / (RCU_FANOUT_SQ))
> +#define NUM_RCU_LEVEL_3 \
> + ((NR_CPUS) + (CONFIG_RCU_FANOUT) - 1) / (CONFIG_RCU_FANOUT)
> +#define NUM_RCU_LEVEL_4 NR_CPUS
> +#define NUM_RCU_NODES \
> + ((NUM_RCU_LEVEL_1) + \
> + (NUM_RCU_LEVEL_2) + \
> + (NUM_RCU_LEVEL_3))
> +#else
> +#error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
> +#endif
This should get replaced by the revised version you followed up with.
> - spinlock_t lock ____cacheline_internodealigned_in_smp;
> - cpumask_t cpumask; /* CPUs that need to switch in order */
> - /* for current batch to proceed. */
> +/*
> + * Definition for node within the RCU grace-period-detection hierarchy.
> + */
> +struct rcu_node {
> + spinlock_t lock;
> + unsigned long qsmask; /* CPUs or groups that need to switch in */
> + /* order for current grace period to proceed.*/
> + unsigned long qsmaskinit;
> + /* Per-GP initialization for qsmask. */
> + int grplo; /* lowest-numbered CPU or group here. */
> + int grphi; /* highest-numbered CPU or group here. */
> + char grpnum; /* CPU/group number for next level up. */
> + char level; /* root is at level 0. */
These four fields should use sized types, and preferably unsigned types.
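For example (untested; assuming grpnum and level always fit in a byte,
and picking u32 for the CPU/group ranges):

	u32 grplo;	/* lowest-numbered CPU or group here. */
	u32 grphi;	/* highest-numbered CPU or group here. */
	u8 grpnum;	/* CPU/group number for next level up. */
	u8 level;	/* root is at level 0. */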
> + struct rcu_node *parent;
> } ____cacheline_internodealigned_in_smp;
>
> -/* Is batch a before batch b ? */
> -static inline int rcu_batch_before(long a, long b)
> -{
> - return (a - b) < 0;
> -}
> +/*
> + * RCU global state, including node hierarchy. This hierarchy is
> + * represented in "heap" form in a dense array. The root (first level)
> + * of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
> + * level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]),
> + * and the third level in ->node[m+1] and following (->node[m+1] referenced
> + * by ->level[2]). The number of levels is determined by the number of
> + * CPUs and by CONFIG_RCU_FANOUT. Small systems will have a "hierarchy"
> + * consisting of a single rcu_node.
> + */
> +struct rcu_state {
> + struct rcu_node node[NUM_RCU_NODES]; /* Hierarchy. */
> + struct rcu_node *level[NUM_RCU_LEVELS]; /* Hierarchy levels. */
> + int levelcnt[MAX_RCU_LEVELS + 1]; /* # nodes in each level. */
> + int levelspread[NUM_RCU_LEVELS]; /* kids/node in each level. */
These two should use sized types.
> +
> + /* The following fields are guarded by the root rcu_node's lock. */
> +
> + char signaled ____cacheline_internodealigned_in_smp;
> + /* sent GP-kick IPIs? */
u8 or bool, depending on semantics. If just a simple flag, how about
bool?
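E.g., keeping the cacheline alignment:

	bool signaled ____cacheline_internodealigned_in_smp;
						/* Sent GP-kick IPIs? */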
> + long gpnum; /* Current gp number. */
> + long completed; /* # of last completed gp. */
> + spinlock_t onofflock; /* exclude on/offline and */
> + /* starting new GP. */
> +};
>
> -/* Is batch a after batch b ? */
> -static inline int rcu_batch_after(long a, long b)
> -{
> - return (a - b) > 0;
> -}
> +#define RCU_DONE_TAIL 0 /* Also RCU_WAIT head. */
> +#define RCU_WAIT_TAIL 1 /* Also RCU_NEXT_READY head. */
> +#define RCU_NEXT_READY_TAIL 2 /* Also RCU_NEXT head. */
> +#define RCU_NEXT_TAIL 3
> +#define RCU_NEXT_SIZE 4
>
> /* Per-CPU data for Read-Copy UPdate. */
> struct rcu_data {
> - /* 1) quiescent state handling : */
> - long quiescbatch; /* Batch # for grace period */
> - int passed_quiesc; /* User-mode/idle loop etc. */
> - int qs_pending; /* core waits for quiesc state */
> + /* 1) quiescent-state and grace-period handling : */
> + long completed; /* Track rsp->completed gp number */
> + /* in order to detect GP end. */
> + long gpnum; /* Highest gp number that this CPU */
> + /* is aware of having started. */
> + int passed_quiesc; /* User-mode/idle loop etc. */
> + int qs_pending; /* Core waits for quiesc state. */
Looks like several of these lines changed only in whitespace, not in
content.
The same comment about sized types applies here, but these fields didn't
actually change in this patch.
> + struct rcu_node *mynode; /* This CPU's leaf of hierarchy */
>
> /* 2) batch handling */
> /*
> - * if nxtlist is not NULL, then:
> - * batch:
> - * The batch # for the last entry of nxtlist
> - * [*nxttail[1], NULL = *nxttail[2]):
> - * Entries that batch # <= batch
> + * If nxtlist is not NULL, it is partitioned as follows.
> + * Any of the partitions might be empty, in which case the
> + * pointer to that partition will be equal to the pointer for
> + * the following partition. When the list is empty, all of
> + * the nxttail elements point to nxtlist, which is NULL.
> + *
> + * [*nxttail[2], NULL = *nxttail[3]):
> + * Entries that might have arrived after current GP ended
> + * [*nxttail[1], *nxttail[2]):
> + * Entries known to have arrived before current GP ended
> * [*nxttail[0], *nxttail[1]):
> - * Entries that batch # <= batch - 1
> + * Entries that batch # <= ->completed - 1: waiting for current GP
> * [nxtlist, *nxttail[0]):
> - * Entries that batch # <= batch - 2
> + * Entries that batch # <= ->completed
> * The grace period for these entries has completed, and
> * the other grace-period-completed entries may be moved
> * here temporarily in rcu_process_callbacks().
> */
> - long batch;
> struct rcu_head *nxtlist;
> - struct rcu_head **nxttail[3];
> - long qlen; /* # of queued callbacks */
> - struct rcu_head *donelist;
> - struct rcu_head **donetail;
> - long blimit; /* Upper limit on a processed batch */
> + struct rcu_head **nxttail[RCU_NEXT_SIZE];
> + long qlen; /* # of queued callbacks */
> + long blimit; /* Upper limit on a processed batch */
Some whitespace-only changes again here; several of these lines didn't
otherwise change.
> int cpu;
> struct rcu_head barrier;
> };
>
> +extern struct rcu_state rcu_state;
> DECLARE_PER_CPU(struct rcu_data, rcu_data);
> +
> +extern struct rcu_state rcu_bh_state;
> DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
Why extern and in the header? I don't see anything else using them.
> /*
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index 9fdba03..43062bf 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -68,7 +68,6 @@ config PREEMPT_RCU
>
> config RCU_TRACE
> bool "Enable tracing for RCU - currently stats in debugfs"
> - depends on PREEMPT_RCU
Might want to document in the commit message that RCU_TRACE provides
tracing information, and that it now applies to non-preemptible RCU as
well.
> select DEBUG_FS
> default y
> help
> @@ -77,3 +76,33 @@ config RCU_TRACE
>
> Say Y here if you want to enable RCU tracing
> Say N if you are unsure.
> +
> +config RCU_FANOUT
> + int "Hierarchical RCU fanout value"
> + range 2 64 if 64BIT
> + range 2 32 if !64BIT
> + depends on CLASSIC_RCU
> + default 64 if 64BIT
> + default 32 if !64BIT
> + help
> + This option controls the fanout of hierarchical implementations
> + of RCU, allowing RCU to work efficiently on machines with
> + large numbers of CPUs. This value must be at least the cube
> + root of NR_CPUS, which allows NR_CPUS up to 32,768 for 32-bit
> + systems and up to 262,144 for 64-bit systems.
> +
> + Select a specific number if testing RCU itself.
...or if attempting to tune for a specific NUMA system.
> + Take the default if unsure.
> +
> +config RCU_FANOUT_EXACT
> + bool "Disable hierarchical RCU auto-balancing"
> + depends on CLASSIC_RCU
> + default n
> + help
> + This option forces use of the exact RCU_FANOUT value specified,
> + regardless of imbalances in the hierarchy. This can be useful
> + on systems with strong NUMA behavior.
> +
> + Without RCU_FANOUT_EXACT, the code will balance the hierarchy.
You might want to give a specific example of a NUMA machine, the
appropriate value to use on that machine, and the result with and
without RCU_FANOUT_EXACT.
> + Say n if unsure.
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4e1d7df..d838fbd 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -75,6 +75,9 @@ obj-$(CONFIG_SECCOMP) += seccomp.o
> obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
> obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o
> obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o
> +ifeq ($(CONFIG_CLASSIC_RCU),y)
> +obj-$(CONFIG_RCU_TRACE) += rcuclassic_trace.o
> +endif
> ifeq ($(CONFIG_PREEMPT_RCU),y)
> obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o
> endif
It might actually make sense here to do this instead:
ifeq ($(CONFIG_RCU_TRACE),y)
obj-$(CONFIG_CLASSIC_RCU) += rcuclassic_trace.o
obj-$(CONFIG_PREEMPT_RCU) += rcupreempt_trace.o
endif
> diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
> index 01e761a..5584b22 100644
> --- a/kernel/rcuclassic.c
> +++ b/kernel/rcuclassic.c
> @@ -27,7 +28,10 @@
> * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
> *
> * For detailed explanation of Read-Copy Update mechanism see -
> - * Documentation/RCU
> + * Documentation/RCU
> + * http://lwn.net/Articles/262464/ (What is RCU, Fundamentally?)
> + * http://lwn.net/Articles/263130/ (What is RCU's Usage?)
> + * http://lwn.net/Articles/264090/ (What is RCU's API? + references)
Same comment as before; maintaining these in a single place seems
easier.
> +struct rcu_state rcu_state = RCU_STATE_INITIALIZER(rcu_state);
> DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
> +
> +struct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh_state);
> DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
How about making these state structures static, along with removing the
extern in the header?
> -static int blimit = 10;
> -static int qhimark = 10000;
> -static int qlowmark = 100;
> +static int blimit = 10; /* Maximum callbacks per softirq. */
> +static int qhimark = 10000; /* If this many pending, ignore blimit. */
> +static int qlowmark = 100; /* Once only this many pending, use blimit. */
Indentation mismatch on the comments?
> #ifdef CONFIG_SMP
> -static void force_quiescent_state(struct rcu_data *rdp,
> - struct rcu_ctrlblk *rcp)
> +static void force_quiescent_state(struct rcu_state *rsp)
> {
> int cpu;
> - cpumask_t cpumask;
> unsigned long flags;
>
> set_need_resched();
> - spin_lock_irqsave(&rcp->lock, flags);
> - if (unlikely(!rcp->signaled)) {
> - rcp->signaled = 1;
> + if (!spin_trylock_irqsave(&rsp->onofflock, flags))
> + return;
This seems to make force_quiescent_state rather less forceful.
> +/*
> + * Does the current CPU require a yet-as-unscheduled grace period?
> + */
> +static inline int
> +cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
> +{
> + return *rdp->nxttail[RCU_DONE_TAIL] &&
> + ACCESS_ONCE(rsp->completed) == ACCESS_ONCE(rsp->gpnum);
> +}
ACCESS_ONCE, like memory barriers, benefits from an accompanying
explanation.
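For instance, something along these lines, if I understand the intent
correctly:

	/*
	 * Use ACCESS_ONCE() because we read ->completed and ->gpnum
	 * without holding the root rcu_node's lock, so the compiler
	 * must not refetch or tear either load.
	 */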
> -#else
> +#else /* #ifdef CONFIG_HOTPLUG_CPU */
>
> -static void rcu_offline_cpu(int cpu)
> +static inline void
> +rcu_offline_cpu(int cpu)
> {
> }
No need to explicitly say "inline"; GCC should do the right thing here.
Same comment applies a couple of other places in your patch.
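In other words, just:

static void rcu_offline_cpu(int cpu)
{
}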
> @@ -658,14 +806,19 @@ int rcu_needs_cpu(int cpu)
> struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu);
>
> - return !!rdp->nxtlist || !!rdp_bh->nxtlist || rcu_pending(cpu);
> + return !!*rdp->nxttail[RCU_DONE_TAIL] ||
> + !!*rdp_bh->nxttail[RCU_DONE_TAIL] ||
> + rcu_pending(cpu);
!! seems unnecessary here.
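The || operators already yield 0 or 1, so this should work just as
well:

	return *rdp->nxttail[RCU_DONE_TAIL] ||
	       *rdp_bh->nxttail[RCU_DONE_TAIL] ||
	       rcu_pending(cpu);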
> +void call_rcu_bh(struct rcu_head *head,
> + void (*func)(struct rcu_head *rcu))
> +{
> + unsigned long flags;
> +
> + head->func = func;
> + head->next = NULL;
> + local_irq_save(flags);
> + __call_rcu(head, &rcu_bh_state, &__get_cpu_var(rcu_bh_data));
> + local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_bh);
This comment applies to the original code, but:
You only call __call_rcu twice, in call_rcu and call_rcu_bh. Both
times, you set up head first, then wrap the call with local_irq_save.
How about moving both the head initialization and the interrupt
disabling into __call_rcu, making call_rcu and call_rcu_bh one-liners?
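Untested sketch, assuming rcu_state grows an rda[] array of per-CPU
rcu_data pointers so that __call_rcu can look up the current CPU's
structure after interrupts are off:

static void __call_rcu(struct rcu_head *head,
		       void (*func)(struct rcu_head *rcu),
		       struct rcu_state *rsp)
{
	unsigned long flags;
	struct rcu_data *rdp;

	head->func = func;
	head->next = NULL;
	local_irq_save(flags);
	rdp = rsp->rda[smp_processor_id()];	/* safe: irqs are off */
	/* ... existing enqueue and force_quiescent_state logic ... */
	local_irq_restore(flags);
}

void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
{
	__call_rcu(head, func, &rcu_bh_state);
}
EXPORT_SYMBOL_GPL(call_rcu_bh);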
> --- /dev/null
> +++ b/kernel/rcuclassic_trace.c
> +static struct mutex rcuclassic_trace_mutex;
static DEFINE_MUTEX(rcuclassic_trace_mutex);
Then you don't need mutex_init later in your init function.
> +static char *rcuclassic_trace_buf;
> +#define RCUPREEMPT_TRACE_BUF_SIZE 4096
Did you perhaps want PAGE_SIZE?
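i.e., something like:

#define RCUCLASSIC_TRACE_BUF_SIZE PAGE_SIZE

(The RCUPREEMPT_ prefix also looks like a copy-and-paste leftover from
rcupreempt_trace.c.)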
- Josh Triplett