Date:	Fri, 5 Sep 2008 16:52:35 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	paulmck@...ux.vnet.ibm.com
Cc:	linux-kernel@...r.kernel.org, cl@...ux-foundation.org,
	mingo@...e.hu, manfred@...orfullife.com, dipankar@...ibm.com,
	josht@...ux.vnet.ibm.com, schamp@....com, niv@...ibm.com,
	dvhltc@...ibm.com, ego@...ibm.com, laijs@...fujitsu.com,
	rostedt@...dmis.org, peterz@...radead.org, penberg@...helsinki.fi,
	andi@...stfloor.org
Subject: Re: [PATCH, RFC] v4 scalable classic RCU implementation

On Fri, 5 Sep 2008 16:04:11 -0700 "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com> wrote:
>
> ...
>
> > > +#if (NR_CPUS) <= RCU_FANOUT
> > > +#  define NUM_RCU_LVLS	      1
> > > +#  define NUM_RCU_LVL_0	      1
> > > +#  define NUM_RCU_LVL_1	      (NR_CPUS)
> > > +#  define NUM_RCU_LVL_2	      0
> > > +#  define NUM_RCU_LVL_3	      0
> > > +#elif (NR_CPUS) <= RCU_FANOUT_SQ
> > > +#  define NUM_RCU_LVLS	      2
> > > +#  define NUM_RCU_LVL_0	      1
> > > +#  define NUM_RCU_LVL_1	      (((NR_CPUS) + RCU_FANOUT - 1) / RCU_FANOUT)
> > > +#  define NUM_RCU_LVL_2	      (NR_CPUS)
> > > +#  define NUM_RCU_LVL_3	      0
> > > +#elif (NR_CPUS) <= RCU_FANOUT_CUBE
> > > +#  define NUM_RCU_LVLS	      3
> > > +#  define NUM_RCU_LVL_0	      1
> > > +#  define NUM_RCU_LVL_1	      (((NR_CPUS) + RCU_FANOUT_SQ - 1) / RCU_FANOUT_SQ)
> > > +#  define NUM_RCU_LVL_2	      (((NR_CPUS) + (RCU_FANOUT) - 1) / (RCU_FANOUT))
> > > +#  define NUM_RCU_LVL_3	      NR_CPUS
> > > +#else
> > > +# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
> > > +#endif /* #if (NR_CPUS) <= RCU_FANOUT */
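
For concreteness, the ladder above can be exercised outside the kernel. The
standalone userspace sketch below hard-codes NR_CPUS=1024 and RCU_FANOUT=64 as
purely illustrative values (they match the example discussed later in this
message); it is not part of the patch.

#include <stdio.h>

#define NR_CPUS		1024	/* hypothetical; vendor configs vary */
#define RCU_FANOUT	64	/* 64-bit default per this thread */
#define RCU_FANOUT_SQ	(RCU_FANOUT * RCU_FANOUT)
#define RCU_FANOUT_CUBE	(RCU_FANOUT_SQ * RCU_FANOUT)

#if (NR_CPUS) <= RCU_FANOUT
#  define NUM_RCU_LVLS	1
#  define NUM_RCU_LVL_0	1
#  define NUM_RCU_LVL_1	(NR_CPUS)
#  define NUM_RCU_LVL_2	0
#  define NUM_RCU_LVL_3	0
#elif (NR_CPUS) <= RCU_FANOUT_SQ
#  define NUM_RCU_LVLS	2
#  define NUM_RCU_LVL_0	1
#  define NUM_RCU_LVL_1	(((NR_CPUS) + RCU_FANOUT - 1) / RCU_FANOUT)
#  define NUM_RCU_LVL_2	(NR_CPUS)
#  define NUM_RCU_LVL_3	0
#elif (NR_CPUS) <= RCU_FANOUT_CUBE
#  define NUM_RCU_LVLS	3
#  define NUM_RCU_LVL_0	1
#  define NUM_RCU_LVL_1	(((NR_CPUS) + RCU_FANOUT_SQ - 1) / RCU_FANOUT_SQ)
#  define NUM_RCU_LVL_2	(((NR_CPUS) + RCU_FANOUT - 1) / RCU_FANOUT)
#  define NUM_RCU_LVL_3	NR_CPUS
#else
#  error "RCU_FANOUT insufficient for NR_CPUS"
#endif

int main(void)
{
	/* 1024 > 64 but 1024 <= 64^2 = 4096, so the middle branch fires:
	 * two levels, one root, ceil(1024/64) = 16 leaf groups. */
	printf("levels=%d lvl0=%d lvl1=%d lvl2=%d lvl3=%d\n",
	       NUM_RCU_LVLS, NUM_RCU_LVL_0, NUM_RCU_LVL_1,
	       NUM_RCU_LVL_2, NUM_RCU_LVL_3);
	return 0;	/* prints: levels=2 lvl0=1 lvl1=16 lvl2=1024 lvl3=0 */
}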
> > 
> > Using NR_CPUS for anything at all is grossly, grossly inaccurate. 
> > Vendors can and will ship kernels with NR_CPUS=1024 and their customers
> > can and will run those kernels on 4-cpu machines.  Lots of customers.
> > 
> > That's a two-and-a-half-order-of-magnitude inaccuracy.  It makes all
> > your above work meaningless.
> > 
> > To be useful, these decisions should be made at runtime.
> 
> I agree in principle, but this case is an exception.
> 
> Suppose that we have NR_CPUS=1024 on a 4-CPU 64-bit machine.  Since 64^2
> is greater than 1024, we end up with a two-level hierarchy, with one
> rcu_node structure at the root and 16 rcu_node leaf structures, each
> of which takes up a single 128-byte cache line.  There will be two such
> hierarchies in the system, one for rcu and one for rcu_bh.
> 
> So I do not believe that this will be a problem.
> 
> On laptops, this is even less of an issue -- NR_CPUS=8 on my laptop,
> which would reduce to a pair of rcu_node structures, one for rcu, the other
> for rcu_bh.

Is it likely that anyone will ship kernels with NR_CPUS=8?  What are
distros presently using, and what will they be using 1-2 years hence?

> Making the decision at runtime would bloat the code by much more than the
> extra data consumed.  And introduce yet more races between online/offline
> and everything else.  Besides, this way I am being bloat-compatible
> with DEFINE_PER_CPU().  ;-)
> 
> > > +#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
> > > +#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
> > > +
> > > +/*
> > > + * Definition for node within the RCU grace-period-detection hierarchy.
> > > + */
> > > +struct rcu_node {
> > > +	spinlock_t lock;
> > > +	unsigned long qsmask;	/* CPUs or groups that need to switch in */
> > > +				/*  order for current grace period to proceed.*/
> > > +	unsigned long qsmaskinit;
> > > +				/* Per-GP initialization for qsmask. */
> > > +	unsigned long grpmask;	/* Mask to apply to parent qsmask. */
> > > +	int	grplo;		/* lowest-numbered CPU or group here. */
> > > +	int	grphi;		/* highest-numbered CPU or group here. */
> > > +	u8	grpnum;		/* CPU/group number for next level up. */
> > > +	u8	level;		/* root is at level 0. */
> > > +	struct rcu_node *parent;
> > > +} ____cacheline_internodealigned_in_smp;
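
Plugging the same illustrative values (NR_CPUS=1024, RCU_FANOUT=64) into the
two macros above, and reading the bottom level of the sum as per-CPU slots
rather than rcu_node structures (which would be why NR_CPUS is subtracted):

	RCU_SUM       = 1 + 16 + 1024 + 0 = 1041
	NUM_RCU_NODES = 1041 - 1024       = 17	(1 root + 16 leaves)

which matches the "one rcu_node structure at the root and 16 rcu_node leaf
structures" figure earlier in this message.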
> > 
> > So this is a 4096-byte structure on some setups.
> 
> You lost me on this one.  On a 64-bit system, I have 8 bytes each for
> lock, qsmask, qsmaskinit, grpmask, and parent, four bytes each for grplo
> and grphi, and another byte each for grpnum and level, for a total of
> 50 bytes for each struct rcu_node, which comes to a single cache line
> for most large systems.  Given the default CONFIG_RCU_FANOUT=64 and
> NR_CPUS=4096, we have a two-level hierarchy with one root rcu_node
> structure and 64 leaf rcu_node structures.  This gives a total of
> 65 cache lines.

____cacheline_internodealigned_in_smp will expand this structure to
4096 bytes on CONFIG_X86_VSMP=y.
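
The sizes being traded in the last two messages can be checked with a
standalone userspace sketch; spinlock_t is modeled here as the 8 bytes Paul
assumes, though its real size is config-dependent:

#include <stdio.h>
#include <stdint.h>

/* Field-for-field model of struct rcu_node on an LP64 target. */
struct rcu_node_model {
	uint64_t lock;			/* stand-in for spinlock_t */
	unsigned long qsmask;
	unsigned long qsmaskinit;
	unsigned long grpmask;
	int grplo;
	int grphi;
	uint8_t grpnum;
	uint8_t level;
	struct rcu_node_model *parent;
};

int main(void)
{
	/* Payload: 4*8 + 2*4 + 2*1 + 8 = 50 bytes, as Paul counts;
	 * the compiler pads the struct to 56 for pointer alignment. */
	printf("sizeof = %zu\n", sizeof(struct rcu_node_model));
	/* ____cacheline_internodealigned_in_smp then rounds each
	 * instance up to the internode cache line: 128 bytes on many
	 * large systems, but 4096 with CONFIG_X86_VSMP=y, which is
	 * the expansion Andrew is pointing at. */
	return 0;
}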

> > It's a pretty big pill to swallow.  Nice performance testing results
> > will help it to slide down.
> 
> Well, the read side is easy -- exactly the same code sequence as for
> Classic RCU.  On the update side, this is more of a bug fix for large
> numbers of CPUs, where unadorned Classic RCU is said to suffer terminal
> lock contention.  I will see what I can come up with, but at the end of
> the day, this will need some testing on machines larger than the 128-CPU
> systems that I have access to.
> 
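
(The "same code sequence as for Classic RCU" read side that Paul refers to is
the usual rcu_read_lock()/rcu_read_unlock() pairing; a minimal kernel-style
illustration, with foo and gp as made-up names:)

#include <linux/rcupdate.h>

struct foo {
	int val;
};
static struct foo *gp;	/* hypothetical RCU-protected pointer */

int reader(void)
{
	int v;

	rcu_read_lock();		/* no locks or atomics on this path */
	v = rcu_dereference(gp)->val;	/* assumes gp is non-NULL here */
	rcu_read_unlock();
	return v;
}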

OK, thanks.  As it's effectively a bugfix, a full description of the
bug would grease some wheels.

