linux-kernel - Re: [PATCH tip/core/rcu 1/2] rcu: Parallelize and economize NOCB kthread wakeups

Message-ID: <20140808173710.GA13483@grmbl.mre>
Date:	Fri, 8 Aug 2014 23:07:10 +0530
From:	Amit Shah <amit.shah@...hat.com>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:	linux-kernel@...r.kernel.org, riel@...hat.com, mingo@...nel.org,
	laijs@...fujitsu.com, dipankar@...ibm.com,
	akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
	josh@...htriplett.org, niv@...ibm.com, tglx@...utronix.de,
	peterz@...radead.org, rostedt@...dmis.org, dhowells@...hat.com,
	edumazet@...gle.com, dvhart@...ux.intel.com, fweisbec@...il.com,
	oleg@...hat.com, sbw@....edu
Subject: Re: [PATCH tip/core/rcu 1/2] rcu: Parallelize and economize NOCB
 kthread wakeups

On (Fri) 08 Aug 2014 [09:25:02], Paul E. McKenney wrote:
> On Fri, Aug 08, 2014 at 02:10:56PM +0530, Amit Shah wrote:
> > On Friday 11 July 2014 07:05 PM, Paul E. McKenney wrote:
> > >From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
> > >
> > >An 80-CPU system with a context-switch-heavy workload can require so
> > >many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > >tens of percent of a CPU just awakening things.  This clearly will not
> > >scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > >get behind, increasing grace-period latency.
> > >
> > >To avoid this problem, this commit divides the NOCB kthreads into leaders
> > >and followers, where the grace-period kthreads awaken the leaders each of
> > >whom in turn awakens its followers.  By default, the number of groups of
> > >kthreads is the square root of the number of CPUs, but this default may
> > >be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > >This reduces the number of wakeups done per grace period by the RCU
> > >grace-period kthread by the square root of the number of CPUs, but of
> > >course by shifting those wakeups to the leaders.  In addition, because
> > >the leaders do grace periods on behalf of their respective followers,
> > >the number of wakeups of the followers decreases by up to a factor of two.
> > >Instead of being awakened once when new callbacks arrive and again
> > >at the end of the grace period, the followers are awakened only at
> > >the end of the grace period.
> > >
> > >For a numerical example, in a 4096-CPU system, the grace-period kthread
> > >would awaken 64 leaders, each of which would awaken its 63 followers
> > >at the end of the grace period.  This compares favorably with the 79
> > >wakeups for the grace-period kthread on an 80-CPU system.
> > >
> > >Reported-by: Rik van Riel <riel@...hat.com>
> > >Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
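
The arithmetic in the commit message above can be checked with a small
sketch.  It assumes the default stride is the integer square root of the
CPU count and that each leader is itself a member of its group; the
function name nocb_wakeup_plan() is hypothetical and this is not the
kernel's implementation.

```python
import math


def nocb_wakeup_plan(nr_cpus, leader_stride=None):
    """Return (leaders, max_followers_per_leader) for a strided grouping.

    Assumption: with no explicit stride, the stride is the integer square
    root of nr_cpus, as the commit message describes for the default case.
    """
    stride = leader_stride if leader_stride else max(1, math.isqrt(nr_cpus))
    # One leader per group of `stride` consecutive CPUs.
    leaders = math.ceil(nr_cpus / stride)
    # The leader occupies one slot in its group; the rest are followers.
    max_followers = min(stride, nr_cpus) - 1
    return leaders, max_followers


if __name__ == "__main__":
    for n in (2, 80, 4096):
        leaders, followers = nocb_wakeup_plan(n)
        print(f"{n} CPUs: {leaders} leaders, up to {followers} followers each")
```

Under these assumptions the 4096-CPU case gives 64 leaders with 63
followers each, matching the numbers quoted in the commit message.
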
> > 
> > This patch causes KVM guest boot to not proceed after a while.
> > .config is attached, and boot messages are appended.  This commit
> > was pointed to by bisect, and reverting on current master (while
> > addressing a trivial conflict) makes the boot work again.
> > 
> > The qemu cmdline is
> > 
> > ./x86_64-softmmu/qemu-system-x86_64 -m 512 -smp 2 -cpu
> > host,+kvmclock,+x2apic -enable-kvm  -kernel
> > ~/src/linux/arch/x86/boot/bzImage /guests/f11-auto.qcow2  -append
> > 'root=/dev/sda2 console=ttyS0 console=tty0' -snapshot -serial stdio
> 
> I cannot reproduce this.  I am at commit a7d7a143d0b4c, in case that
> makes a difference.

Yea; I'm at that commit too.  And the version of qemu doesn't matter;
happens on F20's qemu-kvm-1.6.2-7.fc20.x86_64 as well as qemu.git
compiled locally.

> There are some things in your dmesg that look quite strange to me, though.
> 
> You have "--smp 2" above, but in your dmesg I see the following:
> 
> 	[    0.000000] setup_percpu: NR_CPUS:4 nr_cpumask_bits:4
> 	nr_cpu_ids:1 nr_node_ids:1
> 
> So your run somehow only has one CPU.  RCU agrees that there is only
> one CPU:

Yea; indeed.  There are MTRR warnings too; I'm attaching the boot log of
the failed run (rcu-bad-notime.txt) and of the successful run
(rcu-good-notime.txt), and the diff between them is below.

The failed run is on commit a7d7a143d0b4cb1914705884ca5c25e322dba693
and the successful run has these reverted on top:

187497fa5e9e9383820d33e48b87f8200a747c2a
b58cc46c5f6b57f1c814e374dbc47176e6b4938e
fbce7497ee5af800a1c350c73f3c3f103cb27a15

> 	[    0.000000] Preemptible hierarchical RCU implementation.
> 	[    0.000000] 	RCU debugfs-based tracing is enabled.
> 	[    0.000000] 	RCU lockdep checking is enabled.
> 	[    0.000000] 	Additional per-CPU info printed with stalls.
> 	[    0.000000] 	RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=1.
> 	[    0.000000] 	Offload RCU callbacks from all CPUs
> 	[    0.000000] 	Offload RCU callbacks from CPUs: 0.
> 	[    0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
> 	[    0.000000] NO_HZ: Full dynticks CPUs: 1-3.
> 
> But NO_HZ thinks that there are four.  This appears to be due to NO_HZ
> looking at the compile-time constants, and I doubt that this would cause
> a problem.  But if there really is a CPU 1 that RCU doesn't know about,
> and it queues a callback, that callback will never be invoked, and you
> could easily see hangs.
> 
> Given that your .config says CONFIG_NR_CPUS=4 and your qemu says "--smp 2",
> why does nr_cpu_ids think that there is only one CPU?  Are you running
> this on a non-x86_64 CPU so that qemu only does UP or some such?

No; this is "Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz" on a ThinkPad
T420s.
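
The consistency check Paul describes above (nr_cpu_ids versus the CPUs
RCU offloads versus the NO_HZ full-dynticks set) can be done mechanically
on a saved boot log.  The following is only a sketch: the helper names
are hypothetical, and the regular expressions assume the message formats
quoted in this thread.

```python
import re


def parse_cpu_list(text):
    """Expand a kernel-style CPU list such as '0-1,3' into a set of ints."""
    cpus = set()
    for part in text.strip().rstrip(".").split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def cpu_view(dmesg):
    """Return (nr_cpu_ids, RCU-offloaded CPUs, NO_HZ full-dynticks CPUs)."""
    nr = re.search(r"nr_cpu_ids[:=](\d+)", dmesg)
    rcu = re.search(r"Offload RCU callbacks from CPUs: ([\d,\-]+)", dmesg)
    nohz = re.search(r"NO_HZ: Full dynticks CPUs: ([\d,\-]+)", dmesg)
    return (int(nr.group(1)),
            parse_cpu_list(rcu.group(1)),
            parse_cpu_list(nohz.group(1)))


# Lines taken from the dmesg excerpt quoted earlier in this message.
sample = """\
setup_percpu: NR_CPUS:4 nr_cpumask_bits:4 nr_cpu_ids:1 nr_node_ids:1
Offload RCU callbacks from CPUs: 0.
NO_HZ: Full dynticks CPUs: 1-3.
"""

if __name__ == "__main__":
    nr_cpu_ids, rcu_cpus, nohz_cpus = cpu_view(sample)
    # CPUs that NO_HZ lists but RCU does not offload callbacks from.
    print(nr_cpu_ids, sorted(nohz_cpus - rcu_cpus))
```

A non-empty difference only flags the mismatch that Paul points out; as
he notes, it may simply reflect NO_HZ using compile-time constants.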

In my attached boot logs, RCU does detect two cpus.  Here's the diff
between them.  I recompiled to remove the timing info so the diffs are
comparable:

$ diff -u /var/tmp/rcu-bad-notime.txt /var/tmp/rcu-good-notime.txt 
--- /var/tmp/rcu-bad-notime.txt	       2014-08-08 22:49:37.207745682 +0530
+++ /var/tmp/rcu-good-notime.txt       2014-08-08 22:49:04.886653844 +0530
@@ -1,6 +1,6 @@
 $ ./x86_64-softmmu/qemu-system-x86_64 -m 512 -smp 2 -cpu host,+kvmclock,+x2apic -enable-kvm  -kernel ~/src/linux/arch/x86/boot/bzImage /guests/f11-auto.qcow2  -append 'root=/dev/sda2 console=ttyS0 console=tty0'  -snapshot  -serial stdio
 Initializing cgroup subsys cpu
-Linux version 3.16.0+ (amit@...bl.mre) (gcc version 4.8.3 20140624 (Red Hat 4.8.3-1) (GCC) ) #79 SMP PREEMPT Fri Aug 8 22:47:38 IST 2014
+Linux version 3.16.0+ (amit@...bl.mre) (gcc version 4.8.3 20140624 (Red Hat 4.8.3-1) (GCC) ) #78 SMP PREEMPT Fri Aug 8 22:46:28 IST 2014
 Command line: root=/dev/sda2 console=ttyS0 console=tty0
 e820: BIOS-provided physical RAM map:
 BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
@@ -60,7 +60,7 @@
 e820: [mem 0x20000000-0xfeffbfff] available for PCI devices
 Booting paravirtualized kernel on KVM
 setup_percpu: NR_CPUS:4 nr_cpumask_bits:4 nr_cpu_ids:2 nr_node_ids:1
-PERCPU: Embedded 475 pages/cpu @ffff88001f800000 s1916544 r8192 d20864 u2097152
+PERCPU: Embedded 475 pages/cpu @ffff88001f800000 s1915904 r8192 d21504 u2097152
 KVM setup async PF for cpu 0
 kvm-stealtime: cpu 0, msr 1f80cbc0
 Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 128873
@@ -71,7 +71,7 @@
 xsave: enabled xstate_bv 0x7, cntxt size 0x340
 AGP: Checking aperture...
 AGP: No AGP bridge found
-Memory: 483812K/523768K available (4029K kernel code, 727K rwdata, 2184K rodata, 2872K init, 14172K bss, 39956K reserved)
+Memory: 483812K/523768K available (4028K kernel code, 727K rwdata, 2184K rodata, 2872K init, 14172K bss, 39956K reserved)
 SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
 Preemptible hierarchical RCU implementation.
 	     RCU debugfs-based tracing is enabled.
@@ -106,7 +106,7 @@
 Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
 Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0
 debug: unmapping init [mem 0xffffffff81b85000-0xffffffff81b87fff]
-ftrace: allocating 17857 entries in 70 pages
+ftrace: allocating 17856 entries in 70 pages
 ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
 smpboot: CPU0: Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (fam: 06, model: 2a, stepping: 07)
 Performance Events: 16-deep LBR, SandyBridge events, Intel PMU driver.
@@ -138,4 +138,207 @@
 mtrr: your CPUs had inconsistent MTRRdefType settings
 mtrr: probably your BIOS does not setup all CPUs.
 mtrr: corrected configuration.
+ACPI: Added _OSI(Module Device)
+ACPI: Added _OSI(Processor Device)
+ACPI: Added _OSI(3.0 _SCP Extensions)
+ACPI: Added _OSI(Processor Aggregator Device)
+ACPI: Interpreter enabled
+ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S1_] (20140724/hwxface-580)
+ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S2_] (20140724/hwxface-580)
+ACPI: (supports S0 S3 S4 S5)
+ACPI: Using IOAPIC for interrupt routing
+PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
+ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
+acpi PNP0A03:00: _OSC: OS supports [Segments MSI]
+acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM

<followed by more bootup messages>
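
As an alternative to recompiling without timing info, the timestamps can
be stripped from two saved logs before diffing them.  This is a minimal
sketch using Python's difflib; the function names are hypothetical and
the timestamp pattern assumes the usual "[    0.000000] " printk prefix.

```python
import difflib
import re

# Matches a leading printk timestamp such as "[    0.000000] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(lines):
    """Remove the leading printk timestamp from each log line, if present."""
    return [TIMESTAMP.sub("", line) for line in lines]


def diff_boot_logs(bad_lines, good_lines):
    """Unified diff of two boot logs with timestamps removed."""
    return list(difflib.unified_diff(
        strip_timestamps(bad_lines),
        strip_timestamps(good_lines),
        fromfile="rcu-bad", tofile="rcu-good", lineterm=""))


if __name__ == "__main__":
    # Illustrative lines only; the timestamps here are made up.
    bad = ["[    0.101000] mtrr: corrected configuration."]
    good = ["[    0.099000] mtrr: corrected configuration.",
            "[    0.120000] ACPI: Interpreter enabled"]
    print("\n".join(diff_boot_logs(bad, good)))
```

With the timestamps gone, lines that differ only in timing no longer show
up, so the diff highlights where the failed boot stops producing output.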

> The following is what I get (and what I would expect) with that setup:
> 
> 	[    0.000000] Hierarchical RCU implementation.
> 	[    0.000000]  RCU debugfs-based tracing is enabled.
> 	[    0.000000]  RCU lockdep checking is enabled.
> 	[    0.000000]  Additional per-CPU info printed with stalls.
> 	[    0.000000]  RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=2.
> 	[    0.000000]  Offload RCU callbacks from all CPUs
> 	[    0.000000]  Offload RCU callbacks from CPUs: 0-1.
> 	[    0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
> 	[    0.000000] NO_HZ: Full dynticks CPUs: 1-3.
> 
> So whatever did you do with CPU 1?  ;-)

Dunno; let's use the current logs here.

> Of course, if I tell qemu "--smp 1" instead of "--smp 2", then RCU thinks
> that there is only one CPU:
> 
> 	[    0.000000] Hierarchical RCU implementation.
> 	[    0.000000]  RCU debugfs-based tracing is enabled.
> 	[    0.000000]  RCU lockdep checking is enabled.
> 	[    0.000000]  Additional per-CPU info printed with stalls.
> 	[    0.000000]  RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=1.
> 	[    0.000000]  Offload RCU callbacks from all CPUs
> 	[    0.000000]  Offload RCU callbacks from CPUs: 0.
> 	[    0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
> 	[    0.000000] NO_HZ: Full dynticks CPUs: 1-3.
> 
> But it still works fine for me.
> 
> > Using qemu.git.
> > 
> > Rik suggested collecting qemu stack traces, here they are:
> 
> And they do look like the system is waiting.
> 
> You do have a warning below.
> 
> [    0.000000] WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:136 __early_ioremap+0xf5/0x1c4()
> 
> Not sure if this is related, but it might be good to fix this one anyway.
> 
> 							Thanx, Paul




		Amit

View attachment "rcu-good-notime.txt" of type "text/plain" (17011 bytes)

View attachment "rcu-bad-notime.txt" of type "text/plain" (7034 bytes)