Date:	Fri, 18 Jul 2014 12:16:33 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Bruno Wolff III <bruno@...ff.to>
Cc:	Dietmar Eggemann <dietmar.eggemann@....com>,
	Josh Boyer <jwboyer@...hat.com>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Fri, Jul 18, 2014 at 12:34:49AM -0500, Bruno Wolff III wrote:
> On Thu, Jul 17, 2014 at 14:35:02 +0200,
>  Peter Zijlstra <peterz@...radead.org> wrote:
> >
> >In any case, can someone who can trigger this run with the below; it's
> >'clean' for me, but supposedly you'll trigger a FAIL somewhere.
> 
> I got a couple of fail messages.
> 
> dmesg output is available in the bug as the following attachment:
> https://bugzilla.kernel.org/attachment.cgi?id=143361

Thanks!

[    0.252059] __sdt_alloc: allocated f255b020 with cpus: 
[    0.252147] __sdt_alloc: allocated f255b0e0 with cpus: 
[    0.252229] __sdt_alloc: allocated f255b120 with cpus: 
[    0.252311] __sdt_alloc: allocated f255b160 with cpus: 

[    0.252395] __sdt_alloc: allocated f255b1a0 with cpus: 
[    0.252477] __sdt_alloc: allocated f255b1e0 with cpus: 
[    0.252559] __sdt_alloc: allocated f255b220 with cpus: 
[    0.252641] __sdt_alloc: allocated f255b260 with cpus: 

[    0.253013] __sdt_alloc: allocated f255b2a0 with cpus: 
[    0.253097] __sdt_alloc: allocated f255b2e0 with cpus: 
[    0.253184] __sdt_alloc: allocated f255b320 with cpus: 
[    0.253265] __sdt_alloc: allocated f255b360 with cpus: 

[    0.253354] build_sched_groups: got group f255b020 with cpus: 
[    0.253436] build_sched_groups: got group f255b120 with cpus: 
[    0.253519] build_sched_groups: got group f255b1a0 with cpus: 
[    0.253600] build_sched_groups: got group f255b2a0 with cpus: 
[    0.253681] build_sched_groups: got group f255b2e0 with cpus: 

[    0.253762] build_sched_groups: got group f255b320 with cpus: 
[    0.253843] build_sched_groups: got group f255b360 with cpus: 
[    0.254004] build_sched_groups: got group f255b0e0 with cpus: 
[    0.254087] build_sched_groups: got group f255b160 with cpus: 
[    0.254170] build_sched_groups: got group f255b1e0 with cpus: 
[    0.254252] build_sched_groups: FAIL
[    0.254331] build_sched_groups: got group f255b1a0 with cpus: 0
[    0.255004] build_sched_groups: FAIL
[    0.255084] build_sched_groups: got group f255b1e0 with cpus: 1

So from previous msgs we know:

	CPU0	CPU1	CPU2	CPU3

D0	*		*		SMT
		*		*

D2	*	*	*	*	DIE


This gives us (from __sdt_alloc):

	020	0e0	120	160	SMT
	1a0	1e0	220	260	MC
	2a0	2e0	320	360	DIE
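
(That grouping falls out of the allocation order: __sdt_alloc() walks the
topology levels in its outer loop and the cpus in its inner loop. A
condensed sketch, not the exact code:

	for_each_sd_topology(tl) {
		struct sd_data *sdd = &tl->data;
		int j;

		for_each_cpu(j, cpu_map) {
			struct sched_group *sg;

			/* one group per cpu per level, cpumask tacked on */
			sg = kzalloc_node(sizeof(struct sched_group) +
					  cpumask_size(), GFP_KERNEL,
					  cpu_to_node(j));
			*per_cpu_ptr(sdd->sg, j) = sg;
		}
	}

Hence the three rows of four pointers above: SMT, MC, DIE.)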

Given that you have a DIE domain and MC is found degenerate, I'll
conclude that you do not have the shared L3 that is possible for your
machine, and only have the dual socket with 2 threads per socket.

So the domains _should_ look like:

D0	0,2	1,3	0,2	1,3
D1	0,2	1,3	0,2	1,3
D2	0,1,2,3 0,1,2,3	0,1,2,3	0,1,2,3
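
Each of those spans is just the machine's cpu_map intersected with the
topology level's mask for that cpu, i.e. this line from
build_sched_domain() (also visible in the patch below):

	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));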

Assuming that, and given that build_sched_groups() gets called for each
cpu, for each domain, we get:

D0g	020(0)		120(2)
D1g	1a0(0,2)
D2g	2a0(0,2)

So far so good. At this point we're in build_sched_groups() with @cpu=0,
@span=0-3, @covered=0,2 and @i=0, and we're just about to start the loop
iteration for @i=1.
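
For reference, the loop we're stepping through looks roughly like this
(condensed from the build_sched_groups() of this vintage; the
sched_group_capacity and list-linking bits are omitted):

	cpumask_clear(covered);

	for_each_cpu(i, span) {
		struct sched_group *sg;
		int group, j;

		if (cpumask_test_cpu(i, covered))
			continue;

		group = get_group(i, sdd, &sg);
		cpumask_setall(sched_group_mask(sg));

		for_each_cpu(j, span) {
			/* pull in every cpu owned by the same group */
			if (get_group(j, sdd, NULL) != group)
				continue;

			cpumask_set_cpu(j, sched_group_cpus(sg));
			cpumask_set_cpu(j, covered);
		}
	}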

	1 is not set in covered

	get_group(i=1, sdd, &sg)
	  @sd = *per_cpu_ptr(sdd->sd, 1); /* should be D2 for CPU1 */
	  @child = sd->child; /* should be D1 for CPU1: 1,3 */
	  @cpu = 1
	  @sg = *per_cpu_ptr(sdd->sg, 1); /* should be: 2e0 */

But instead we get 320 !?

The 2e0 group would cover 1,3, thereby increasing @covered to 0-3, and
we'd be done for CPU0. Instead things go on to return 360; more WTF!

So it looks like the actual domain tree is broken, and not what we
assumed it was.
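
To see why that follows: get_group() keys the returned group off the
first cpu of the child domain's span, roughly like so (a sketch from
memory of the code of that era, sgc handling omitted):

	static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
	{
		struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
		struct sched_domain *child = sd->child;

		/* the group is owned by the first cpu of the child's span */
		if (child)
			cpu = cpumask_first(sched_domain_span(child));

		if (sg)
			*sg = *per_cpu_ptr(sdd->sg, cpu);

		return cpu;
	}

Getting 320 (CPU2's DIE level group) means cpumask_first() of CPU1's
child span returned 2, so that child span cannot have been the 1,3 we
assumed.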

Could I bother you to run with the below instead? It should also print
out the sched domain masks so we don't need to guess about them.

(make sure you have CONFIG_SCHED_DEBUG=y, otherwise it will not build;
tl->name only exists in that configuration)

> I also booted with early printk=keepsched_debug as requested by Dietmar.

Can you make that: sched_debug ?

---
 kernel/sched/core.c | 22 ++++++++++++++++++++++
 lib/vsprintf.c      |  5 +++++
 2 files changed, 27 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599dc4aa4..4babcbbc11b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5857,6 +5857,17 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;
 
 		group = get_group(i, sdd, &sg);
+
+		if (!cpumask_empty(sched_group_cpus(sg)))
+			printk("%s: FAIL\n", __func__);
+
+		printk("%s: got group %p with cpus: %pc\n",
+				__func__,
+				sg,
+				sched_group_cpus(sg));
+
+		cpumask_clear(sched_group_cpus(sg));
+
 		cpumask_setall(sched_group_mask(sg));
 
 		for_each_cpu(j, span) {
@@ -6418,6 +6429,11 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			if (!sg)
 				return -ENOMEM;
 
+			printk("%s: allocated %p with cpus: %pc\n",
+					__func__,
+					sg,
+					sched_group_cpus(sg));
+
 			sg->next = sg;
 
 			*per_cpu_ptr(sdd->sg, j) = sg;
@@ -6474,6 +6490,12 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 	if (!sd)
 		return child;
 
+	printk("%s: cpu: %d level: %s cpu_map: %pc tl->mask: %pc\n",
+			__func__,
+			cpu, tl->name,
+			cpu_map,
+			tl->mask(cpu));
+
 	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {
 		sd->level = child->level + 1;
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 6fe2c84eb055..ac22c46fd6d0 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -28,6 +28,7 @@
 #include <linux/ioport.h>
 #include <linux/dcache.h>
 #include <linux/cred.h>
+#include <linux/cpumask.h>
 #include <net/addrconf.h>
 
 #include <asm/page.h>		/* for PAGE_SIZE */
@@ -1250,6 +1251,7 @@ int kptr_restrict __read_mostly;
  *           (default assumed to be phys_addr_t, passed by reference)
  * - 'd[234]' For a dentry name (optionally 2-4 last components)
  * - 'D[234]' Same as 'd' but for a struct file
+ * - 'c' For a cpumask list
  *
  * Note: The difference between 'S' and 'F' is that on ia64 and ppc64
  * function pointers are really function descriptors, which contain a
@@ -1389,6 +1391,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
 		return dentry_name(buf, end,
 				   ((const struct file *)ptr)->f_path.dentry,
 				   spec, fmt);
+	case 'c':
+		return buf + cpulist_scnprintf(buf, end - buf, ptr);
 	}
 	spec.flags |= SMALL;
 	if (spec.field_width == -1) {
@@ -1635,6 +1639,7 @@ int format_decode(const char *fmt, struct printf_spec *spec)
  *   case.
  * %*ph[CDN] a variable-length hex string with a separator (supports up to 64
  *           bytes of the input)
+ * %pc print a cpumask as comma-separated list
  * %n is ignored
  *
  * ** Please update Documentation/printk-formats.txt when making changes **
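
(As an aside: with the vsprintf bits applied, any cpumask can be printed
directly. A hypothetical example, not part of the patch:

	printk("span: %pc\n", sched_domain_span(sd));

which prints the span as a comma-separated cpu list.)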
