Message-ID: <20220712215259.6cb28bed@gandalf.local.home>
Date:   Tue, 12 Jul 2022 21:52:59 -0400
From:   Steven Rostedt <rostedt@...dmis.org>
To:     LKML <linux-kernel@...r.kernel.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: schedstat false counting of domain load_balance() tried to move one
 or more tasks failed

I've been tasked with analyzing the /proc/schedstat file to determine
which metrics are appropriate to monitor in production. So I'm looking at
both the documentation and the code that generates it.

From the documentation at https://docs.kernel.org/scheduler/sched-stats.html

(and Documentation/scheduler/sched-stats.rst for those of you that are
allergic to html)

   Domain statistics
   -----------------
   One of these is produced per domain for each cpu described. (Note that if
   CONFIG_SMP is not defined, *no* domains are utilized and these lines
   will not appear in the output.)

   domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

   The first field is a bit mask indicating what cpus this domain operates over.

   The next 24 are a variety of load_balance() statistics grouped into
   types of idleness (idle, busy, and newly idle):

       1)  # of times in this domain load_balance() was called when the
           cpu was idle
       2)  # of times in this domain load_balance() checked but found
           the load did not require balancing when the cpu was idle
       3)  # of times in this domain load_balance() tried to move one or
           more tasks and failed, when the cpu was idle
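For reference, those counters can be pulled out of /proc/schedstat with a
short script. This is just a sketch; the field offsets assume the 36-field
domain-line layout quoted above (lb_failed at fields 3, 11, and 19), and
parse_lb_failed() is a name I made up:

```python
def parse_lb_failed(text):
    """Return {(cpu, domain): {'idle': n, 'busy': n, 'newidle': n}}
    from the contents of /proc/schedstat."""
    stats = {}
    cpu = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith('cpu'):
            cpu = parts[0]
        elif parts[0].startswith('domain') and cpu is not None:
            # Skip the domain name and cpumask, then index the
            # 36 counters (0-based, so field N is fields[N-1]).
            fields = [int(x) for x in parts[2:]]
            stats[(cpu, parts[0])] = {
                'idle':    fields[2],    # field 3
                'busy':    fields[10],   # field 11
                'newidle': fields[18],   # field 19
            }
    return stats
```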

I was looking at this #3 (which is also #11 and #19 for CPU_BUSY and
CPU_NEWLY_IDLE respectively). The documentation states that the counter is
incremented when load_balance() tried to move one or more tasks and
failed. I found this is not always the case.

We have:

	ld_moved = 0;
	/* Clear this flag as soon as we find a pullable task */
	env.flags |= LBF_ALL_PINNED;
	if (busiest->nr_running > 1) {
[..]
	}

	if (!ld_moved) {
		schedstat_inc(sd->lb_failed[idle]);

Where the lb_failed[] is that counter. I added the following code:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77b2048a9326..4835ea4d9d01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9865,6 +9865,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	struct rq *busiest;
 	struct rq_flags rf;
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+	bool redo = false;
 
 	struct lb_env env = {
 		.sd		= sd,
@@ -10012,11 +10013,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = sched_nr_migrate_break;
+				redo = true;
 				goto redo;
 			}
 			goto out_all_pinned;
 		}
-	}
+	} else if (!redo)
+		trace_printk("Did not try to move! %d\n", idle);
 
 	if (!ld_moved) {
 		schedstat_inc(sd->lb_failed[idle]);


And sure enough, that triggers on CPU_IDLE and CPU_NEWLY_IDLE calls (I
haven't seen it for CPU_BUSY yet, but I didn't specifically try to
trigger it).

Thus, if we get to that (busiest->nr_running > 1) check and it fails, we
never attempt to move any tasks, yet we still increment that counter,
falsely reporting a failed migration attempt.
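To make the accounting flow concrete, here's a toy model (plain Python,
obviously not the kernel code) of the logic quoted above: lb_failed is
bumped whenever ld_moved ends up zero, whether or not a migration was
actually attempted:

```python
# Toy model of the load_balance() accounting path described above.
def balance_once(busiest_nr_running, stats):
    ld_moved = 0
    attempted = False
    if busiest_nr_running > 1:
        attempted = True
        # detach_tasks()/attach_tasks() would happen here; for this
        # model, assume the moves all failed (ld_moved stays 0).
    if not ld_moved:
        stats['lb_failed'] += 1   # counted as "tried and failed"
    return attempted

stats = {'lb_failed': 0}
# busiest queue has a single task: nothing is even attempted,
# yet lb_failed still increments.
tried = balance_once(1, stats)
```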

Do we care? Should it be fixed? Should it be documented?

Thoughts?

-- Steve
