linux-kernel - Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZSORUczD6tp2fbe7@chenyu5-mobl2.ccr.corp.intel.com>
Date:   Mon, 9 Oct 2023 13:36:17 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        "Mel Gorman" <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Swapnil Sapkal <Swapnil.Sapkal@....com>,
        Aaron Lu <aaron.lu@...el.com>, Tim Chen <tim.c.chen@...el.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>, <x86@...nel.org>
Subject: Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost
 idle prev CPU

On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote:
> On 9/30/23 03:11, Chen Yu wrote:
> > Hi Mathieu,
> > 
> > On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote:
> > > Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
> > > select_task_rq towards the previous CPU if it was almost idle
> > > (avg_load <= 0.1%).
> > 
> > Yes, this is a promising direction IMO. One question is that,
> > can cfs_rq->avg.load_avg be used for percentage comparison?
> > If I understand correctly, load_avg reflects that more than
> > 1 tasks could have been running this runqueue, and the
> > load_avg is the direct proportion to the load_weight of that
> > cfs_rq. Besides, LOAD_AVG_MAX seems to not be the max value
> > that load_avg can reach, it is the sum of
> > 1024 * (y + y^1 + y^2 ... )
> > 
> > For example,
> > taskset -c 1 nice -n -20 stress -c 1
> > cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg"
> >    .load_avg                      : 88763
> >    .load_avg                      : 1024
> > 
> > 88763 is higher than LOAD_AVG_MAX=47742
> 
> I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow,
> but it appears that it does not happen in practice.
> 
> That being said, if the cutoff is really at 0.1% or 0.2% of the real max,
> does it really matter ?
> 
> > Maybe the util_avg can be used for precentage comparison I suppose?
> [...]
> > Or
> > return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ?
> 
> Unfortunately using util_avg does not seem to work based on my testing.
> Even at utilization thresholds at 0.1%, 1% and 10%.
> 
> Based on comments in fair.c:
> 
>  * CPU utilization is the sum of running time of runnable tasks plus the
>  * recent utilization of currently non-runnable tasks on that CPU.
> 
> I think we don't want to include currently non-runnable tasks in the
> statistics we use, because we are trying to figure out if the cpu is a
> idle-enough target based on the tasks which are currently running, for the
> purpose of runqueue selection when waking up a task which is considered at
> that point in time a non-runnable task on that cpu, and which is about to
> become runnable again.
> 
>

Based on the discussion, another effort to inhit task migration is to make
WA_BIAS prefers previous CPU rather than the current CPU. However it did not
show much difference with/without this change applied. I think this is because
although wake_affine_weight() chooses the previous CPU, in select_idle_sibling()
it would still prefer the current CPU to the previous CPU if no idle CPU is detected.
Based on this I did the following changes in select_idle_sibling():

1. When the system is underloaded, change the sequence of idle CPU checking.
   If both the target and previous CPU are idle, choose previous CPU first.

2. When the system is overloaded, and all CPUs are busy, choose the previous
   CPU over the target CPU.

hackbench -g 16 -f 20 -l 480000 -s 100

Before the patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 81.076

After the patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 77.527

track the task migration count in 10 seconds:
kretfunc:select_task_rq_fair
{
        $p = (struct task_struct *)args->p;
        if ($p->comm == "hackbench") {
                if ($p->thread_info.cpu == retval) {
                        @wakeup_prev = count();
                } else if (retval == cpu) {
                        @wakeup_curr = count();
                } else {
                        @wakeup_migrate = count();
                }
        }
}

Before the patch:
@wakeup_prev: 8369160
@wakeup_curr: 3624480
@wakeup_migrate: 523936

After the patch
@wakeup_prev: 15465952
@wakeup_curr: 214540
@wakeup_migrate: 65020

The percentage of wakeup on previous CPU has been increased from
8369160 / (8369160 + 3624480 + 523936) = 66.85% to
15465952 / (15465952 + 214540 + 65020) = 98.22%.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e2a69af8be36..9131cb359723 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7264,18 +7264,20 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
-	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
-	    asym_fits_cpu(task_util, util_min, util_max, target))
-		return target;
-
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
+	 * The previous CPU is checked prio to the target CPU to inhibit
+	 * costly task migration.
 	 */
 	if (prev != target && cpus_share_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
 
+	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, target))
+		return target;
+
 	/*
 	 * Allow a per-cpu kthread to stack with the wakee if the
 	 * kworker thread and the tasks previous CPUs are the same.
@@ -7342,6 +7344,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	 /* if all CPUs are busy, prefer previous CPU to inhibit migration */
+	if (prev != target && cpus_share_cache(prev, target))
+		return prev;
+
 	return target;
 }
 
-- 
2.25.1