linux-kernel - Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20231229002529.sidy6wxmclhzlzib@airbuntu>
Date: Fri, 29 Dec 2023 00:25:29 +0000
From: Qais Yousef <qyousef@...alina.io>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
	"Rafael J. Wysocki" <rafael@...nel.org>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
	Lukasz Luba <lukasz.luba@....com>, Wei Wang <wvw@...gle.com>,
	Rick Yiu <rickyiu@...gle.com>, Chung-Kai Mei <chungkai@...gle.com>,
	Hongyan Xia <hongyan.xia2@....com>
Subject: Re: [PATCH 1/4] sched/fair: Be less aggressive in calling
 cpufreq_update_util()

On 12/12/23 12:40, Qais Yousef wrote:
> On 12/12/23 12:06, Vincent Guittot wrote:
> 
> > > @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  enqueue_throttle:
> > >         assert_list_leaf_cfs_rq(rq);
> > >
> > 
> > Here and in the other places below,  you lose :
> > 
> >  -       } else if (decayed) {
> > 
> > The decayed condition ensures a rate limit (~1ms) in the number of
> > calls to cpufreq_update_util.
> > 
> > enqueue/dequeue/tick don't create any sudden change in the PELT
> > signals that would require to update cpufreq of the change unlike
> > attach/detach
> 
> Okay, thanks for the clue. Let me rethink this again.

Thinking more about this. Do we really need to send freq updates at
enqueue/attach etc?

I did an experiment to remove all the updates except in three places:

1. Context switch (done unconditionally)
2. Tick
2. update_blocked_averages()

I tried the below patch on mac mini with m1 SoC, 6.6 kernel; speedometers
scores were the same (against this series).

I ran perf bench sched pipe to see if the addition in context switch will make
things worse

before (this series applied):

	# Running 'sched/pipe' benchmark:
	# Executed 1000000 pipe operations between two processes

	     Total time: 20.505 [sec]

	      20.505628 usecs/op
		  48767 ops/sec

after (proposed patch below applied on top):

	# Running 'sched/pipe' benchmark:
	# Executed 1000000 pipe operations between two processes

	     Total time: 19.475 [sec]

	      19.475838 usecs/op
		  51345 ops/sec

I tried against vanilla kernel (without any patches, including some dependency
backports) using schedutil governor

	# Running 'sched/pipe' benchmark:
	# Executed 1000000 pipe operations between two processes

	     Total time: 22.428 [sec]

	      22.428166 usecs/op
		  44586 ops/sec

(I got higher run to run variance against this kernel)

So things got better. I think we can actually unify all updates to be at
context switch and tick for all classes.

The one hole is for long running CFS tasks without context switch; no updates
until tick this means the dvfs headroom needs to cater for that as worst case
scenario now. I think this is the behavior today actually; but accidental
updates due to enqueue/dequeue could have helped to issue more updates. If none
of that happens, then updating load_avg at entity_tick() is what would have
triggered a frequency update.

I'm not sure if the blocked average one is required to be honest. But feared
that when the cpu goes to idle we might miss reducing frequencies vote for that
CPU - which matters on shared cpufreq policies.

I haven't done comprehensive testing of course. But would love to hear any
thoughts on how we can be more strategic and reduce the number of cpufreq
updates we send. And iowait boost state needs to be verified.

While testing this series more I did notice that short kworker context switches
still can cause the frequency to go high. I traced this back to
__balance_callbacks() in finish_lock_switch() after calling
uclamp_context_switch(). So I think we do have a problem of updates being
'accidental' and we need to improve the interaction with the governor to be
more intentional keeping in mind all the constraints we have today in h/w and
software.

--->8---


>From 6deed09be1d075afa3858ca62dd54826cdb60d44 Mon Sep 17 00:00:00 2001
From: Qais Yousef <qyousef@...alina.io>
Date: Tue, 26 Dec 2023 01:23:57 +0000
Subject: [PATCH] sched/fair: Update freq on tick and context switch and
 blocked avgs

Signed-off-by: Qais Yousef (Google) <qyousef@...alina.io>
---
 kernel/sched/cpufreq_schedutil.c |  3 ---
 kernel/sched/fair.c              | 13 +------------
 kernel/sched/sched.h             | 15 +--------------
 3 files changed, 2 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c0879a985097..553a3d7f02d8 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -166,9 +166,6 @@ static inline bool ignore_short_tasks(int cpu,
 	struct task_struct *p = cpu_rq(cpu)->curr;
 	unsigned long task_util;
 
-	if (!(flags & SCHED_CPUFREQ_PERF_HINTS))
-		return false;
-
 	if (!fair_policy(p->policy))
 		return false;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d63eae534cec..3a30f78b37d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5717,8 +5717,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	sub_nr_running(rq, task_delta);
 
 done:
-	cpufreq_update_util(rq, 0);
-
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -5811,8 +5809,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
-	cpufreq_update_util(rq, 0);
-
 	/* Determine whether we need to wake up potentially idle CPU: */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
 		resched_curr(rq);
@@ -6675,8 +6671,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
-	cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
-
 	hrtick_update(rq);
 }
 
@@ -6754,7 +6748,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
-	cpufreq_update_util(rq, 0);
 	hrtick_update(rq);
 }
 
@@ -8338,7 +8331,6 @@ done: __maybe_unused;
 
 	update_misfit_status(p, rq);
 	sched_fair_update_stop_tick(rq, p);
-	cpufreq_update_util(rq, 0);
 
 	return p;
 
@@ -12460,7 +12452,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
-	cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
+	cpufreq_update_util(rq, current->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
 
 	task_tick_core(rq, curr);
 }
@@ -12585,7 +12577,6 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	struct sched_entity *se = &p->se;
 
 	detach_entity_cfs_rq(se);
-	cpufreq_update_util(task_rq(p), 0);
 }
 
 static void attach_task_cfs_rq(struct task_struct *p)
@@ -12593,7 +12584,6 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	struct sched_entity *se = &p->se;
 
 	attach_entity_cfs_rq(se);
-	cpufreq_update_util(task_rq(p), 0);
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
@@ -12839,7 +12829,6 @@ static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
 			update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
 			update_cfs_group(se);
 		}
-		cpufreq_update_util(rq, 0);
 		rq_unlock_irqrestore(rq, &rf);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 516187ea2b81..e1622e2b82be 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3076,20 +3076,7 @@ static inline bool uclamp_rq_is_capped(struct rq *rq)
 /* Request freq update on context switch if necessary */
 static inline void uclamp_context_switch(struct rq *rq)
 {
-	unsigned long uclamp_min;
-	unsigned long uclamp_max;
-	unsigned long util;
-
-	/* Only RT and FAIR tasks are aware of uclamp */
-	if (!rt_policy(current->policy) && !fair_policy(current->policy))
-		return;
-
-	uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
-	uclamp_max = uclamp_eff_value(current, UCLAMP_MAX);
-	util = rq->cfs.avg.util_avg;
-
-	if (uclamp_min > util || uclamp_max < util)
-		cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
+	cpufreq_update_util(rq, current->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
 }
 #else /* CONFIG_UCLAMP_TASK */
 static inline unsigned long uclamp_eff_value(struct task_struct *p,
-- 
2.40.1