Message-ID: <20240609222029.yebhborjptp3gr6a@airbuntu>
Date: Sun, 9 Jun 2024 23:20:29 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Christian Loehle <christian.loehle@....com>,
Hongyan Xia <hongyan.xia2@....com>,
John Stultz <jstultz@...gle.com>, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5] sched: Consolidate cpufreq updates
On 06/05/24 14:22, Vincent Guittot wrote:
> Hi Qais,
>
> On Sun, 2 Jun 2024 at 00:40, Qais Yousef <qyousef@...alina.io> wrote:
> >
> > On 05/30/24 11:46, Qais Yousef wrote:
> >
> > > +static __always_inline void
> > > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> > > +{
> >
> > I found a problem here. We should check if prev was sugov task. I hit a
> > corner case where we were constantly switching between RT task and sugov.
> >
> > if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) {
> > /* Sugov just did an update, don't be too aggressive */
> > return;
> > }
> >
>
> I reran my test with this v5 and the fix above, but the problem is
> still there: it waits for the next tick to update the frequency even
> though the cpu was idle.
Hurmph. Sorry, I forgot to rerun this test. I broke it again with this
optimization :( Maybe I can replace it with an explicit check for
util_avg == SCHED_CAPACITY_SCALE, though that is not generic enough..
(a rough sketch of what I mean follows the revert below)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d8d569cdb6a..d64d47b4471a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4702,7 +4702,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 now = cfs_rq_clock_pelt(cfs_rq);
-	unsigned long prev_util_avg = cfs_rq->avg.util_avg;
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -4736,16 +4735,6 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	} else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
 		update_tg_load_avg(cfs_rq);
 	}
-
-	/*
-	 * This field is used to indicate whether a trigger of cpufreq update
-	 * is required. When the CPU is saturated, other load signals could
-	 * still be changing, but util_avg would have settled down, so ensure
-	 * that we don't trigger unnecessary updates as from fair policy point
-	 * of view, nothing has changed to cause a cpufreq update.
-	 */
-	if (cfs_rq->decayed && prev_util_avg == cfs_rq->avg.util_avg)
-		cfs_rq->decayed = false;
 }
 
 /*
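
To illustrate what I mean by the explicit check (just a rough sketch on top
of the above, not a tested patch; the helper name is made up):

static inline bool cfs_rq_util_saturated(struct cfs_rq *cfs_rq)
{
	/* Once util_avg has saturated the CPU capacity it can't move further */
	return cfs_rq->avg.util_avg >= SCHED_CAPACITY_SCALE;
}

and then in update_load_avg(), instead of comparing against prev_util_avg:

	if (cfs_rq->decayed && cfs_rq_util_saturated(cfs_rq))
		cfs_rq->decayed = false;

That would only suppress the update when the CPU is actually saturated,
rather than whenever util_avg happens to not change between two calls, but
as said above it hardcodes knowledge about capacity that is not generic
enough.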
>
> Also continuing here the discussion started on v2:
>
> I agree that in the current implementation we are probably calling
> cpufreq_update way too much; we can optimize some sequences, and using
> the context switch is a good way to get better sampling, but this is
> not enough and we still need to call cpufreq_update in some other
> cases involving enqueue. The delay of waiting for the next tick is not
Do you have any suggestions? I'm not sure how to classify the different
types of enqueue events where some would need an update and others wouldn't.
I think the cases that matter are wakeup preemption that doesn't cause a
context switch AND maybe a large change in util_avg?
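
Roughly the kind of check I have in mind (hypothetical helper and threshold,
purely for illustration):

static inline bool enqueue_needs_freq_update(struct rq *rq,
					     unsigned long old_util,
					     unsigned long new_util)
{
	/*
	 * If the wakeup already set need_resched, the upcoming context
	 * switch will handle the cpufreq update anyway.
	 */
	if (test_tsk_need_resched(rq->curr))
		return false;

	/* Otherwise only poke cpufreq when util moved by "a lot". */
	return abs_diff(new_util, old_util) > (SCHED_CAPACITY_SCALE >> 4);
}

Where to sample old_util/new_util and what counts as "a lot" (the >> 4
above is arbitrary) is exactly the part I don't know how to make generic.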
> acceptable nor sustainable especially with 250 and lower HZ but I'm
I think it is fine for 250HZ. I have been testing with this and didn't see
issues, but wider testing could yield different results.
> pretty sure it would be the same for some systems using 1000HZ. IIUC,
> new HW is becoming much more efficient at updating the frequency, so it
> would not be a problem for such systems to update performance more
> frequently, especially when it ends up being as simple as writing a
> value to a memory region without waiting for it to be applied (like the
> cpufreq fast_switch methods). All this to say that always/only waiting
> for a context switch or tick might be suitable for your case, but it
> doesn't look like the right solution for all devices and systems.
I just don't want us to end up with a probabilistic approach. I am fine with
more updates, but we need to be more intentional/specific about when they're
truly needed.