Date: Sun, 9 Jun 2024 23:33:46 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Christian Loehle <christian.loehle@....com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Daniel Bristot de Oliveira <bristot@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Hongyan Xia <hongyan.xia2@....com>,
	John Stultz <jstultz@...gle.com>, linux-pm@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5] sched: Consolidate cpufreq updates

On 06/05/24 13:24, Christian Loehle wrote:
> On 5/30/24 11:46, Qais Yousef wrote:
> > Improve the interaction with cpufreq governors by making the
> > cpufreq_update_util() calls more intentional.
> > 
> > At the moment we send them when load is updated for CFS, bandwidth for
> > DL and at enqueue/dequeue for RT. But this can lead to too many updates
> > sent in a short period of time and potentially be ignored at a critical
> > moment due to the rate_limit_us in schedutil.
> > 
> > For example, consider a simultaneous task enqueue on the same CPU where
> > the 2nd task is bigger and requires a higher freq. The trigger to
> > cpufreq_update_util() by the first task will lead to the 2nd request
> > being dropped until the tick. Or another CPU in the same policy triggers
> > a freq update shortly after.
> > 
> > Updates at enqueue for RT are not strictly required, though they do
> > help to reduce the delay in switching the frequency and the potential
> > observation of a lower frequency during this delay. But the current
> > logic doesn't intentionally (at least to my understanding) try to speed
> > up the request.
> > 
> > To help reduce the amount of cpufreq updates and make them more
> > purposeful, consolidate them into these locations:
> > 
> > 1. context_switch()
> > 2. task_tick_fair()
> > 3. update_blocked_averages()
> > 4. on syscall that changes policy or uclamp values
> > 
> > The update at context switch should help guarantee that DL and RT get
> > the right frequency straightaway when they're RUNNING. As mentioned,
> > though, the update will happen slightly after enqueue_task(); in an
> > ideal world these tasks should be RUNNING ASAP and this additional
> > delay should be negligible.
> 
> Do we care at all about PREEMPT_NONE (and voluntary) here? I assume no.
> Anyway one scenario that should regress when we don't update at RT enqueue:
> (Essentially means that util of higher prio dominates over lower, if
> higher is enqueued first.)
> System:
> OPP 0, cap: 102, 100MHz; OPP 1, cap: 1024, 1000MHz
> RT task A prio=0 runtime@...1=1ms, uclamp_min=0; RT task B prio=1 runtime@...1=1ms, uclamp_min=1024
> rate_limit_us = freq transition delay = 1 (assume basically instant switch)
> Let's say CONFIG_HZ=100 for the tick to not get in the way, doesn't really matter.
> 
> Before:
> t+0:		Enqueue task A switch to OPP0
> Running A at OPP 0
> t+2us:		Enqueue task B switch to OPP1
> t+1000us:	Task A done, switch to task B.
> t+2000us:	Task B done
> 
> Now:
> t+0:		Enqueue task A switch to OPP0
> Running A at OPP 0
> t+2us:		Enqueue task B
> t+10000us:	Task A done, switch to task B and OPP1
> t+11000us:	Task B done
> 
> Or am I missing something?

I think this is the correct behavior where each task gets to run at the correct
frequency, no?

Generally, if the system is so overloaded that RT tasks of the same priority
end up stuck on the same CPU for that long (ie no other CPU in the system is
able to pull one of the tasks), relying on frequency to save the day is wrong
IMO. Userspace must ensure such busy tasks are not starved with a uclamp_min
of 0 if an overloaded system is a likely scenario. And it needs to manage
priorities correctly so that these busy RT tasks don't hog the CPU if
something else finds this latency unacceptable.

Proper hard RT systems generally disable DVFS, as frequency transitions
introduce unacceptable delays.

Note that with today's code, task B's request is most likely dropped and both
tasks will end up running at OPP0.


Cheers

--
Qais Yousef
