Message-ID: <bfcd8175-c6bc-41a2-8ee9-93061e446c40@arm.com>
Date: Tue, 13 Aug 2024 17:26:07 +0100
From: Christian Loehle <christian.loehle@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Qais Yousef <qyousef@...alina.io>, "Rafael J. Wysocki"
 <rafael@...nel.org>, Viresh Kumar <viresh.kumar@...aro.org>,
 Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
 Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
 <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
 Valentin Schneider <vschneid@...hat.com>, Hongyan Xia
 <hongyan.xia2@....com>, John Stultz <jstultz@...gle.com>,
 linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v7] sched: Consolidate cpufreq updates

On 8/13/24 09:27, Vincent Guittot wrote:
> On Tue, 13 Aug 2024 at 10:25, Vincent Guittot
> <vincent.guittot@...aro.org> wrote:
>>
>> On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@....com> wrote:
>>>
>>> On 7/28/24 19:45, Qais Yousef wrote:
>>>> Improve the interaction with cpufreq governors by making the
>>>> cpufreq_update_util() calls more intentional.
>>>>
>>>> At the moment we send them when load is updated for CFS, bandwidth for
>>>> DL and at enqueue/dequeue for RT. But this can lead to too many updates
>>>> being sent in a short period of time, some of which may then be ignored
>>>> at a critical moment due to the rate_limit_us in schedutil.
>>>>
>>>> For example, on simultaneous task enqueue on the same CPU where the 2nd
>>>> task is bigger and requires a higher freq, the cpufreq_update_util()
>>>> call triggered by the first task will lead to the 2nd request being
>>>> dropped until the tick. The same can happen when another CPU in the
>>>> same policy triggers a freq update shortly after.
>>>>
>>>> Updates at enqueue for RT are not strictly required, though they do help
>>>> to reduce the delay in switching the frequency and the window during
>>>> which a lower frequency can be observed. But the current logic doesn't
>>>> intentionally (at least to my understanding) try to speed up the
>>>> request.
>>>>
>>>> To help reduce the amount of cpufreq updates and make them more
>>>> purposeful, consolidate them into these locations:
>>>>
>>>> 1. context_switch()
>>>> 2. task_tick_fair()
>>>> 3. sched_balance_update_blocked_averages()
>>>> 4. on sched_setscheduler() syscall that changes policy or uclamp values
>>>> 5. on check_preempt_wakeup_fair() if wakeup preemption failed
>>>> 6. on __add_running_bw() to guarantee DL bandwidth requirements.
>>>>
>>>> The update at context switch should help guarantee that RT tasks get the
>>>> right frequency straightaway when they're RUNNING. As mentioned, the
>>>> update will happen slightly after enqueue_task(); but in an ideal world
>>>> these tasks should be RUNNING ASAP and this additional delay should be
>>>> negligible. For fair tasks we need to make sure we send
>>>> a single update for every decay for the root cfs_rq. Any changes to the
>>>> rq will be deferred until the next task is ready to run, or we hit TICK.
>>>> But we are guaranteed the task is running at a level that meets its
>>>> requirements after enqueue.
>>>>
>>>> To guarantee RT and DL tasks updates are never missed, we add a new
>>>> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
>>>> already running at the right freq, the governor will end up doing
>>>> nothing, but we eliminate the risk of the task ending up accidentally
>>>> running at the wrong freq due to rate_limit_us.
>>>>
>>>> Similarly for iowait boost, we ignore rate limits. We also handle the
>>>> case of the boost being reset prematurely by adding a guard in
>>>> sugov_iowait_apply() that only reduces the boost after 1ms; it seems the
>>>> iowait boost mechanism relied on rate_limit_us and cfs_rq.decayed
>>>> preventing any updates from happening soon after the iowait boost.
>>>>
>>>> The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
>>>> time stamps, otherwise we can end up delaying updates for normal
>>>> requests.
>>>
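To make the SCHED_CPUFREQ_FORCE_UPDATE semantics quoted above concrete,
here is a minimal sketch of how such a flag could bypass rate_limit_us in
schedutil's sugov_should_update_freq(). The extra flags argument and the
early return are illustrative assumptions, not the code from the patch:

static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
				     unsigned int flags)
{
	s64 delta_ns;

	if (!cpufreq_this_cpu_can_update(sg_policy->policy))
		return false;

	/*
	 * Hypothetical: RT/DL (and iowait boost) updates must never be
	 * dropped because of rate limiting. The rate-limit timestamp is
	 * left untouched here so normal requests aren't delayed further.
	 */
	if (flags & SCHED_CPUFREQ_FORCE_UPDATE)
		return true;

	delta_ns = time - sg_policy->last_freq_update_time;
	return delta_ns >= sg_policy->freq_update_delay_ns;
}
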
>>> Hi Qais,
>>> the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possibility of spamming
>>> freq updates still bothered me, so let me share my thoughts even though
>>> it might be niche enough for us not to care.
>>>
>>> 1. On fast_switch systems, assuming they are fine with handling the
>>> actual updates, we have a bit more work on each context_switch() and
>>> some synchronisation, too. That should be fine, if anything there's
>>> some performance regression in a couple of niche cases.
>>>
>>> 2. On !fast_switch systems this gets more interesting IMO. So we have
>>> a sugov DEADLINE task wakeup for every update request that results in
>>> a freq change. This task will preempt whatever is running and currently
>>> will pretty much always run on the CPU it ran on last (so the first CPU
>>> of the PD).
>>
>> The !fast_switch case is a bit of a concern for me too, but not for the
>> same reason and maybe the opposite of yours, IIUC your proposal below:
>>
>> With fast_switch we have the following sequence:
>>
>> sched_switch() to task A
>> cpufreq_driver_fast_switch -> write new freq target
>> run task A
>>
>> This is pretty straightforward, but we have the following sequence
>> with !fast_switch:
>>
>> sched_switch() to task A
>> queue_irq_work -> raise an IPI on local CPU
>> Handle IPI -> wakeup and queue sugov dl worker on local CPU (always
>> with 1 CPU per PD)
>> sched_switch() to sugov dl task
>> __cpufreq_driver_target() which can possibly block on a lock
>> sched_switch() to task A
>> run task A
>>
> 
> sent a bit too early
> 
>> We can possibly have 2 context switches and one IPI for each "normal"
>> context switch, which is not really optimal
> 
> It would be good to find a way to skip the spurious back and forth
> between the normal task and sugov
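
For reference, the !fast_switch path traced above corresponds roughly to
the following in kernel/sched/cpufreq_schedutil.c (paraphrased from the
mainline code; details vary between kernel versions):

static void sugov_irq_work(struct irq_work *irq_work)
{
	struct sugov_policy *sg_policy =
		container_of(irq_work, struct sugov_policy, irq_work);

	/* Wake the per-policy sugov worker, which runs as a DL task */
	kthread_queue_work(&sg_policy->worker, &sg_policy->work);
}

static void sugov_work(struct kthread_work *work)
{
	struct sugov_policy *sg_policy =
		container_of(work, struct sugov_policy, work);
	unsigned int freq;
	unsigned long flags;

	raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
	freq = sg_policy->next_freq;
	sg_policy->work_in_progress = false;
	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);

	/* Runs in the sugov DL worker and can block on the driver lock */
	mutex_lock(&sg_policy->work_lock);
	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
	mutex_unlock(&sg_policy->work_lock);
}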

Just to confirm I understand your concern correctly, that's more or
less the behavior without Qais' patch as well though, isn't it?
Ignoring the move from "this happens at enqueue" to "this happens at
context switch".
Since sugov doesn't queue any work if the desired frequency doesn't
change, I imagine it's not too bad?
Or are you more concerned that the work is queued multiple times
simultaneously for multiple CPUs of the same PD? If so, we can
work around that with the work_in_progress state to at least narrow
that window by a lot.
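
For reference, the existing work_in_progress guard sits in
sugov_deferred_update(), roughly (paraphrased; kernel-version dependent):

static void sugov_deferred_update(struct sugov_policy *sg_policy)
{
	/*
	 * Only one irq_work (and hence one sugov worker wakeup) is in
	 * flight per policy at a time; requests from other CPUs of the
	 * same PD meanwhile only update next_freq, which sugov_work()
	 * picks up when it eventually runs.
	 */
	if (!sg_policy->work_in_progress) {
		sg_policy->work_in_progress = true;
		irq_work_queue(&sg_policy->irq_work);
	}
}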

[snip]
