[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtBWLe4hMBhJeSqvoW10dAF3Bgj+zcYGMgBfwUhkgytkEQ@mail.gmail.com>
Date: Tue, 13 Aug 2024 10:25:39 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Christian Loehle <christian.loehle@....com>
Cc: Qais Yousef <qyousef@...alina.io>, "Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>, Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Hongyan Xia <hongyan.xia2@....com>,
John Stultz <jstultz@...gle.com>, linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v7] sched: Consolidate cpufreq updates
On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@....com> wrote:
>
> On 7/28/24 19:45, Qais Yousef wrote:
> > Improve the interaction with cpufreq governors by making the
> > cpufreq_update_util() calls more intentional.
> >
> > At the moment we send them when load is updated for CFS, bandwidth for
> > DL and at enqueue/dequeue for RT. But this can lead to too many updates
> > sent in a short period of time and potentially be ignored at a critical
> > moment due to the rate_limit_us in schedutil.
> >
> > For example, simultaneous task enqueue on the CPU where 2nd task is
> > bigger and requires higher freq. The trigger to cpufreq_update_util() by
> > the first task will lead to dropping the 2nd request until tick. Or
> > another CPU in the same policy triggers a freq update shortly after.
> >
> > Updates at enqueue for RT are not strictly required. Though they do help
> > to reduce the delay for switching the frequency and the potential
> > observation of lower frequency during this delay. But current logic
> > doesn't intentionally (at least to my understanding) try to speed up the
> > request.
> >
> > To help reduce the amount of cpufreq updates and make them more
> > purposeful, consolidate them into these locations:
> >
> > 1. context_switch()
> > 2. task_tick_fair()
> > 3. sched_balance_update_blocked_averages()
> > 4. on sched_setscheduler() syscall that changes policy or uclamp values
> > 5. on check_preempt_wakeup_fair() if wakeup preemption failed
> > 6. on __add_running_bw() to guarantee DL bandwidth requirements.
> >
> > The update at context switch should help guarantee that RT get the right
> > frequency straightaway when they're RUNNING. As mentioned though the
> > update will happen slightly after enqueue_task(); though in an ideal
> > world these tasks should be RUNNING ASAP and this additional delay
> > should be negligible. For fair tasks we need to make sure we send
> > a single update for every decay for the root cfs_rq. Any changes to the
> > rq will be deferred until the next task is ready to run, or we hit TICK.
> > But we are guaranteed the task is running at a level that meets its
> > requirements after enqueue.
> >
> > To guarantee RT and DL tasks updates are never missed, we add a new
> > SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
> > already running at the right freq, the governor will end up doing
> > nothing, but we eliminate the risk of the task ending up accidentally
> > running at the wrong freq due to rate_limit_us.
> >
> > Similarly for iowait boost, we ignore rate limits. We also handle a case
> > of a boost reset prematurely by adding a guard in sugov_iowait_apply()
> > to reduce the boost after 1ms which seems iowait boost mechanism relied
> > on rate_limit_us and cfs_rq.decayed preventing any updates to happen
> > soon after iowait boost.
> >
> > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
> > time stamps otherwise we can end up delaying updates for normal
> > requests.
>
> Hi Qais,
> the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming
> freq updates still bothered me so let me share my thoughts even though
> it might be niche enough for us not to care.
>
> 1. On fast_switch systems, assuming they are fine with handling the
> actual updates, we have a bit more work on each context_switch() and
> some synchronisation, too. That should be fine, if anything there's
> some performance regression in a couple of niche cases.
>
> 2. On !fast_switch systems this gets more interesting IMO. So we have
> a sugov DEADLINE task wakeup for every (in a freq-diff resulting)
> update request. This task will preempt whatever and currently will
> pretty much always be running on the CPU it ran last on (so first CPU
> of the PD).
The !fast_switch is a bit of concern for me too but not for the same
reason and maybe the opposite of yours IIUC your proposal below:
With fast_switch we have the following sequence:
sched_switch() to task A
cpufreq_driver_fast_switch -> write new freq target
run task A
This is pretty straight forward but we have the following sequence
with !fast_switch
sched_switch() to task A
queue_irq_work -> raise an IPI on local CPU
Handle IPI -> wakeup and queue sugov dl worker on local CPU (always
with 1 CPU per PD)
sched_switch() to sugov dl task
__cpufreq_driver_target() which can possibly block on a lock
sched_switch() to task A
run task A
We can possibly have 2 context switch and one IPi for each "normal"
context switch which is not really optimal
>
> The weirdest case I can think of right now is two FAIR iowait tasks on
> e.g. CPU1 keep waking up the DEADLINE task on CPU0 (same PD) regardless
> of what is running there.
> Potentially that means two fair tasks on one CPU CPU-starving an RT
> task on another CPU, because it keeps getting preempted by the DEADLINE
> sugov worker.
> For this to actually happen we need to ensure the tasks
> context-switching actually results in a different requested frequency
> every time, which is a bit unlikely without UCLAMP_MAX, let's say task
> A has 512, task B 1024, task C (RT on CPU1 should have uclamp_min<=512
> then too otherwise frequency may be dictated by the RT task anyway.)
> (Note the entire thing also works with Tasks A & B being lower-prio RT
> too, instead of FAIR and iowait.)
>
> Note that due to the nature of SCHED_DEADLINE and the sugov task having
> 10s period and 1s runtime this behavior is limited to 1s every 10s.
> The remaining 9s (replenishment time) we won't see any cpufreq updates
> for that PD at all though.
>
> To reproduce, I have [4,5] being one PD:
>
> fio --minimal --time_based --name=cpu5_1024uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5
> uclampset -M 512 fio --minimal --time_based --name=cpu5_512uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5
> and then your RT task on CPU4.
>
> Something like this would mitigate that, I think it makes sense even
> without your patch to get a more predictable idle pattern, just maybe
> not exactly this patch of course.
>
> -->8--
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 64f614b3db20..c186f8f999fe 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -567,6 +567,8 @@ static void sugov_irq_work(struct irq_work *irq_work)
>
> sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
>
> + /* Try to wake the task here to not preempt or wake up another CPU. */
> + sg_policy->worker.task->wake_cpu = smp_processor_id();
> kthread_queue_work(&sg_policy->worker, &sg_policy->work);
> }
>
>
>
>
Powered by blists - more mailing lists