Message-ID: <20240724212255.mfr2ybiv2j2uqek7@airbuntu>
Date: Wed, 24 Jul 2024 22:22:55 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Daniel Bristot de Oliveira <bristot@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Christian Loehle <christian.loehle@....com>,
	Hongyan Xia <hongyan.xia2@....com>,
	John Stultz <jstultz@...gle.com>, linux-pm@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6] sched: Consolidate cpufreq updates

On 07/09/24 09:48, Vincent Guittot wrote:
> On Wed, 19 Jun 2024 at 22:14, Qais Yousef <qyousef@...alina.io> wrote:
> >
> > Improve the interaction with cpufreq governors by making the
> > cpufreq_update_util() calls more intentional.
> >
> > At the moment we send them when load is updated for CFS, when bandwidth is
> > updated for DL and at enqueue/dequeue for RT. But this can lead to too many
> > updates being sent in a short period of time, some of which can be ignored
> > at a critical moment due to the rate_limit_us in schedutil.
> >
> > For example, on simultaneous task enqueue on a CPU where the 2nd task is
> > bigger and requires a higher freq, the cpufreq_update_util() call
> > triggered by the first task will lead to the 2nd request being dropped
> > until the tick, or until another CPU in the same policy triggers a freq
> > update shortly after.
> >
> > Updates at enqueue for RT are not strictly required, though they do help
> > reduce the delay in switching the frequency and the potential observation
> > of a lower frequency during this delay. But the current logic doesn't
> > intentionally (at least to my understanding) try to speed up the request.
> >
> > To help reduce the amount of cpufreq updates and make them more
> > purposeful, consolidate them into these locations:
> >
> > 1. context_switch()
> > 2. task_tick_fair()
> > 3. update_blocked_averages()
> > 4. on syscall that changes policy or uclamp values
> > 5. on check_preempt_wakeup_fair() if wakeup preemption failed
> 
> So the above is the new thing to take care of enqueues that generate
> sudden updates of util_est and could require a frequency change, isn't
> it?

Yes.

> 
> >
> > The update at context switch should help guarantee that DL and RT get
> > the right frequency straightaway when they're RUNNING. As mentioned,
> > though, the update will happen slightly after enqueue_task(); in
> > an ideal world these tasks should be RUNNING ASAP and this additional
> 
> That's probably ok for RT, although it means possibly running at the lowest
> frequency up to the switch, but I suppose it's not a big concern as it is
> already the case if the RT task ends up on a different CPU than the local
> one.
> 
> I'm more concerned about DL tasks. cpu_bw_dl() reflects the min
> bandwidth/capacity needed to run all enqueued DL tasks in time. The DL
> bandwidth is updated when a DL task is enqueued, and this
> bandwidth/capacity should be applied immediately and not at the next
> context switch; otherwise you will not have enough bandwidth if the
> newly enqueued DL task does not preempt the current DL task.

Hmm. Yes.

I moved the DL updates to __add_running_bw() with the new FORCE_UPDATE flag to
ensure the rate limit doesn't accidentally block them. No need for an update at
dequeue though, as the context switch will handle that.
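
Roughly what I have in mind (untested sketch; the exact flag name and plumbing
may differ in the next posting):

static inline
void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
	u64 old = dl_rq->running_bw;

	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
	dl_rq->running_bw += dl_bw;
	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
	/*
	 * The new DL bandwidth must take effect immediately; bypass
	 * schedutil's rate_limit_us instead of waiting for the next
	 * context switch to send the update.
	 */
	cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_FORCE_UPDATE);
}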

> > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
> > +                                    unsigned int flags)
> >  {
> >         s64 delta_ns;
> >
> > +       delta_ns = time - sg_policy->last_freq_update_time;
> > +
> > +       /*
> > +        * We want to update cpufreq at context switch, but on systems with
> > +        * long TICK values, this can happen after a long time while more tasks
> > +        * could have been added meanwhile, leaving us potentially running at
> > +        * an inadequate frequency for an extended period of time.
> > +        *
> > +        * This logic should only apply when a new fair task was added to the
> > +        * CPU. We'd want to defer to context switch as much as possible, but
> > +        * to avoid the potential delays mentioned above, let's check if this
> > +        * additional task warrants sending an update sooner.
> > +        *
> > +        * We want to ensure there's at least an update every
> > +        * sysctl_sched_base_slice.
> > +        */
> > +       if (likely(flags & SCHED_CPUFREQ_TASK_ENQUEUED)) {
> > +               if (delta_ns < sysctl_sched_base_slice)
> 
> I'm not sure that this is the right condition. This value seems quite
> long to me and not that much different from a 4ms tick. Should we use
> the 1024us of PELT?

I thought about using 1ms, but opted for this. Comparing against NSEC_PER_MSEC
now. I think our base_slice should be 1ms by default, but maybe I am trying to
do too much in one go :)
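
i.e. against the hunk quoted above, the check becomes roughly:

-		if (delta_ns < sysctl_sched_base_slice)
+		if (delta_ns < NSEC_PER_MSEC)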

> 
> Also, I ran the use case that I ran previously and I have cases where
> we wait more than 4.5ms between the enqueue of the big task and the
> freq update (with a 1ms tick period), so there are probably some corner
> cases which are not correctly handled.
> 
> My use case is 2 tasks running on the same CPU: a short task A running
> 500us with a period of 19457us and a long task B running 19000us with
> a period of 139777us. The periods are set to never be in sync with the
> tick, which is 1ms. There are cases when task B wakes up while task A
> is already running and doesn't preempt A, and the freq update happens
> only 4.5ms after task B wakes up and task A has gone back to sleep,
> whereas we should switch immediately from 760MHz to 1958MHz.

Thanks for that. From what I see we have two problems:

1. rq->cfs.decayed is not set to true after the new enqueue.
2. rate_limit_us can still block the update and there's nothing we can do
   about this here.

I could reproduce the problem without my patch by the way.

When we fail to preempt, I now always force rq->cfs.decayed to be true. Not
sure if we're hitting another bug or whether decayed can legitimately be false
sometimes after an enqueue.
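
Something along these lines at the point where check_preempt_wakeup_fair()
decides not to preempt (untested sketch, may differ slightly from what I end
up posting):

	/*
	 * Wakeup preemption failed; the woken task will have to wait for a
	 * later context switch to run. Make sure the cpufreq update isn't
	 * filtered out because nothing appeared to have decayed on this
	 * enqueue, and request the update now instead of at the next tick.
	 */
	rq->cfs.decayed = true;
	cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED);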

> > @@ -614,6 +619,7 @@ int __sched_setscheduler(struct task_struct *p,
> >         int retval, oldprio, newprio, queued, running;
> >         const struct sched_class *prev_class;
> >         struct balance_callback *head;
> > +       bool update_cpufreq;
> >         struct rq_flags rf;
> >         int reset_on_fork;
> >         int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
> > @@ -796,7 +802,8 @@ int __sched_setscheduler(struct task_struct *p,
> >                 __setscheduler_params(p, attr);
> >                 __setscheduler_prio(p, newprio);
> >         }
> > -       __setscheduler_uclamp(p, attr);
> > +
> > +       update_cpufreq = __setscheduler_uclamp(p, attr);
> >
> >         if (queued) {
> >                 /*
> > @@ -811,7 +818,14 @@ int __sched_setscheduler(struct task_struct *p,
> >         if (running)
> >                 set_next_task(rq, p);
> >
> > -       check_class_changed(rq, p, prev_class, oldprio);
> > +       update_cpufreq |= check_class_changed(rq, p, prev_class, oldprio);
> > +
> > +       /*
> > +        * Changing the class or a uclamp value implies we need to send a
> > +        * cpufreq update.
> > +        */
> > +       if (update_cpufreq && running)
> 
> Why running? It should be queued, as we are max aggregating.

The new logic is to ignore changes at enqueue and defer them to the context
switch, when a task becomes RUNNING.

That said, we do now actually send an update at enqueue if we fail to preempt.
I will do something similar here for a queued task to match the new logic; see
the sketch below.
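
One possible shape (untested sketch; update_cpufreq_current() is from the hunk
above, the force flag name may differ in what I end up posting):

	if (update_cpufreq) {
		if (running) {
			update_cpufreq_current(rq);
		} else if (queued) {
			/* Don't wait for the next enqueue or context switch. */
			cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE);
		}
	}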


Thanks!

--
Qais Yousef

> 
> > +               update_cpufreq_current(rq);
> >
> >         /* Avoid rq from going away on us: */
> >         preempt_disable();
> > --
> > 2.34.1
> >
