[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z1zqQaH6aoCsW3UL@sultan-box.localdomain>
Date: Fri, 13 Dec 2024 18:15:29 -0800
From: "Sultan Alsawaf (unemployed)" <sultan@...neltoast.com>
To: Christian Loehle <christian.loehle@....com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org,
Beata Michalska <Beata.Michalska@....com>,
Qais Yousef <qyousef@...alina.io>,
Pierre Gondois <pierre.gondois@....com>,
Lukasz Luba <lukasz.luba@....com>
Subject: Re: [PATCH 2/2] cpufreq: schedutil: Ignore rate limit when scaling
up with FIE present
On Thu, Dec 12, 2024 at 12:06:20PM +0000, Christian Loehle wrote:
> On 12/12/24 01:57, Sultan Alsawaf wrote:
> > From: "Sultan Alsawaf (unemployed)" <sultan@...neltoast.com>
>
> Hi Sultan,
> (Adding further CCs that might be interested in this)
> Good to see some input here again, if it becomes a regular thing,
> which I would welcome, you might want to look into git send-email
> or b4 relay, at least in my inbox this series looks strange.
> Also no cover letter.
Hi Christian,
Thank you for the encouraging words! :-)
Sorry about the strangeness. I knew better and should've sent a cover letter;
when I saw how the series looked on the list a few minutes after sending it, I
grimaced and realized my mistake. FWIW, I did use git send-email, but I hadn't
heard about b4; thanks for the tip!
> >
> > When schedutil disregards a frequency transition due to the transition rate
> > limit, there is no guaranteed deadline as to when the frequency transition
> > will actually occur after the rate limit expires. For instance, depending
> > on how long a CPU spends in a preempt/IRQs disabled context, a rate-limited
> > frequency transition may be delayed indefinitely, until said CPU reaches
> > the scheduler again. This also hurts tasks boosted via UCLAMP_MIN.
>
> Realistically this will be the tick at worst for the systems you care about,
> I assume.
Typically, yes, especially so with PREEMPT_RT. It can get quite bad otherwise if
preempt/IRQs are disabled for a while, e.g. a common offender is zone->lock in
the page allocator.
> >
> > For frequency transitions _down_, this only poses a theoretical loss of
> > energy savings since a CPU may remain at a higher frequency than necessary
> > for an indefinite period beyond the rate limit expiry.
>
> theoretical?
I can't think of a realistic way to measure these lost energy savings, much less
reclaim them.
> >
> > For frequency transitions _up_, however, this poses a significant hit to
> > performance when a CPU is stuck at an insufficient frequency for an
> > indefinitely long time. In latency-sensitive and bursty workloads
> > especially, a missed frequency transition up can result in a significant
> > performance loss due to a CPU operating at an insufficient frequency for
> > too long.
>
> The term significant implies you have some numbers, please go ahead and
> share them then.
On a Pixel 9 with fast_switch (that I implemented myself), and AMU-driven FIE:
Test command:
taskset 40 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched messaging -g 50
The last Cortex-A720 core in the PD of three A720 cores is tested, with
rate_limit_us set to 2000.
Before this patch, schedutil:
----------------------------
Performance counter stats for 'perf bench sched messaging -g 50' (10 runs):
9121062386 cycles # 2.541 GHz ( +- 1.17% )
10863940366 instructions # 1.22 insn per cycle ( +- 0.20% )
3577.62 msec task-clock # 0.991 CPUs utilized ( +- 0.82% )
3.6108 +- 0.0294 seconds time elapsed ( +- 0.81% )
After this patch, schedutil:
----------------------------
Performance counter stats for 'perf bench sched messaging -g 50' (10 runs):
8577233522 cycles # 2.538 GHz ( +- 0.73% )
10514835388 instructions # 1.26 insn per cycle ( +- 0.08% )
3533.64 msec task-clock # 1.039 CPUs utilized ( +- 0.67% )
3.4005 +- 0.0238 seconds time elapsed ( +- 0.70% )
With performance governor:
--------------------------
Performance counter stats for 'perf bench sched messaging -g 50' (10 runs):
8712516777 cycles # 2.653 GHz ( +- 0.84% )
10543289262 instructions # 1.24 insn per cycle ( +- 0.10% )
3358.74 msec task-clock # 1.017 CPUs utilized ( +- 0.84% )
3.3028 +- 0.0283 seconds time elapsed ( +- 0.86% )
There is an improvement of about 6% with schedutil after this patch. These
results are very consistent across several runs.
> I'm not sure if you're aware of Qais' series that also intends to ignore the
> rate-limit (in certain cases).
> https://lore.kernel.org/lkml/20240728184551.42133-1-qyousef@layalina.io/
I was unaware, thanks for the link.
I read through the cover letter and code, and my initial thought is that
task_tick_fair() is too slow to cover fair tasks' needs. Given that a PELT
period is ~1 ms, there can be several load updates in between ticks.
I also think that Qais' series permits too many frequency updates too quickly,
for example when switching from RT/DL to a fair task. On systems with a lot of
threaded IRQs (or PREEMPT_RT), I can imagine this results in the frequency
bouncing around a lot.
> I would mostly agree with your below argument regarding FIE and did lean
> towards dropping it altogether. The main issue I had with Qais' series
> was !fast_switch platforms and the resulting preemption by the DL task
> (Often on a remote, possibly idle CPU of the same perf-domain).
> Unlimited frequency updates can really hurt both power and throughput there.
>
> Fortunately given the low-pass filter nature of PELT, without some special
> workloads, most calls to cpufreq_update_util() are dropped because there
> is no resulting frequency change anyway.
>
> You keeping the rate-limit when reducing the frequency might be enough to
> work around the issue though. I will run some experiments like I did for
> Qais' series.
Yeah, my thinking is that always allowing updates _only_ when increasing the
frequency naturally bounds the number of possible updates in a given period. You
can't keep going up forever, unless you have a CPU with a ridiculously huge
number of discrete frequency steps. :-)
> I'll also go and check for unintended changes in iowait boost (that
> interacts with the rate-limit too).
Thanks!
> >
> > When support for the Frequency Invariant Engine (FIE) _isn't_ present, a
> > rate limit is always required for the scheduler to compute CPU utilization
> > with some semblance of accuracy: any frequency transition that occurs
> > before the previous transition latches would result in the scheduler not
> > knowing the frequency a CPU is actually operating at, thereby trashing the
> > computed CPU utilization.
> >
> > However, when FIE support _is_ present, there's no technical requirement to
> > rate limit all frequency transitions to a cpufreq driver's reported
> > transition latency. With FIE, the scheduler's CPU utilization tracking is
> > unaffected by any frequency transitions that occur before the previous
> > frequency is latched.
> >
> > Therefore, ignore the frequency transition rate limit when scaling up on
> > systems where FIE is present. This guarantees that transitions to a higher
> > frequency cannot be indefinitely delayed, since they simply cannot be
> > delayed at all.
> >
> > Signed-off-by: Sultan Alsawaf (unemployed) <sultan@...neltoast.com>
> > ---
> > kernel/sched/cpufreq_schedutil.c | 35 +++++++++++++++++++++++++++-----
> > 1 file changed, 30 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index e51d5ce730be..563baab89a24 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -59,10 +59,15 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> >
> > /************************ Governor internals ***********************/
> >
> > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > +static bool sugov_should_rate_limit(struct sugov_policy *sg_policy, u64 time)
> > {
> > - s64 delta_ns;
> > + s64 delta_ns = time - sg_policy->last_freq_update_time;
> > +
> > + return delta_ns < sg_policy->freq_update_delay_ns;
> > +}
> >
> > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > +{
> > /*
> > * Since cpufreq_update_util() is called with rq->lock held for
> > * the @target_cpu, our per-CPU data is fully serialized.
> > @@ -87,17 +92,37 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > return true;
> > }
> >
> > - delta_ns = time - sg_policy->last_freq_update_time;
> > + /*
> > + * When frequency-invariant utilization tracking is present, there's no
> > + * rate limit when increasing frequency. Therefore, the next frequency
> > + * must be determined before a decision can be made to rate limit the
> > + * frequency change, hence the rate limit check is bypassed here.
> > + */
> > + if (arch_scale_freq_invariant())
> > + return true;
> >
> > - return delta_ns >= sg_policy->freq_update_delay_ns;
> > + return !sugov_should_rate_limit(sg_policy, time);
> > }
> >
> > static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> > unsigned int next_freq)
> > {
> > + /*
> > + * When a frequency update isn't mandatory (!need_freq_update), the rate
> > + * limit is checked again upon frequency reduction because systems with
> > + * frequency-invariant utilization bypass the rate limit check entirely
> > + * in sugov_should_update_freq(). This is done so that the rate limit
> > + * can be applied only for frequency reduction when frequency-invariant
> > + * utilization is present. Now that the next frequency is known, the
> > + * rate limit can be selectively applied to frequency reduction on such
> > + * systems. A check for arch_scale_freq_invariant() is omitted here
> > + * because unconditionally rechecking the rate limit is cheaper.
> > + */
> > if (sg_policy->need_freq_update)
> > sg_policy->need_freq_update = false;
> > - else if (sg_policy->next_freq == next_freq)
> > + else if (next_freq == sg_policy->next_freq ||
> > + (next_freq < sg_policy->next_freq &&
> > + sugov_should_rate_limit(sg_policy, time)))
> > return false;
> >
> > sg_policy->next_freq = next_freq;
>
> Code seems to match your description FWIW.
> Maybe the comments could be trimmed somewhat.
How's this for the sugov_update_next_freq() comment?
/*
* When a frequency update isn't mandatory, the rate limit is checked
* again upon frequency reduction because systems with frequency
* invariance bypass the rate limit check in sugov_should_update_freq().
* This is done so that the rate limit can be applied only for frequency
* reduction. A check for arch_scale_freq_invariant() is omitted here
* because unconditionally rechecking the rate limit is cheaper.
*/
> Regards,
> Christian
Thanks,
Sultan
Powered by blists - more mailing lists