linux-kernel - Re: [PATCH 2/2] cpufreq: schedutil: Ignore rate limit when scaling up with FIE present

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z1zqQaH6aoCsW3UL@sultan-box.localdomain>
Date: Fri, 13 Dec 2024 18:15:29 -0800
From: "Sultan Alsawaf (unemployed)" <sultan@...neltoast.com>
To: Christian Loehle <christian.loehle@....com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>, linux-pm@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Beata Michalska <Beata.Michalska@....com>,
	Qais Yousef <qyousef@...alina.io>,
	Pierre Gondois <pierre.gondois@....com>,
	Lukasz Luba <lukasz.luba@....com>
Subject: Re: [PATCH 2/2] cpufreq: schedutil: Ignore rate limit when scaling
 up with FIE present

On Thu, Dec 12, 2024 at 12:06:20PM +0000, Christian Loehle wrote:
> On 12/12/24 01:57, Sultan Alsawaf wrote:
> > From: "Sultan Alsawaf (unemployed)" <sultan@...neltoast.com>
> 
> Hi Sultan,
> (Adding further CCs that might be interested in this)
> Good to see some input here again, if it becomes a regular thing,
> which I would welcome, you might want to look into git send-email
> or b4 relay, at least in my inbox this series looks strange.
> Also no cover letter.

Hi Christian,

Thank you for the encouraging words! :-)

Sorry about the strangeness. I knew better and should've sent a cover letter;
when I saw how the series looked on the list a few minutes after sending it, I
grimaced and realized my mistake. FWIW, I did use git send-email, but I hadn't
heard about b4; thanks for the tip!

> > 
> > When schedutil disregards a frequency transition due to the transition rate
> > limit, there is no guaranteed deadline as to when the frequency transition
> > will actually occur after the rate limit expires. For instance, depending
> > on how long a CPU spends in a preempt/IRQs disabled context, a rate-limited
> > frequency transition may be delayed indefinitely, until said CPU reaches
> > the scheduler again. This also hurts tasks boosted via UCLAMP_MIN.
> 
> Realistically this will be the tick at worst for the systems you care about,
> I assume.

Typically, yes, especially so with PREEMPT_RT. It can get quite bad otherwise if
preempt/IRQs are disabled for a while, e.g. a common offender is zone->lock in
the page allocator.

> > 
> > For frequency transitions _down_, this only poses a theoretical loss of
> > energy savings since a CPU may remain at a higher frequency than necessary
> > for an indefinite period beyond the rate limit expiry.
> 
> theoretical?

I can't think of a realistic way to measure these lost energy savings, much less
reclaim them.

> > 
> > For frequency transitions _up_, however, this poses a significant hit to
> > performance when a CPU is stuck at an insufficient frequency for an
> > indefinitely long time. In latency-sensitive and bursty workloads
> > especially, a missed frequency transition up can result in a significant
> > performance loss due to a CPU operating at an insufficient frequency for
> > too long.
> 
> The term significant implies you have some numbers, please go ahead and
> share them then.

On a Pixel 9 with fast_switch (that I implemented myself), and AMU-driven FIE:

Test command:

	taskset 40 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched messaging -g 50

The last Cortex-A720 core in the PD of three A720 cores is tested, with
rate_limit_us set to 2000.

Before this patch, schedutil:
----------------------------
 Performance counter stats for 'perf bench sched messaging -g 50' (10 runs):

        9121062386      cycles                           #    2.541 GHz                      ( +-  1.17% )
       10863940366      instructions                     #    1.22  insn per cycle           ( +-  0.20% )
           3577.62 msec task-clock                       #    0.991 CPUs utilized            ( +-  0.82% )

            3.6108 +- 0.0294 seconds time elapsed  ( +-  0.81% )

After this patch, schedutil:
----------------------------
 Performance counter stats for 'perf bench sched messaging -g 50' (10 runs):

        8577233522      cycles                           #    2.538 GHz                      ( +-  0.73% )
       10514835388      instructions                     #    1.26  insn per cycle           ( +-  0.08% )
           3533.64 msec task-clock                       #    1.039 CPUs utilized            ( +-  0.67% )

            3.4005 +- 0.0238 seconds time elapsed  ( +-  0.70% )

With performance governor:
--------------------------
 Performance counter stats for 'perf bench sched messaging -g 50' (10 runs):

        8712516777      cycles                           #    2.653 GHz                      ( +-  0.84% )
       10543289262      instructions                     #    1.24  insn per cycle           ( +-  0.10% )
           3358.74 msec task-clock                       #    1.017 CPUs utilized            ( +-  0.84% )

            3.3028 +- 0.0283 seconds time elapsed  ( +-  0.86% )

There is an improvement of about 6% with schedutil after this patch. These
results are very consistent across several runs.

> I'm not sure if you're aware of Qais' series that also intends to ignore the
> rate-limit (in certain cases).
> https://lore.kernel.org/lkml/20240728184551.42133-1-qyousef@layalina.io/

I was unaware, thanks for the link.

I read through the cover letter and code, and my initial thought is that
task_tick_fair() is too slow to cover fair tasks' needs. Given that a PELT
period is ~1 ms, there can be several load updates in between ticks.

I also think that Qais' series permits too many frequency updates too quickly,
for example when switching from RT/DL to a fair task. On systems with a lot of
threaded IRQs (or PREEMPT_RT), I can imagine this results in the frequency
bouncing around a lot.

> I would mostly agree with your below argument regarding FIE and did lean
> towards dropping it altogether. The main issue I had with Qais' series
> was !fast_switch platforms and the resulting preemption by the DL task
> (Often on a remote, possibly idle CPU of the same perf-domain).
> Unlimited frequency updates can really hurt both power and throughput there.
> 
> Fortunately given the low-pass filter nature of PELT, without some special
> workloads, most calls to cpufreq_update_util() are dropped because there
> is no resulting frequency change anyway.
> 
> You keeping the rate-limit when reducing the frequency might be enough to
> work around the issue though. I will run some experiments like I did for
> Qais' series.

Yeah, my thinking is that always allowing updates _only_ when increasing the
frequency naturally bounds the number of possible updates in a given period. You
can't keep going up forever, unless you have a CPU with a ridiculously huge
number of discrete frequency steps. :-)

> I'll also go and check for unintended changes in iowait boost (that
> interacts with the rate-limit too).

Thanks!

> > 
> > When support for the Frequency Invariant Engine (FIE) _isn't_ present, a
> > rate limit is always required for the scheduler to compute CPU utilization
> > with some semblance of accuracy: any frequency transition that occurs
> > before the previous transition latches would result in the scheduler not
> > knowing the frequency a CPU is actually operating at, thereby trashing the
> > computed CPU utilization.
> > 
> > However, when FIE support _is_ present, there's no technical requirement to
> > rate limit all frequency transitions to a cpufreq driver's reported
> > transition latency. With FIE, the scheduler's CPU utilization tracking is
> > unaffected by any frequency transitions that occur before the previous
> > frequency is latched.
> > 
> > Therefore, ignore the frequency transition rate limit when scaling up on
> > systems where FIE is present. This guarantees that transitions to a higher
> > frequency cannot be indefinitely delayed, since they simply cannot be
> > delayed at all.
> > 
> > Signed-off-by: Sultan Alsawaf (unemployed) <sultan@...neltoast.com>
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 35 +++++++++++++++++++++++++++-----
> >  1 file changed, 30 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index e51d5ce730be..563baab89a24 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -59,10 +59,15 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> >  
> >  /************************ Governor internals ***********************/
> >  
> > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > +static bool sugov_should_rate_limit(struct sugov_policy *sg_policy, u64 time)
> >  {
> > -	s64 delta_ns;
> > +	s64 delta_ns = time - sg_policy->last_freq_update_time;
> > +
> > +	return delta_ns < sg_policy->freq_update_delay_ns;
> > +}
> >  
> > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > +{
> >  	/*
> >  	 * Since cpufreq_update_util() is called with rq->lock held for
> >  	 * the @target_cpu, our per-CPU data is fully serialized.
> > @@ -87,17 +92,37 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> >  		return true;
> >  	}
> >  
> > -	delta_ns = time - sg_policy->last_freq_update_time;
> > +	/*
> > +	 * When frequency-invariant utilization tracking is present, there's no
> > +	 * rate limit when increasing frequency. Therefore, the next frequency
> > +	 * must be determined before a decision can be made to rate limit the
> > +	 * frequency change, hence the rate limit check is bypassed here.
> > +	 */
> > +	if (arch_scale_freq_invariant())
> > +		return true;
> >  
> > -	return delta_ns >= sg_policy->freq_update_delay_ns;
> > +	return !sugov_should_rate_limit(sg_policy, time);
> >  }
> >  
> >  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> >  				   unsigned int next_freq)
> >  {
> > +	/*
> > +	 * When a frequency update isn't mandatory (!need_freq_update), the rate
> > +	 * limit is checked again upon frequency reduction because systems with
> > +	 * frequency-invariant utilization bypass the rate limit check entirely
> > +	 * in sugov_should_update_freq(). This is done so that the rate limit
> > +	 * can be applied only for frequency reduction when frequency-invariant
> > +	 * utilization is present. Now that the next frequency is known, the
> > +	 * rate limit can be selectively applied to frequency reduction on such
> > +	 * systems. A check for arch_scale_freq_invariant() is omitted here
> > +	 * because unconditionally rechecking the rate limit is cheaper.
> > +	 */
> >  	if (sg_policy->need_freq_update)
> >  		sg_policy->need_freq_update = false;
> > -	else if (sg_policy->next_freq == next_freq)
> > +	else if (next_freq == sg_policy->next_freq ||
> > +		 (next_freq < sg_policy->next_freq &&
> > +		  sugov_should_rate_limit(sg_policy, time)))
> >  		return false;
> >  
> >  	sg_policy->next_freq = next_freq;
> 
> Code seems to match your description FWIW.
> Maybe the comments could be trimmed somewhat.

How's this for the sugov_update_next_freq() comment?
	/*
	 * When a frequency update isn't mandatory, the rate limit is checked
	 * again upon frequency reduction because systems with frequency
	 * invariance bypass the rate limit check in sugov_should_update_freq().
	 * This is done so that the rate limit can be applied only for frequency
	 * reduction. A check for arch_scale_freq_invariant() is omitted here
	 * because unconditionally rechecking the rate limit is cheaper.
	 */

> Regards,
> Christian

Thanks,
Sultan