lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtCGgJiFDrZYpynCiHBnQx48A9RsAY9-6Hshduo4ymGJLQ@mail.gmail.com>
Date: Mon, 15 Jan 2024 09:21:46 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Qais Yousef <qyousef@...alina.io>
Cc: Wyes Karny <wkarny@...il.com>, Linus Torvalds <torvalds@...ux-foundation.org>, 
	Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org, 
	Peter Zijlstra <peterz@...radead.org>, Thomas Gleixner <tglx@...utronix.de>, 
	Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider <vschneid@...hat.com>
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Sun, 14 Jan 2024 at 20:58, Qais Yousef <qyousef@...alina.io> wrote:
>
> On 01/14/24 16:20, Vincent Guittot wrote:
> > On Sun, 14 Jan 2024 at 16:12, Qais Yousef <qyousef@...alina.io> wrote:
> > >
> > > On 01/14/24 14:03, Vincent Guittot wrote:
> > >
> > > > Thanks for the trace. It was really helpful and I think that I got the
> > > > root cause.
> > > >
> > > > The problem comes from get_capacity_ref_freq() which returns current
> > > > freq when arch_scale_freq_invariant() is not enable, and the fact that
> > > > we apply map_util_perf() earlier in the path now which is then capped
> > > > by max capacity.
> > > >
> > > > Could you try the below ?
> > > >
> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > index e420e2ee1a10..611c621543f4 100644
> > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
> > > > cpufreq_policy *policy)
> > > >         if (arch_scale_freq_invariant())
> > > >                 return policy->cpuinfo.max_freq;
> > > >
> > > > -       return policy->cur;
> > > > +       return policy->cur + policy->cur >> 2;
> > > >  }
> > > >
> > > >  /**
> > >
> > > Is this a test patch or a proper fix? I can't see it being the latter. It seems
> >
> > It's a proper fix. It's the same mechanism that is used already :
> >  - Either you add margin on the utilization to go above current freq
> > before it is fully used. This si what was done previously
> >  - or you add margin on the freq range to select a higher freq than
> > current one before it become fully used
>
> Aren't we applying the 25% headroom twice then?

In some cases, yes. But whereas the 1st one is generic and can be
filtered with uclamp, the 2nd one is there to enable platform without
frequency invariance to jump to the next OPP

>
> >
> > > the current logic fails when util is already 1024, and I think we're trying to
> > > fix the invariance issue too late.
> > >
> > > Is the problem that we can't read policy->cur in the scheduler to fix the util
> > > while it's being updated that's why it's done here in this case?
> > >
> > > If this is the problem, shouldn't the logic be if util is max then always go to
> > > max frequency? I don't think we have enough info to correct the invariance here
> > > IIUC. All we can see the system is saturated at this frequency and whether
> > > a small jump or a big jump is required is hard to tell.
> > >
> > > Something like this
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 95c3c097083e..473d0352030b 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -164,8 +164,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> > >         struct cpufreq_policy *policy = sg_policy->policy;
> > >         unsigned int freq;
> > >
> > > -       freq = get_capacity_ref_freq(policy);
> > > -       freq = map_util_freq(util, freq, max);
> > > +       if (util != max) {
> > > +               freq = get_capacity_ref_freq(policy);
> > > +               freq = map_util_freq(util, freq, max);
> > > +       } else {
> > > +               freq = policy->cpuinfo.max_freq;
> > > +       }
> >
> > This is not correct because you will have to wait to reach full
> > utilization at the current OPP possibly the lowest OPP before moving
> > directly to max OPP
>
> Isn't this already the case? The ratio (util+headroom/max) will be less than
> 1 until util is 80% (with 25% headroom). And for all values <= 80% * max, we
> will request a frequency smaller than/equal policy->cur, no?

In your proposal, we stay at current OPP until util == 1024 because
util can't be higher than max. And then we switch directly to max_freq

In the previous behavior and with the fix, we can go to the next OPP
because either util can be higher than max (previous behavior) or the
selected freq can be higher than current (fix behavior) and we don't
have to want for util == max

>
> ie:
>
>         util = 600
>         max = 1024
>
>         freq = 1.25 * 600 * policy->cur / 1024 = 0.73 * policy->cur
>
> (util+headroom/max) must be greater than 1 for us to start going above
> policy->cur - which seems to have been working by accident IIUC.

No this is not by accident

>
> So yes my proposal is incorrect, but it seems the conversion is not right to me
> now.
>
> I could reproduce the problem now (thanks Wyes!). I have 3 freqs on my system
>
> 2.2GHz, 2.8GHz and 3.8GHz
>
> which (I believe) translates into capacities
>
> ~592, ~754, 1024
>
> which means we should pick 2.8GHz as soon as util * 1.25 > 592; which
> translates into util = ~473.

Keep in mind that util is the utilization at current OPP not at full
scale because we don't have frequency invariance.

>
> But what I see is that we go to 2.8GHz when we jump from 650 to 680 (see
> attached picture), which is what you'd expect since we apply two headrooms now,
> which means the ratio (util+headroom/max) will be greater than 1 after go above
> this value
>
>         1024 * 0.8 * 0.8 = ~655
>
> So I think the math makes sense logically, but we're missing some other
> correction factor.
>
> When I re-enable CPPC I see for the same test that we go into 3.8GHz straight
> away. My test is simple busyloop via
>
>         cat /dev/zero > /dev/null
>
> I see the CPU util_avg is at 523 at fork. I expected us to run to 2.8GHz here
> to be honest, but I am not sure if util_cfs_boost() and util_est() are maybe
> causing us to be slightly above 523 and that's why we start with max freq.
>
> Or I've done the math wrong :-) But the two don't behave the same for the same
> kernel with and without CPPC.

They will never behave the same because they can't
- with invariance, the utilization is the utilization at max capacity
so we can easily jump several OPP to go directly to the right one
- without invariance, the utilization is the utilization at current
OPP so we can only jump to a limited number of OPP

>
> >
> > >
> > >         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > >                 return sg_policy->next_freq;

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ