Date: Mon, 13 May 2024 10:49:15 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Qais Yousef <qyousef@...alina.io>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Viresh Kumar <viresh.kumar@...aro.org>, 
	Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall <bsegall@...gle.com>, 
	Mel Gorman <mgorman@...e.de>, Daniel Bristot de Oliveira <bristot@...hat.com>, 
	Valentin Schneider <vschneid@...hat.com>, Christian Loehle <christian.loehle@....com>, linux-pm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched: Consolidate cpufreq updates

On Thu, 9 May 2024 at 14:40, Qais Yousef <qyousef@...alina.io> wrote:
>
> On 05/07/24 14:53, Vincent Guittot wrote:
> > On Tue, 7 May 2024 at 13:08, Qais Yousef <qyousef@...alina.io> wrote:
> > >
> > > On 05/07/24 10:58, Vincent Guittot wrote:
> > > > On Mon, 6 May 2024 at 01:31, Qais Yousef <qyousef@...alina.io> wrote:
> > > > >
> > > > > Improve the interaction with cpufreq governors by making the
> > > > > cpufreq_update_util() calls more intentional.
> > > > >
> > > > > At the moment we send them when load is updated for CFS, bandwidth for
> > > > > DL, and at enqueue/dequeue for RT. But this can lead to too many updates
> > > > > being sent in a short period of time, which can then be ignored at a
> > > > > critical moment due to the rate_limit_us in schedutil.
> > > > >
> > > > > For example, consider two tasks enqueued on the same CPU at nearly the
> > > > > same time, where the 2nd task is bigger and requires a higher freq. The
> > > > > cpufreq_update_util() call triggered by the first task will lead to the
> > > > > 2nd request being dropped until the tick, or until another CPU in the
> > > > > same policy triggers a freq update shortly after.
> > > > >
> > > > > Updates at enqueue for RT are not strictly required, though they do help
> > > > > to reduce the delay in switching the frequency and the potential
> > > > > observation of a lower frequency during this delay. But the current logic
> > > > > doesn't intentionally (at least to my understanding) try to speed up the
> > > > > request.
> > > > >
> > > > > To help reduce the number of cpufreq updates and make them more
> > > > > purposeful, consolidate them into these locations:
> > > > >
> > > > > 1. context_switch()
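For illustration of the dropping behaviour described in the changelog above, a
simplified sketch of a schedutil-style rate limit (paraphrased for this
discussion, with made-up names; not the exact kernel code):

	/*
	 * A request arriving within rate_limit_ns of the last accepted
	 * frequency change is ignored. This is how a 2nd, larger request
	 * that follows closely behind a smaller one can be dropped until
	 * the next update point.
	 */
	static bool should_update_freq(u64 now_ns, u64 last_update_ns,
				       u64 rate_limit_ns)
	{
		return (s64)(now_ns - last_update_ns) >= (s64)rate_limit_ns;
	}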
> > > >
> > > > I don't see any cpufreq update when switching from idle to CFS.
> > >
> > > You mean SCHED_IDLE to SCHED_NORMAL, right? Yes, if we switch policies, even
> > > from fair to RT, an update could be missed.
> >
> > No, I mean going out of idle. On an idle CPU, nothing happens at CFS
> > task wakeup and we have to wait for the next tick to apply the new
> > freq. This happens both for a short task with a uclamp min and for a
> > long running/sleeping task (i.e. with high util_est).
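(For illustration only: one hypothetical way to cover this case would be to
poke the governor when the first CFS task lands on an idle CPU. A rough
kernel-style sketch, not something that exists in the patch:)

	/*
	 * Hypothetical sketch: issue a cpufreq update when the first CFS
	 * task is enqueued on a CPU that was idle, so the woken task's
	 * uclamp_min / util_est can take effect before the next tick.
	 */
	static inline void idle_wakeup_freq_kick(struct rq *rq)
	{
		if (rq->cfs.h_nr_running == 1)
			cpufreq_update_util(rq, 0);
	}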
>
> And without my patch you see a freq change?

Yes, the behavior is correct without your patch.

> If no stats were updated to cause a decay, we will skip the cpufreq
> update at this context switch.
>
> I'll audit the code again in case I missed a place where there's a decay. You
> could be hitting a race condition with update_blocked_avg() sending a cpufreq
> update, and this could cause the context switch cpufreq update to be dropped by
> the rate limit. You could try to reduce your rate_limit_us to see if this helps.

I already checked this. There is no update happening before the
wakeup. In fact I thought it was linked to a 1000 Hz tick and the CPU
not stopping its tick, but the problem is still there even after
removing such potential causes.

>
> I'll try to reproduce and investigate. FWIW, I did test this on an M1 mac mini
> and a pixel device running speedometer and some iowait workloads and didn't
> observe any problems.

I'm using a dragonboard RB5 and I have a simple use case with a task
waking up periodically, around every 100ms, to run for 40ms. As usual I'm
not using a strict 100ms period, to stay out of sync with the tick and
cover more cases.
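In case it helps with reproduction, a minimal self-contained sketch of that
pattern (the values are approximate and purely illustrative):

	/* Run for ~40ms, then sleep so the period is close to, but not
	 * exactly, 100ms. */
	#define _POSIX_C_SOURCE 199309L
	#include <time.h>

	static void busy_for_ms(long ms)
	{
		struct timespec start, now;

		clock_gettime(CLOCK_MONOTONIC, &start);
		do {
			clock_gettime(CLOCK_MONOTONIC, &now);
		} while ((now.tv_sec - start.tv_sec) * 1000L +
			 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
	}

	int main(void)
	{
		struct timespec pause = { 0, 57 * 1000000L };	/* ~57ms sleep */

		for (;;) {
			busy_for_ms(40);		/* ~40ms of work */
			nanosleep(&pause, NULL);	/* ~97ms total period */
		}
	}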

>
> >
> > >
> > > I'll need to think more about it, but I think adding an update when we switch
> > > policies in the syscall looks sufficient to me, if the task is already on the
> > > rq. Agreed?
> > >
> > > > We have to wait for the next tick to get a freq update, whatever the
> > > > value of util_est and uclamp.
> > > >
> > > > > 2. task_tick_fair()
> > > >
> > > > Updating only during the tick is OK with a 1000 Hz / 1000us tick when we
> > > > compare it with the 1024us slice of PELT, but what about a 4ms or even
> > > > 10ms tick? We can have an increase of almost 200 in 10ms.
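As a rough back-of-the-envelope check of that figure: with PELT's 32ms
half-life, the utilization of a task running from zero grows roughly as
1024 * (1 - 0.5^(t/32ms)), i.e. about 22 after 1ms, about 85 after 4ms and
about 200 after 10ms.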
> > >
> > > IMHO the current code can still fail to update frequencies in time with these
> > > setups. If there's a single task on the rq, then the only freq update will
> > > happen at the tick. So this is an existing problem.
> >
> > But any newly enqueued task can trigger a freq update without waiting
> > 1/4/10ms, whereas we need to wait for the next tick with this patch.
>
> But it is racy. By deferring the decision (sampling point) we better ensure
> that the RUNNING task reflects the current state of the rq, taking into account
> any past events.
>
> Note that if there's no enqueue/dequeue, the problem is not fixed either way.
> If we get two consecutive enqueues, the 2nd one will be dropped, and we'll end
> up with delays.
>
> I think this way we'd just be failing or working more consistently.
>
> Systems with a high rate_limit_us or TICK should generally ask for generous
> headroom, and I think this is the best way to address this issue in a scalable
> way. People with fast systems/configurations can be more exact and frequent in
> their requests. Systems/configurations that are slow will tend to exaggerate
> each request to cater for the slow response. But the requests themselves are
> made at better-defined points in time. That's my hope at least, so I appreciate
> the reviews :)
>
> My other hope is that by doing the sampling at context switch we can better
> handle uclamp and iowait boost requests, which require special cpufreq
> constraints to be applied specifically for the RUNNING task, ultimately leading
> to removing the uclamp max() aggregation at enqueue/dequeue.
>
> >
> > >
> > > The way I see it, setting such high TICK values implies low responsiveness
> > > by definition. So whoever selects this setup needs to accept that this is
> > > their worst case scenario and be happy with it. And this worst case scenario
> > > does not change.
> > >
> > > That said, the right way to cater for this is via my other series to remove
> > > the magic margins. The DVFS headroom should rely on the TICK value to ensure
> > > we run at an adequate frequency until the next worst-case update, which
> > > itself depends on TICK. That is sufficient to handle util_est changes. See
> > > below for uclamp.
> > >
> > > Wake-up preemption should cause context switches to happen sooner than a
> > > tick too, as we add more tasks to the rq. So I think the worst case scenario
> > > is not really changing that much. In my view, it's better to be consistent
> > > about the behavior.
> > >
> > > >
> > > > > 3. {attach, detach}_entity_load_avg()
> > > >
> > > > At enqueue/dequeue, the util_est will be updated and can make the CPU
> > > > utilization quite different, especially with long sleeping tasks. The
> > > > same applies to the uclamp_min/max hints of a newly enqueued task. We
> > > > might end up waiting 4/10ms depending on the tick period.
> > >
> > > uclamp_min is a property of the task. And waiting for the task that needs the
> > > boost to run is fine IMHO. And I am actually hoping to remove uclamp max()
> >
> > But you will delay all CPU work and the running time of the task.
>
> uclamp_min shouldn't cause other tasks to run faster. From my perspective this
> is actually wasting power, as what we want is only for this task to run faster.
>
> If the task wants better wake up latency (which I assume is what you're
> referring to by making things run faster, i.e. this task should get to RUNNING
> faster), then we need to annotate this separately IMHO. We shouldn't rely on
> the boost, which is there to ensure this task gets work done at an acceptable
> rate, not to impact its wake up latency.
>
> >
> > And what about util_est ?
>
> Hmm, yeah, this won't cause cfs_rq.decayed to trigger, so we could miss it at
> context switch. That could be what's causing the issue you're seeing above.
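(For illustration, one hypothetical way to cover that case would be to remember
at enqueue that util_est/uclamp changed and let the consolidated update at
context switch pick it up. A rough sketch with a made-up field name, not
something from the patch:)

	/*
	 * Hypothetical sketch only: freq_update_pending is an invented
	 * field. Set it where util_est / uclamp hints change at enqueue
	 * even though the PELT signals did not decay ...
	 */
	static inline void note_pending_freq_update(struct rq *rq)
	{
		rq->cfs.freq_update_pending = 1;
	}

	/*
	 * ... and let the consolidated context-switch path check it
	 * alongside the cfs_rq.decayed flag mentioned above:
	 */
	static inline bool need_freq_update_at_switch(struct rq *rq)
	{
		return rq->cfs.decayed || rq->cfs.freq_update_pending;
	}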
