Date: Wed, 29 May 2024 11:23:37 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Kajetan Puchalski <kajetan.puchalski@....com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>, rafael@...nel.org,
	daniel.lezcano@...aro.org, lukasz.luba@....com,
	Dietmar.Eggemann@....com, dsmythies@...us.net,
	yu.chen.surf@...il.com, linux-pm@...r.kernel.org,
	linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
	Ulf Hansson <ulf.hansson@...aro.org>
Subject: Re: [PATCH v6 2/2] cpuidle: teo: Introduce util-awareness

On 05/28/24 13:12, Kajetan Puchalski wrote:
> Hi Vincent,
> 
> On Tue, May 28, 2024 at 11:29:02AM +0200, Vincent Guittot wrote:
> > Hi All,
> > 
> > I'm quite late on this thread, but this patchset creates a major
> > regression for the psci cpuidle driver when using OSI mode (OS
> > initiated mode). In that case the cpuidle driver handles only the CPU
> > power states, while the deeper C-states, which include the cluster and
> > other power domains, are handled by the power domain framework. In such
> > a configuration cpuidle has only 2 C-states, WFI and CPU off, and the
> > other states covering the clusters are managed by genpd and its
> > governor.
> > 
> > This patch selects cpuidle C-state N-1 as soon as the utilization is
> > above CPU capacity / 64, which means a level of at most 16 on the big
> > cores but as low as 4 on the little cores. These levels are very low,
> > and the main result is that as soon as there is even very little
> > activity on a CPU, cpuidle always selects WFI regardless of the
> > estimated sleep duration, which prevents any deeper state. Another
> > effect is that it keeps the tick firing every 1ms in my case.
> > 
> > IMO, we should at least increase the utilization level
> 
> I think you're most likely right on this. The reason I ended up
> leaving the threshold at cap/64 was that at cap/32 it would be too high
> for the mechanism to have any effect on the device I was testing
> this on.
> 
> The issue then of course is that if you tailor the threshold to little
> cores it becomes too high for big cores, conversely if you tailor it to
> big cores it becomes too low for small ones.
> 
> We could get around this by making sure the threshold doesn't end up
> being lower than a certain capacity-independent minimum. How does that
> sound?
> 
> cpu_data->util_threshold = max(MIN_THRESHOLD, max_capacity >> UTIL_THRESHOLD_SHIFT);
> 
> And we set MIN_THRESHOLD to something like 10, 15 or 20. Not sure which
> one would be more appropriate but I think this could help alleviate some
> issues with the mechanism being too aggressive.

I'm afraid this is not a solution. The whole threshold logic doesn't work,
and there's no magic number we can come up with to make it work. The whole
approach needs rethinking.
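
To put rough numbers on it (taking 1024 and 256 as example big/little
capacities, with the patch's shift of 6):

	big:    1024 >> 6 = 16
	little:  256 >> 6 =  4

With a MIN_THRESHOLD of, say, 15, the little threshold jumps to 15 while the
big one stays at 16, so both ends are still hardcoded guesses rather than
something derived from the workload or the idle states' residencies.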

> > 
> > Regards,
> > Vincent
> > 
> > On Sun, 17 Sept 2023 at 03:05, Qais Yousef <qyousef@...alina.io> wrote:
> > >
> > > Hi Kajetan
> > >
> > > On 07/18/23 14:24, Qais Yousef wrote:
> > >
> > > > These patches are in GKI. So we'll see if there are uncaught problems I guess :)
> > > >
> > > > No appetite for a knob, but the very low value for littles did strike me and I
> > > > thought I'd better ask at least. Today's littles are too tiny for their own good,
> > > > and it seemed the threshold could end up being too aggressive, especially in low
> > > > activity states. You're effectively saying that if we have a few 100us of
> > > > activity, normal TEO predictions based on timers are no good and it's better to
> > > > stay shallower anyway.
> > > >
> > > > Note that due to NOHZ, if we go idle for an extended period the util value
> > > > might not decay for a while, and we'd miss some opportunities. Especially since,
> > > > when the CPU next wakes up, it's enough for that wakeup to run for a few 100s of
> > > > us to block a deeper state before going back to sleep for an extended period.
> > > >
> > > > But we shall see. I got the answer I was looking for for now.
> > >
> > > Unfortunately, not too long after the patches got merged I got a regression
> > > report of worse power. As you know, on Android things are not the same as
> > > mainline, so I need to untangle this to make sure it's not a red herring. But
> > > if you want to take my word for it, I think the chances of it being a true
> > > regression are high. I had to introduce knobs to allow controlling the
> > > thresholds for now; the good news is they do help, so it's not a total revert.
> > > I don't have a lot of info to share, but it's the low activity use cases that
> > > seem to have been impacted, like video playback for instance.
> > >
> > > Generally, I'm trying to remove some hardcoded values from the scheduler that
> > > enforce a behavior that is not universally desired on all systems/workloads.
> > > And I think the way the util awareness thresholds are done today falls into
> > > the same category.
> > >
> > > As I tried to highlight before, it is easy to trick the threshold with a task
> > > that runs for a short time then goes back to sleep for a long time.
> > >
> > > And when the system runs full throttle for a while, it'll take around 150+ms
> > > for the util to decay to the threshold value. That's a long time to block
> > > entering deeper idle states for. I'm not sure how much worse NOHZ and blocked
> > > average updates could potentially make this.
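
(Back-of-envelope, assuming the usual PELT half-life of ~32ms: decaying a
fully-busy util of ~1024 down to a threshold of 16 needs about 6 halvings,
1024 -> 512 -> 256 -> 128 -> 64 -> 32 -> 16, i.e. roughly 6 * 32ms ~= 190ms
of near-idle time before the deeper states become selectable again.)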
> > >
> > > In my view, the absolute comparison against util can be misleading. Even when
> > > util is 512, for example, we still have 50% idle time. How that time is
> > > distributed can't be known from util alone. It could be one task waking up and
> > > sleeping, or multiple tasks in many combinations of patterns, all leading to
> > > the same outcome of CPU util being 512.
> > >
> > > IIUC the idea is that if we have even a small amount of activity, then erring
> > > on the shallow side is better. But given that target-residency is usually in
> > > the few-ms range, do we really need to be that cautious? With a target-residency
> > > of 3ms for example, even at a util of 900 there can be opportunities to enter it.
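
(Rough numbers: a util of 900 out of 1024 is ~88% busy, i.e. ~12% idle. Over a
100ms window that's ~12ms of idle time, which could just as well arrive as four
3ms gaps, where the deeper state pays off, as a hundred 120us ones, where it
doesn't. The util value alone can't tell those apart.)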
> > >
> > > Can't we instead sample util at entry to the idle loop and see if it is on a
> > > rising or falling trend? When it is rising, it makes sense to say there's
> > > demand, so let's block the deeper idle states. But if it is falling, and the
> > > decay time is longer than the target-residency, can't we say it's okay to
> > > permit the deeper idle states?
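
To make that a bit more concrete, here's a purely illustrative sketch of the
kind of check I have in mind (plain C, not actual teo/kernel code; all names
are made up):

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: sample util at idle entry, compare with the previous
 * sample to decide whether demand is rising or falling, and only permit a
 * deep state when util is falling and the expected decay time outlasts the
 * state's target residency.
 */
struct util_trend {
	uint64_t prev_util;	/* util seen at the previous idle entry */
};

bool allow_deep_state(struct util_trend *t, uint64_t util,
		      uint64_t decay_time_ns,
		      uint64_t target_residency_ns)
{
	bool rising = util > t->prev_util;

	t->prev_util = util;

	if (rising)
		return false;	/* demand building up: stay shallow */

	/* falling: deep state is worth it only if the decay outlasts
	 * the candidate state's target residency */
	return decay_time_ns > target_residency_ns;
}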
> > >
> > > I need to think more about this, but I think it's worth trying to make these
> > > thresholds more deterministic and quantifiable. There are too many workloads
> > > and system variations. I'm not sure a knob to control these thresholds is good
> > > for anything beyond a workaround like the one I had to do. These hardcoded
> > > values can be improved IMHO. Happy to help find alternatives.
> > >
> > >
> > > Cheers
> > >
> > > --
> > > Qais Yousef
