linux-kernel - Re: [GIT PULL] sched_ext: Initial pull request for v6.11

Message-ID: <20240801155233.navvwritakzabylg@airbuntu>
Date: Thu, 1 Aug 2024 16:52:33 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Russell Haley <yumpusamongus@...il.com>
Cc: ast@...nel.org, linux-kernel@...r.kernel.org, mingo@...hat.com,
	peterz@...radead.org, tj@...nel.org, torvalds@...ux-foundation.org,
	vincent.guittot@...aro.org, void@...ifault.com
Subject: Re: [GIT PULL] sched_ext: Initial pull request for v6.11

On 07/31/24 21:50, Russell Haley wrote:
> > We really shouldn't change how schedutil works. The governor is supposed to
> > behave in a certain way, and we need to ensure consistency. I think you should
> > look on how you make your scheduler compatible with it. Adding hooks to say
> > apply this perf value that I want is a recipe for randomness.
> 
> If schedutil's behavior is perfect as-is, then why does cpu.uclamp.max
> not work with values between 81-100%, which is the part of the CPU
> frequency range where one pays the least in performance per Joule saved?

I think you're referring to this problem

	https://lore.kernel.org/lkml/20230820210640.585311-1-qyousef@layalina.io/

which led to Vincent's schedutil rework of how estimation and constraints are
applied

	9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")

Do you still see the problem with this commit applied?

> Why does cpu.uclamp.min have to be set all the way up and down the
> cgroup hierarchy, from root to leaf, to actually affect frequency

I'm not aware of this problem.

This has nothing to do with schedutil. It's probably a bug in uclamp
aggregation.

Which kernel version are you on? Is this reproducible on mainline?

> selection? Why is sugov notorious for harming video encoding
> performance[1], which is a CPU-saturating workload? Why do intel_pstate

Without analysing the workload on that particular system, it's hard to know.

But I am aware of issues with rate_limit_us being high. I have already proposed
an enhancement and am pursuing further improvements [1]. rate_limit_us defaulted
to 10ms on many systems, which gives a slow reaction time for workloads that
cause the utilization signal to go up and down.
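
To check what a given system is using, the tunable can be read from sysfs.
A minimal sketch, assuming the usual layout where the tunable lives either
per policy (policyN/schedutil/rate_limit_us) or in one global schedutil/
directory; the helper name and the sysfs_root parameter are made up for
illustration:

```python
from pathlib import Path


def read_rate_limits(sysfs_root="/sys/devices/system/cpu/cpufreq"):
    """Return {name: rate_limit_us} for every schedutil tunable found.

    Policies running another governor have no schedutil/ directory and
    are skipped. A single system-wide tunable is reported as "global".
    """
    root = Path(sysfs_root)
    limits = {}

    # Per-policy tunables: policy0/schedutil/rate_limit_us, ...
    for tunable in sorted(root.glob("policy*/schedutil/rate_limit_us")):
        limits[tunable.parent.parent.name] = int(tunable.read_text().strip())

    # Some drivers expose one set of tunables for all policies instead.
    global_tunable = root / "schedutil" / "rate_limit_us"
    if global_tunable.is_file():
        limits["global"] = int(global_tunable.read_text().strip())

    return limits
```

Calling read_rate_limits() with no arguments reads the live system; a value
of 10000 corresponds to the 10ms mentioned above.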

If the workload is cpu-saturated with no idle time at all then we should run at
max frequency except for some initial rampup delay. If that's not the case it'd
be good to know. But we have no info to tell. My suspicion is that it's bursty
and goes up and down.
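
For a rough sense of that rampup delay: PELT is a geometric series with a
32ms half-life, so the util of a task that runs continuously approaches the
1024 maximum as 1024 * (1 - 0.5^(t/32)). A back-of-the-envelope sketch of
that closed form, ignoring the 1024us period granularity, util_est and
frequency/capacity invariance scaling; the function names are illustrative
only:

```python
import math

PELT_HALFLIFE_MS = 32.0       # PELT decay factor y satisfies y^32 = 0.5
SCHED_CAPACITY_SCALE = 1024   # util of a CPU-saturating task on the biggest CPU


def util_after(ms_running, start_util=0.0):
    """Approximate util after running continuously for ms_running ms."""
    decay = 0.5 ** (ms_running / PELT_HALFLIFE_MS)
    return SCHED_CAPACITY_SCALE - (SCHED_CAPACITY_SCALE - start_util) * decay


def ms_to_reach(target_util, start_util=0.0):
    """Approximate continuous runtime needed to ramp up to target_util."""
    if not start_util <= target_util < SCHED_CAPACITY_SCALE:
        raise ValueError("target must be >= start_util and below 1024")
    remaining_before = SCHED_CAPACITY_SCALE - start_util
    remaining_after = SCHED_CAPACITY_SCALE - target_util
    return PELT_HALFLIFE_MS * math.log2(remaining_before / remaining_after)
```

Under these simplifications util_after(32) is 512, and ms_to_reach(819.2),
i.e. 80% of capacity, comes out a little over 74ms. A workload with idle
gaps decays in between and keeps paying part of that ramp again.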

I will give a talk at LPC about issues with how the util signal ramps up, and I
have posted some patches to discuss overall response time issues in general [2].
I am also trying to consolidate how cpufreq updates are driven by the scheduler
to better reflect the state of the CPU [3] and hopefully pave the way for better
handling of perf constraints like uclamp and iowait boost (if we ever move to a
per-task iowait boost) [4].

There are a lot of teething issues to consider, though. So yeah, it's not
perfect and needs lots of improvements, and there's an effort to improve it. I'm
interested to learn about your problems, so please feel free to start a separate
thread on the problems you have. We can work to address them at least.

[1] https://lore.kernel.org/lkml/20240728192659.58115-1-qyousef@layalina.io/
[2] https://lore.kernel.org/lkml/20231208002342.367117-1-qyousef@layalina.io/
[3] https://lore.kernel.org/lkml/20240728184551.42133-1-qyousef@layalina.io/
[4] https://lore.kernel.org/lkml/20231208015242.385103-1-qyousef@layalina.io/

> and amd-pstate both bypass it on modern hardware?

I did try to raise it in the past but no one has carried out the work.
Volunteers are welcome. It's on my todo, but way down the list.

> 
> It appears that without Android's very deeply integrated userspace
> uclamp controls telling sugov what to do, its native behavior is less

Nothing is deeply integrated about a syscall to change uclamp. But there are
more folks in that world interested in creating libraries and pushing to make it
work. I sadly don't see similar interest for desktop and servers, except for
Asahi Linux in one of their blog posts, and I'm not sure if they plan something
more sophisticated than what they did.

Generally we don't create these userspace libraries, and there have been
several suggestions for us to start a library to help kick off that effort. But
none of us have the time. If there are folks interested in starting this work,
I'd be happy to provide guidance.
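
For anyone wanting to experiment, the syscall in question is sched_setattr()
with the util clamp flags set. A minimal sketch of what such a library would
wrap, using ctypes since there is no libc wrapper. The struct layout and flag
values follow include/uapi/linux/sched.h; the syscall number table only covers
x86_64 and aarch64 (an assumption of this sketch), and the helper names are
hypothetical:

```python
import ctypes
import os
import platform

# Flag values from include/uapi/linux/sched.h
SCHED_FLAG_KEEP_POLICY = 0x08
SCHED_FLAG_KEEP_PARAMS = 0x10
SCHED_FLAG_UTIL_CLAMP_MIN = 0x20
SCHED_FLAG_UTIL_CLAMP_MAX = 0x40

# sched_setattr syscall numbers; other architectures are not covered here.
NR_SCHED_SETATTR = {"x86_64": 314, "aarch64": 274}


class SchedAttr(ctypes.Structure):
    """struct sched_attr including the two uclamp fields (56 bytes)."""
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),
        ("sched_deadline", ctypes.c_uint64),
        ("sched_period", ctypes.c_uint64),
        ("sched_util_min", ctypes.c_uint32),
        ("sched_util_max", ctypes.c_uint32),
    ]


def build_uclamp_attr(util_min=None, util_max=None):
    """Build an attr that changes only the requested clamps (0..1024)."""
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    # Leave the scheduling policy and its parameters untouched.
    attr.sched_flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS
    for value, flag, field in (
        (util_min, SCHED_FLAG_UTIL_CLAMP_MIN, "sched_util_min"),
        (util_max, SCHED_FLAG_UTIL_CLAMP_MAX, "sched_util_max"),
    ):
        if value is None:
            continue
        if not 0 <= value <= 1024:
            raise ValueError("uclamp values must be within 0..1024")
        attr.sched_flags |= flag
        setattr(attr, field, value)
    return attr


def set_uclamp(pid, util_min=None, util_max=None):
    """Apply the clamps to pid (0 means the calling thread)."""
    nr = NR_SCHED_SETATTR[platform.machine()]
    attr = build_uclamp_attr(util_min, util_max)
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(ctypes.c_long(nr), ctypes.c_int(pid),
                       ctypes.byref(attr), ctypes.c_uint(0))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

For example, set_uclamp(0, util_max=512) would ask for the calling thread to
be capped at half of the biggest CPU's capacity. It raises OSError if the
kernel rejects the request, for instance when uclamp support is not built in.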

> than awe-inspiring. Furthermore, uclamp doesn't work especially well on
> systems that violate the big.LITTLE assumption that only clamping << max

I'm not sure what you mean here, but the power curve is different for every
system.

I'd expect users to try to set uclamp_max as low as possible without impacting
their perf. But there's a known problem with uclamp_max due to aggregation that
affects its effectiveness and has been a topic of discussion at OSPM and LPC
several times. My proposal in [4] should address it, but I have to fix more
concerns/problems first. There are proposals for different approaches floating
around too.

> saves meaningful energy[2]. Non-Android users widely scorn sugov when
> they become aware of it. Web forums are full of suggestions to switch to
> perfgov, or to switch to "conservative" or disable turbo for those who
> want efficiency.

It'd be great if people come forward with their problems and help with
analysing them instead.

> 
> That said, given how long the PELT time constant is, a bpf scheduler
> that wanted to override sugov could probably cooperate with a userspace
> daemon to set min and max uclamps to the same value to control frequency
> selection without too much overhead, as long as it doesn't mind the
> 81-100% hole.

Yes. They could do that. I'm surprised no one has done a generic uclamp daemon
to auto tune certain apps to improve their perf and efficiency. I started some
effort in the past, but sadly dropped it as I didn't have the time.
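
As a starting point, the core of such a daemon could be as small as writing
the same value to a cgroup's cpu.uclamp.min and cpu.uclamp.max. A minimal
sketch, assuming cgroup v2 with the cpu controller enabled for the group; the
function names are hypothetical, and deciding which value to use for which
app is the real work and is left out:

```python
from pathlib import Path


def format_uclamp(percent):
    """Format a clamp the way the cgroup files take it: a percentage
    with two decimals, e.g. "50.00"."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("uclamp percentage must be within 0..100")
    return f"{percent:.2f}"


def pin_cgroup_uclamp(cgroup_dir, percent):
    """Set min and max clamps of a cgroup to the same percentage.

    With both clamps equal, the utilization the group's runnable tasks
    contribute to frequency selection is held at that value.
    """
    value = format_uclamp(percent)
    cgroup = Path(cgroup_dir)
    (cgroup / "cpu.uclamp.max").write_text(value)
    (cgroup / "cpu.uclamp.min").write_text(value)
    return value
```

Something like pin_cgroup_uclamp("/sys/fs/cgroup/myapp", 60) would be the
whole per-app action, with "myapp" standing in for a real group. What other
tasks on the same CPU request still feeds into the final frequency.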

I generally think the approach of adding more QoS controls in the kernel and
then using BPF to create app- or system-specific tuners to set them is a more
viable option in the long run. I think BPF could hook into user apps to get cues
about the bits they struggle with and need help auto-tuning, i.e. a certain
operation that requires boosting or causes an unnecessary freq spike.

> 
> [1] https://www.phoronix.com/review/schedutil-quirky-2023
> 
> [2] Does that still hold on high-end Android devices with one or two
> hot-rodded prime cores?
> 
> Thanks,
> 
> --
> Russell Haley