linux-kernel - Re: Scheduling for heterogeneous computers


Message-ID: <20220321121611.ssa7o2npy3ahdofk@wubuntu>
Date:   Mon, 21 Mar 2022 12:16:11 +0000
From:   Qais Yousef <qais.yousef@....com>
To:     Paul Bone <pbone@...illa.com>
Cc:     linux-kernel@...r.kernel.org
Subject: Re: Scheduling for heterogeneous computers

Hi Paul

On 03/08/22 20:21, Paul Bone wrote:
> 
> Are there plans for power-aware scheduling on heterogeneous computers that
> processes & threads can opt-in to?
> 
> Several mainstream devices now offer power-aware heterogeneous scheduling:
> 
>  * Lots of ARM (and therefore android) devices offer big.LITTLE cores.
>  * Apple's M1 CPU has "gold" and "silver" cores.  The gold cores are faster
>    and have more cache.  I think there are other microarchitectural
>    differences.
>  * Intel's Alder Lake CPUs have P and E cores.  I'm told that the E cores
>    don't save power though since each core type still gets the same work
>    done per Watt, it's just that the P cores are bigger and faster.
>  * Multicore CPUs that offer frequency scaling could get some power savings
>    by switching off turbo boost and similar features.  The work/watt
>    improves at the cost of throughput & responsiveness.
> 
> I'm aware that Linux does some Energy Aware Scheduling
> https://docs.kernel.org/scheduler/sched-energy.html, however what I'm
> looking for is an API that processes (but ideally threads) can opt into
> (and out of, unlike nice) to say that the work they're currently doing is
> bulk work.  It needs to get done but it doesn't have a deadline and
> therefore can be done on a smaller / more power efficient core.  The idea is
> that the same work gets done eventually, but for a background task (eg
> Garbage Collection) it can be done in a greener or more
> battery-charge-extending way.
> 
> MacOS has added an API for this as:
>     pthread_set_qos_class_self_np()
>     https://developer.apple.com/documentation/apple-silicon/tuning-your-code-s-performance-for-apple-silicon?preferredLanguage=occ
> 
> Windows has:
>     ThreadPowerThrottling
>     https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-setthreadinformation
> 
> I'm not aware of anything for Linux and I've been unable to find anything.
> Are there any plans to implement this?  

We do actually have a feature called util clamp (uclamp for short) that allows
you to do that.

There are new fields in the struct sched_attr passed to sched_setattr()
(sched_util_min and sched_util_max) to set UCLAMP_MIN and UCLAMP_MAX.

UCLAMP_MIN hints towards performance, i.e. it tells the system that this task
needs at least this performance level. The scheduler translates this into task
placement and frequency selection when the task is running.

UCLAMP_MAX hints towards efficiency, i.e. it tells the system that this task
does not need to operate above this performance level. Like UCLAMP_MIN, this
affects task placement and frequency selection when the task is running.
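A minimal sketch of how a thread could mark its own work as "bulk" by
capping UCLAMP_MAX through sched_setattr(). Assumptions: the struct below
is a local mirror of the kernel's struct sched_attr under a hypothetical
name (uclamp_attr) so it cannot clash with libc or kernel headers, the
UCLAMP_FLAG_* macros are local names carrying the SCHED_FLAG_* values from
include/uapi/linux/sched.h, and the helper names are made up for
illustration. Clamp values are on the scheduler's 0..1024 capacity scale.
The call needs a kernel built with uclamp support and may fail otherwise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local names for SCHED_FLAG_KEEP_POLICY, SCHED_FLAG_KEEP_PARAMS,
 * SCHED_FLAG_UTIL_CLAMP_MIN and SCHED_FLAG_UTIL_CLAMP_MAX. */
#define UCLAMP_FLAG_KEEP_POLICY 0x08
#define UCLAMP_FLAG_KEEP_PARAMS 0x10
#define UCLAMP_FLAG_UTIL_MIN    0x20
#define UCLAMP_FLAG_UTIL_MAX    0x40

/* Hypothetical local mirror of the kernel's struct sched_attr layout. */
struct uclamp_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;   /* UCLAMP_MIN, 0..1024 */
    uint32_t sched_util_max;   /* UCLAMP_MAX, 0..1024 */
};

/* Build a request that changes only UCLAMP_MAX and asks the kernel to
 * keep the current policy, nice value and priority. */
void fill_uclamp_max(struct uclamp_attr *attr, uint32_t util_max)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->sched_flags = UCLAMP_FLAG_KEEP_POLICY | UCLAMP_FLAG_KEEP_PARAMS |
                        UCLAMP_FLAG_UTIL_MAX;
    attr->sched_util_max = util_max;
}

/* Apply the cap to the calling thread (pid 0).
 * Returns 0 on success, -1 with errno set on failure. */
int cap_self_uclamp_max(uint32_t util_max)
{
    struct uclamp_attr attr;

    fill_uclamp_max(&attr, util_max);
    return (int)syscall(SYS_sched_setattr, 0, &attr, 0);
}
```

A garbage collector thread could call cap_self_uclamp_max(256) before a
background cycle and cap_self_uclamp_max(1024) afterwards to opt back out.
Setting UCLAMP_MIN works the same way with UCLAMP_FLAG_UTIL_MIN and
sched_util_min.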

There's a tool called uclampset in util-linux v2.37.2 that allows you to play
with this. See this commit message for an example:

	https://lore.kernel.org/lkml/20211216225320.2957053-2-qais.yousef@arm.com/

There are some issues that you might need to be aware of though.

	1. UCLAMP_MAX effectiveness issues when there are multiple tasks with
	   different demands running on the same CPU.

	   This LPC talk will explain the problem:
	   https://www.youtube.com/watch?v=i5BdYn6SNQc&t=680s

	2. fits_capacity() is not uclamp aware yet, which means the task
	   placement bias will not work as well as it should.

I am working on both of these issues, and on kernel documentation to better
explain the feature. There's also a cgroup interface in the cpu controller
(cpu.uclamp.min and cpu.uclamp.max).
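A minimal sketch of driving the cgroup interface from a program. It
assumes cgroup v2, where cpu.uclamp.min and cpu.uclamp.max take a
percentage such as "50.00". The helper name write_uclamp_percent and any
cgroup path you pass to it are hypothetical.

```c
#include <stdio.h>

/* Write a clamp percentage (0.00..100.00) to a cpu.uclamp.min or
 * cpu.uclamp.max file. Returns 0 on success, -1 on any failure. */
int write_uclamp_percent(const char *path, double percent)
{
    FILE *f;
    int ok;

    if (percent < 0.0 || percent > 100.0)
        return -1;

    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    ok = fprintf(f, "%.2f\n", percent) > 0;
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

For example, write_uclamp_percent("/sys/fs/cgroup/background/cpu.uclamp.max", 25.0)
would cap every task in a hypothetical "background" group at a quarter of
the maximum performance level.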

You need to use the schedutil cpufreq governor.
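A minimal sketch of checking the governor before relying on uclamp. It
assumes the usual cpufreq sysfs layout, where each CPU exposes its
governor in /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor. The
function name governor_is_schedutil is made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file at path names the schedutil governor, 0 if it
 * names another governor, and -1 if the file cannot be read. */
int governor_is_schedutil(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return strcmp(buf, "schedutil") == 0;
}
```

Calling it with "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
checks CPU 0. On a heterogeneous system each cpufreq policy can have its
own governor, so it is worth checking one CPU per cluster.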

There is an LWN article on the feature that might help with more background:

	https://lwn.net/Articles/762043/

HTH.

Cheers

--
Qais Yousef