Message-ID: <1598998.9rBByrtVSM@aspire.rjw.lan>
Date: Thu, 09 Aug 2018 23:52:29 +0200
From: "Rafael J. Wysocki" <rjw@...ysocki.net>
To: Quentin Perret <quentin.perret@....com>
Cc: peterz@...radead.org, linux-kernel@...r.kernel.org,
linux-pm@...r.kernel.org, gregkh@...uxfoundation.org,
mingo@...hat.com, dietmar.eggemann@....com,
morten.rasmussen@....com, chris.redpath@....com,
patrick.bellasi@....com, valentin.schneider@....com,
vincent.guittot@...aro.org, thara.gopinath@...aro.org,
viresh.kumar@...aro.org, tkjos@...gle.com, joel@...lfernandes.org,
smuckle@...gle.com, adharmap@...cinc.com, skannan@...cinc.com,
pkondeti@...eaurora.org, juri.lelli@...hat.com,
edubezval@...il.com, srinivas.pandruvada@...ux.intel.com,
currojerez@...eup.net, javi.merino@...nel.org
Subject: Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework

On Tuesday, July 24, 2018 2:25:10 PM CEST Quentin Perret wrote:
> Several subsystems in the kernel (task scheduler and/or thermal at the
> time of writing) can benefit from knowing about the energy consumed by
> CPUs. Yet, this information can come from different sources (DT or
> firmware for example), in different formats, hence making it hard to
> exploit without a standard API.
>
> As an attempt to address this, introduce a centralized Energy Model
> (EM) management framework which aggregates the power values provided
> by drivers into a table for each frequency domain in the system. The
> power cost tables are made available to interested clients (e.g. task
> scheduler or thermal) via platform-agnostic APIs. The overall design
> is represented by the diagram below (focused on Arm-related drivers as
> an example, but applicable to any architecture):
>
>        +---------------+  +-----------------+  +-------------+
>        | Thermal (IPA) |  | Scheduler (EAS) |  |    Other    |
>        +---------------+  +-----------------+  +-------------+
>                |                   | em_fd_energy()   |
>                |                   | em_cpu_get()     |
>                +-----------+       |       +----------+
>                            |       |       |
>                            v       v       v
>                         +---------------------+
>                         |                     |
>                         |    Energy Model     |
>                         |                     |
>                         |      Framework      |
>                         |                     |
>                         +---------------------+
>                            ^       ^       ^
>                            |       |       | em_register_freq_domain()
>                 +----------+       |       +---------+
>                 |                  |                 |
>         +---------------+  +---------------+  +--------------+
>         |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
>         +---------------+  +---------------+  +--------------+
>                 ^                  ^                 ^
>                 |                  |                 |
>         +--------------+   +---------------+  +--------------+
>         | Device Tree  |   |   Firmware    |  |      ?       |
>         +--------------+   +---------------+  +--------------+
>
> Drivers (typically, but not limited to, CPUFreq drivers) can register
> data in the EM framework using the em_register_freq_domain() API. The
> calling driver must provide a callback function with a standardized
> signature that will be used by the EM framework to build the power
> cost tables of the frequency domain. This design should offer a lot of
> flexibility to calling drivers, which are free to read information
> from any location and to use any technique to compute power costs.
> Moreover, the capacity states registered by drivers in the EM framework
> are not required to match real performance states of the target. This
> is particularly important on targets where the performance states are
> not known by the OS.
>
> On the client side, the EM framework offers APIs to access the power
> cost tables of a CPU (em_cpu_get()), and to estimate the energy
> consumed by the CPUs of a frequency domain (em_fd_energy()). Clients
> such as the task scheduler can then use these APIs to access the shared
> data structures holding the Energy Model of CPUs.
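
For readers following the thread, the flow described in the quoted text
(a driver callback fills a per-domain power cost table, and a client then
asks for an energy estimate) can be sketched in userspace C roughly as
follows. This is only an illustration under stated assumptions: every
identifier below (cap_state, freq_domain, domain_register, domain_energy,
example_cb) is hypothetical and is not the code from the patch, and the
estimate (power of the lowest state that covers the peak utilization,
scaled by the summed utilization) is an assumed simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the kernel's EM API. */
#define MAX_STATES 8

struct cap_state {
    unsigned long capacity; /* compute capacity offered by this state */
    unsigned long power;    /* power drawn at this state, arbitrary units */
};

struct freq_domain {
    struct cap_state table[MAX_STATES];
    size_t nr_states;
};

/* Driver-provided callback: describe state i, return 0 on success. */
typedef int (*active_power_cb)(size_t i, struct cap_state *out);

/* Build the cost table by asking the driver about each state in turn. */
int domain_register(struct freq_domain *fd, size_t nr_states,
                    active_power_cb cb)
{
    if (!fd || !cb || nr_states == 0 || nr_states > MAX_STATES)
        return -1;

    for (size_t i = 0; i < nr_states; i++) {
        if (cb(i, &fd->table[i]) != 0)
            return -1;
        /* Require strictly ascending, non-zero capacities. */
        if (fd->table[i].capacity == 0)
            return -1;
        if (i > 0 && fd->table[i].capacity <= fd->table[i - 1].capacity)
            return -1;
    }
    fd->nr_states = nr_states;
    return 0;
}

/*
 * Assumed estimate: pick the lowest state whose capacity covers max_util
 * (or the highest state if none does), then scale its power by the share
 * of that capacity actually used across the domain.
 */
unsigned long domain_energy(const struct freq_domain *fd,
                            unsigned long max_util, unsigned long sum_util)
{
    assert(fd && fd->nr_states > 0);

    const struct cap_state *cs = &fd->table[fd->nr_states - 1];
    for (size_t i = 0; i < fd->nr_states; i++) {
        if (fd->table[i].capacity >= max_util) {
            cs = &fd->table[i];
            break;
        }
    }
    return cs->power * sum_util / cs->capacity;
}

/* Example driver callback with made-up numbers. */
int example_cb(size_t i, struct cap_state *out)
{
    static const struct cap_state states[] = {
        { 256, 50 }, { 512, 150 }, { 1024, 600 },
    };

    if (i >= sizeof(states) / sizeof(states[0]))
        return -1;
    *out = states[i];
    return 0;
}
```

A driver would call domain_register(&fd, 3, example_cb) once, after which
clients only ever see the table. Where the numbers came from (DT, firmware
or anything else) stays hidden behind the callback.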

I'm a bit concerned that the code here appears to be designed around the
concept of frequency domains, which seems to be a limitation and is
probably related to the properties of the current generation of hardware.
Assumptions like that tend to get tightly tangled into the code over time,
and they may be hard to untangle from it when new use cases arise later.

For example, there probably will be more firmware involvement in future
systems, and the firmware may not be willing to expose "raw" frequency
domains to the OS. That is already the case with P-states on Intel HW and
with ACPI CPPC in general.

IMO, frequency domains in your current code could be replaced with something
more general, like "performance domains" providing the scheduler with the
(relative) cost of running a task on a busy (non-idle) CPU. Analogously,
"idle domains" would provide the scheduler with the (relative) cost of
waking up an idle CPU to run a task on it or, the other way around, the
possible relative gain from taking all tasks away from a CPU in order to
make it go idle.
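
To make the suggestion above a bit more concrete, here is a rough userspace
sketch of what such an abstraction might look like. It is only an
illustration: all of the names (perf_level, perf_domain, idle_domain,
perf_domain_cost, placement_cost) are hypothetical, nothing here is an
existing kernel interface, and treating a CPU with zero utilization as
idle is a simplifying assumption. No frequency appears anywhere; the
scheduler would only see abstract levels and relative costs.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; nothing below is an existing kernel API. */

/* One abstract performance level: only a capacity and a relative cost. */
struct perf_level {
    unsigned long capacity; /* work a CPU can absorb at this level */
    unsigned long cost;     /* relative cost of running busy at this level */
};

/* CPUs whose performance level is selected together. */
struct perf_domain {
    const struct perf_level *levels; /* ascending capacity and cost */
    size_t nr_levels;
};

/* CPUs that go idle together, with the relative cost of waking one up. */
struct idle_domain {
    unsigned long wakeup_cost;
};

/* Relative cost of serving `util` at the lowest level that covers it. */
unsigned long perf_domain_cost(const struct perf_domain *pd,
                               unsigned long util)
{
    assert(pd && pd->nr_levels > 0);

    for (size_t i = 0; i < pd->nr_levels; i++) {
        if (pd->levels[i].capacity >= util)
            return pd->levels[i].cost;
    }
    /* Saturated: fall back to the highest level. */
    return pd->levels[pd->nr_levels - 1].cost;
}

/*
 * Relative cost of adding a task to a CPU. A CPU with cpu_util == 0 is
 * treated as idle (an assumption), so it pays the wakeup cost on top of
 * the busy cost.
 */
unsigned long placement_cost(const struct perf_domain *pd,
                             const struct idle_domain *id,
                             unsigned long cpu_util, unsigned long task_util)
{
    unsigned long before = cpu_util ? perf_domain_cost(pd, cpu_util) : 0;
    unsigned long after = perf_domain_cost(pd, cpu_util + task_util);
    unsigned long wake = cpu_util ? 0 : id->wakeup_cost;

    return (after > before ? after - before : 0) + wake;
}

/* Made-up example data. */
const struct perf_level example_levels[] = {
    { 256, 10 }, { 512, 30 }, { 1024, 100 },
};
```

With these made-up numbers, waking an idle CPU for a task of utilization
200 costs more than stacking the same task on a CPU that is already busy
at 200, which is the kind of comparison the scheduler could make without
knowing anything about frequencies.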

Also bear in mind that the CPUs the scheduler deals with are logical ones,
so they may be, for example, hardware threads within a single core.

Thanks,
Rafael