Date:   Tue, 16 Oct 2018 11:28:29 +0200
From:   Lukasz Luba <l.luba@...tner.samsung.com>
To:     Ingo Molnar <mingo@...nel.org>,
        Thara Gopinath <thara.gopinath@...aro.org>
Cc:     linux-kernel@...r.kernel.org, mingo@...hat.com,
        peterz@...radead.org, rui.zhang@...el.com,
        gregkh@...uxfoundation.org, rafael@...nel.org,
        amit.kachhap@...il.com, viresh.kumar@...aro.org,
        javi.merino@...nel.org, edubezval@...il.com,
        daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
        quentin.perret@....com, ionela.voinescu@....com,
        vincent.guittot@...aro.org, l.luba@...tner.samsung.com,
        Bartlomiej Zolnierkiewicz <b.zolnierkie@...sung.com>
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure


On 10/16/2018 09:33 AM, Ingo Molnar wrote:
> 
> * Thara Gopinath <thara.gopinath@...aro.org> wrote:
> 
>>>> Regarding testing, basic build, boot and sanity testing have been
>>>> performed on hikey960 mainline kernel with a Debian file system.
>>>> Further aobench (an occlusion renderer for benchmarking real-world
>>>> floating point performance) showed the following results on hikey960
>>>> with Debian.
>>>>
>>>>                                          Result          Standard        Standard
>>>>                                          (Time secs)     Error           Deviation
>>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
>>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
>>>
>>> Wow, +13% speedup, impressive! We definitely want this outcome.
>>>
>>> I'm wondering what happens if we do not track and decay the thermal
>>> load at all at the PELT level, but instantaneously decrease/increase
>>> effective CPU capacity in reaction to thermal events we receive from
>>> the CPU.
>>
>> The problem with instantaneous update is that sometimes thermal events
>> happen at a much faster pace than cpu_capacity is updated in the
>> scheduler. This means that at the moment when the scheduler uses the
>> value, it might not be correct anymore.
> 
> Let me offer a different interpretation: if we average throttling events
> then we create a 'smooth' average of 'true CPU capacity' that doesn't
> fluctuate much. This allows more stable yet asymmetric task placement if
> the thermal characteristics of the different cores are different
> (asymmetric). This, compared to instantaneous updates, would reduce
> unnecessary task migrations between cores.
> 
> Is that accurate?
> 
> If the thermal characteristics of the cores are roughly symmetric and the
> measured CPU-intense load itself is symmetric as well, then I have
> trouble seeing why reacting to thermal events should make any difference
> at all.
> 
> Are there any inherent asymmetries in the thermal properties of the
> cores, or in the benchmarked workload itself?
The aobench build that I have, at least, is a single-threaded app.
If the process migrates to a cluster and core which is on average
faster, it will gain.
The hikey960 platform has a limited number of OPPs:
big cluster: 2.36, 2.1, 1.8, 1.4, 0.9 [GHz]
little cluster: 1.84, 1.7, 1.4, 1.0, 0.5 [GHz]
Compared to the Exynos5433, which has 15 OPPs for the big cluster spaced
every 100 MHz, it is harder to pick the right one.
I can imagine that the thermal governor is jumping between 1.8, 1.4 and
0.9 GHz on the big cluster. Maybe the little cluster is at a higher OPP
and running there for longer would help. The thermal time slots are
100 ms (based on this DT).
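To illustrate the difference between reacting instantaneously and
averaging, here is a minimal user-space sketch (not the actual PELT
code; the decay factor, the per-window stepping and the capacity values
scaled from the OPP list above are only illustrative):

/*
 * Toy model: the thermal governor jumps between the 1.8, 1.4 and 0.9 GHz
 * OPPs every 100 ms window.  Capacities are scaled from the big-cluster
 * OPPs against 1024 at 2.36 GHz.  The decayed average changes far less
 * per window than the instantaneous capped capacity.
 */
#include <stdio.h>

int main(void)
{
	/* capped capacity per 100 ms window: 1.8 -> 1.4 -> 0.9 GHz ... */
	int capped[] = { 781, 607, 390, 607, 781, 390, 607, 781 };
	double y = 0.9;		/* decay per window, illustration only */
	double avg = 1024.0;
	unsigned int i;

	for (i = 0; i < sizeof(capped) / sizeof(capped[0]); i++) {
		avg = y * avg + (1.0 - y) * capped[i];
		printf("window %u: instantaneous=%4d averaged=%6.1f\n",
		       i, capped[i], avg);
	}
	return 0;
}

The scheduler reading the averaged value sees a slowly drifting
capacity instead of one that flips by a few hundred capacity units
every window.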

Regarding other asymmetries, different parts of the cluster and core are
utilized depending on the workload and data set.
There might be floating point or vectorized code utilizing the long
pipelines in NEON while also causing fewer cache misses.
That will warm up more than the integer unit, or a copy using the
load/store unit (which occupies less silicon (and 'C' capacitance)),
at the same frequency.

There are also SoCs which have a single power rail from the DCDC in the
PMIC for both asymmetric clusters. Inside the SoC, in front of these
clusters, there is an internal LDO which reduces the voltage to the
cluster. In such a system the cpufreq driver chooses the max of the
clusters' voltages and sets it on the PMIC, then sets the LDO voltage
difference for the cluster with the smaller voltage. This causes another
asymmetry, because more current going through the LDO produces more heat
than a direct DCDC supply (i.e. seen as heat on the big cluster).
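A rough sketch of that shared-rail case (the helper name and the
voltages are hypothetical, this is not a real regulator driver):

#include <stdio.h>

/*
 * One DCDC rail from the PMIC feeds both clusters; an internal LDO in
 * front of each cluster drops the rail to what that cluster needs.  The
 * dropped voltage times the cluster current is dissipated as heat inside
 * the SoC, which is the extra asymmetry described above.
 */
static void set_shared_rail(int uv_big, int uv_little)
{
	int rail_uv = uv_big > uv_little ? uv_big : uv_little;

	printf("PMIC DCDC rail      -> %d uV\n", rail_uv);
	printf("LDO drop for big    -> %d uV\n", rail_uv - uv_big);
	printf("LDO drop for little -> %d uV\n", rail_uv - uv_little);
}

int main(void)
{
	set_shared_rail(1000000, 800000);	/* example voltages only */
	return 0;
}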

There are also asymmetries from powering down portions of the cache.
I have been developing such a driver. Based on the memory traffic and
the cache hit/miss ratio it chooses how much of the cache can be powered
down. I can imagine that some HW does it without the need of SW assist.
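A rough user-space sketch of that kind of policy (the thresholds and the
four-portion granularity are invented for illustration, this is not the
actual driver):

#include <stdio.h>

#define CACHE_PORTIONS	4

/* decide how many cache portions can be powered down */
static int portions_to_power_down(double hit_ratio, double traffic_mbps)
{
	if (hit_ratio > 0.95 && traffic_mbps < 100.0)
		return 3;	/* tiny working set, keep one portion */
	if (hit_ratio > 0.85 && traffic_mbps < 500.0)
		return 2;
	if (hit_ratio > 0.70)
		return 1;
	return 0;		/* cache-hungry workload, keep it all powered */
}

int main(void)
{
	printf("power down %d of %d portions\n",
	       portions_to_power_down(0.97, 50.0), CACHE_PORTIONS);
	return 0;
}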

There are SoCs with DDR modules mounted on top - PoP.
I still have to investigate what is different in the SoC power budget
in such a setup (depending on the workload).

There are also UI workloads using the GPU, which can also be utilized
in 'portions' (shader cores from 1 to 32).

These asymmetries mean that the simple assumption
P_dynamic = C * V^2 * f
is probably not enough.
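For example (the effective capacitance values are invented just to show
the spread):

#include <stdio.h>

/* P_dynamic = C * V^2 * f, with C in nF, V in volts, f in GHz -> watts */
static double p_dyn(double c_eff_nf, double volt, double freq_ghz)
{
	return c_eff_nf * 1e-9 * volt * volt * freq_ghz * 1e9;
}

int main(void)
{
	/* same V and f, different effective C depending on the units used */
	printf("integer/load-store heavy: %.2f W\n", p_dyn(0.6, 0.9, 1.8));
	printf("NEON/FP heavy           : %.2f W\n", p_dyn(1.1, 0.9, 1.8));
	return 0;
}

The same V and f give quite different power (and heat) depending on
which units the workload keeps busy, so a single per-OPP C does not
capture it.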

I would suggest choosing a platform with more fine-grained OPPs, or
adding more OPPs to the hikey960, and repeating the tests.

Regards,
Lukasz Luba

> 
> Thanks,
> 
> 	Ingo
> 
> 
