Message-ID: <87r2jnf6w0.fsf@riseup.net>
Date:   Sat, 28 Jul 2018 13:21:51 -0700
From:   Francisco Jerez <currojerez@...eup.net>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        lenb@...nel.org, rjw@...ysocki.net, peterz@...radead.org,
        ggherdovich@...e.cz, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org, juri.lelli@...hat.com,
        viresh.kumar@...aro.org, Chris Wilson <chris@...is-wilson.co.uk>,
        Tvrtko Ursulin <tvrtko.ursulin@...ux.intel.com>,
        Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>,
        Eero Tamminen <eero.t.tamminen@...el.com>
Subject: Re: [PATCH 4/4] cpufreq: intel_pstate: enable boost for Skylake Xeon

Mel Gorman <mgorman@...hsingularity.net> writes:

> On Fri, Jul 27, 2018 at 10:34:03PM -0700, Francisco Jerez wrote:
>> Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com> writes:
>> 
>> > Enable HWP boost on Skylake server and workstations.
>> >
>> 
>> Please revert this series, it led to significant energy usage and
>> graphics performance regressions [1].  The reasons are roughly the ones
>> we discussed by e-mail off-list last April: This causes the intel_pstate
>> driver to decrease the EPP to zero when the workload blocks on IO
>> frequently enough, which for the regressing benchmarks detailed in [1]
>> is a symptom of the workload being heavily IO-bound, which means they
>> won't benefit at all from the EPP boost since they aren't significantly
>> CPU-bound, and they will suffer a decrease in parallelism due to the
>> active CPU core using a larger fraction of the TDP in order to achieve
>> the same work, causing the GPU to have a lower power budget available,
>> leading to a decrease in system performance.
>
> It slices both ways.

I don't think it's acceptable to land an optimization that trades
performance of one use-case for another, especially since one could make
both use-cases happy by avoiding the boost in cases where we know
beforehand that it won't improve performance.  An application that
frequently waits on an IO device which is already 100% utilized isn't
going to run faster just because we ramp up the CPU frequency, since the
IO device won't be able to process its requests any faster.  In that
case we will only be pessimizing energy efficiency (and potentially
decreasing performance of the GPU *and* of other CPU cores living on the
same package for no benefit).
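
The kind of check described above could be sketched roughly as follows.
This is a toy in Python, not driver code and not what intel_pstate does:
the function name, its inputs and both thresholds are hypothetical, and
it simply assumes that a per-device utilization figure is available to
the governor.

```python
def should_boost(io_waits_per_sec, io_device_utilization,
                 wait_threshold=100.0, saturation_threshold=0.95):
    """Decide whether an IO-wait boost is likely to help.

    All parameters are hypothetical illustrations:
      io_waits_per_sec      -- how often the task blocks on IO
      io_device_utilization -- fraction of time the device is busy (0..1)
    """
    # The series boosts when the workload blocks on IO frequently enough.
    frequently_blocked = io_waits_per_sec >= wait_threshold
    # If the device is already saturated, a faster CPU cannot make it
    # complete requests sooner, so the boost would only cost power.
    device_saturated = io_device_utilization >= saturation_threshold
    return frequently_blocked and not device_saturated
```

With these made-up thresholds, a task that blocks often on a mostly idle
device would still get the boost, while the same task waiting on a fully
utilized device would not.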

> With the series, there are large boosts to performance on other
> workloads where a slight increase in power usage is acceptable in
> exchange for performance. For example,
>
> Single socket skylake running sqlite
>                                  v4.17               41ab43c9
> Min       Trans     2580.85 (   0.00%)     5401.58 ( 109.29%)
> Hmean     Trans     2610.38 (   0.00%)     5518.36 ( 111.40%)
> Stddev    Trans       28.08 (   0.00%)      208.90 (-644.02%)
> CoeffVar  Trans        1.08 (   0.00%)        3.78 (-251.57%)
> Max       Trans     2648.02 (   0.00%)     5992.74 ( 126.31%)
> BHmean-50 Trans     2629.78 (   0.00%)     5643.81 ( 114.61%)
> BHmean-95 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
> BHmean-99 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
>
> That's over doubling the transactions per second for that workload.
>
> Two-socket skylake running dbench4
>                                 v4.17               41ab43c9
> Amean      1         40.85 (   0.00%)       14.97 (  63.36%)
> Amean      2         42.31 (   0.00%)       17.33 (  59.04%)
> Amean      4         53.77 (   0.00%)       27.85 (  48.20%)
> Amean      8         68.86 (   0.00%)       43.78 (  36.42%)
> Amean      16        82.62 (   0.00%)       56.51 (  31.60%)
> Amean      32       135.80 (   0.00%)      116.06 (  14.54%)
> Amean      64       737.51 (   0.00%)      701.00 (   4.95%)
> Amean      512    14996.60 (   0.00%)    14755.05 (   1.61%)
>
> This is reporting the average latency of operations running
> dbench. The series over halves the latencies. There are many examples
> of basic workloads that benefit heavily from the series and while I
> accept it may not be universal, such as the case where the graphics
> card needs the power and not the CPU, a straight revert is not the
> answer. Without the series, HWP cripples the CPU.
>

That seems like a huge overstatement.  HWP doesn't "cripple" the CPU
without this series.  It will certainly set lower clocks than with this
series for workloads like the ones you show above, which utilize the CPU
very intermittently (i.e. they underutilize it).  But one could argue
that such workloads are inherently misdesigned and will perform
suboptimally regardless of the behavior of the CPUFREQ governor, for two
different reasons.  On the one hand, they are unable to fully utilize
their CPU time (otherwise HWP would already be giving them a CPU
frequency close to the maximum).  On the other hand, in order to achieve
maximum performance they will necessarily have to bounce back and forth
between the maximum P-state and idle at high frequency, which is
inherently energy-inefficient and will effectively *decrease* the
overall number of requests per second that an actual multi-threaded
server can process, even though the request throughput may seem to
increase in a single-threaded benchmark.
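
One way to illustrate the energy argument is the textbook approximation
that dynamic power scales as C*V^2*f with voltage roughly proportional
to frequency, i.e. power ~ k*f^3.  That model is an assumption brought
in for illustration only: it ignores static power and any race-to-idle
benefit, and the constant k and both function names are hypothetical.

```python
def energy_for_work(cycles, freq, k=1.0):
    """Energy to retire `cycles` of work at frequency `freq`,
    assuming dynamic power = k * freq**3 (toy model)."""
    power = k * freq ** 3
    time = cycles / freq
    return power * time  # equals k * cycles * freq**2


def aggregate_throughput(power_budget, freq, k=1.0):
    """Total cycles per unit time across all cores that fit in a fixed
    package power budget when each core runs at `freq` (toy model)."""
    cores = power_budget / (k * freq ** 3)
    return cores * freq  # equals power_budget / (k * freq**2)
```

Under this model the same amount of work costs four times the energy at
twice the frequency, and a package limited by a fixed power budget
retires a quarter of the aggregate work, which is the sense in which a
single-threaded gain can turn into a multi-threaded loss.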

> -- 
> Mel Gorman
> SUSE Labs

