Message-ID: <49e98b00-80c7-b3a4-30fd-bccb382d002b@oracle.com>
Date: Tue, 21 Nov 2017 23:23:18 -0600
From: Atish Patra <atish.patra@...cle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Joel Fernandes <joelaf@...gle.com>,
LKML <linux-kernel@...r.kernel.org>,
Brendan Jackman <brendan.jackman@....com>,
Josef Bacik <jbacik@...com>, Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race
window.
Here are the results of schbench (a scheduler latency benchmark) and uperf
(a networking benchmark).

Hardware config: 20-core (40 hyperthreaded cpus) x86 box.
schbench config: message threads = 2; time = 180s; worker threads = variable
uperf config: ping-pong test on the loopback interface with message size = 8k

Overall, both benchmarks seem happiest when the number of threads is close
to the number of cpus.
---------------------------------------------------------------------------
schbench Maximum Latency (lower is better)

                Base (4.14)            Base+pcpu
Num
Worker        Mean      Stdev       Mean      Stdev   Improvement (%)
10          3026.8    4987.12      523       474.35      82.7210
18         13854.6    1841.61    12945.6     125.19       6.5609
19         16457      2046.51    12985.4      48.46      21.0949
20         14995      2368.84    15838      2038.82      -5.6218
25         29952.2     107.72    29673.6     337.57       0.9301
30         30084       19.768   30096.2        7.782     -0.0405
---------------------------------------------------------------------------
The proposed fix seems to improve maximum latency for lower numbers of
threads, and it also reduces the variation (lower stdev). When the number
of threads is equal to or higher than the number of cpus, the benchmark
produces significantly higher latencies by its nature; results for the
higher-thread cases are included for completeness, but it is difficult to
conclude anything from them.

Next, per-percentile results are presented for each use case. The proposed
fix also improves latency across all percentiles for the configuration
(19 worker threads) that should saturate the system.
---------------------------------------------------------------------------
schbench Latency in usec (lower is better)

                Baseline (4.14)        Base+pcpu
Num
Worker        Mean      stdev       Mean      stdev   Improvement (%)

50th percentile
10            64.2      2.039       63.6      1.743       0.934
18            57.6      5.388       57        4.939       1.041
19            63        4.774       58        4           7.936
20            59.6      4.127       60.2      5.153      -1.006
25            78.4      0.489       78.2      0.748       0.255
30            96.2      0.748       96.4      1.019      -0.207

75th percentile
10            72        3.033       71.6      2.939       0.555
18            78        2.097       77.2      2.135       1.025
19            81.6      1.2         79.4      0.8         2.696
20            81        1.264       80.4      2.332       0.740
25           109.6      1.019      110        0          -0.364
30           781.4     50.902      731.8     70.6382      6.3475

90th percentile
10            80.4      3.666       80.6      2.576      -0.248
18            87.8      1.469       88        1.673      -0.227
19            92.8      0.979       90.6      0.489       2.370
20            92.6      1.019       92        2           0.647
25          8977.6   1277.160     9014.4    467.857      -0.409
30          9558.4    334.641     9507.2    320.383       0.5356

95th percentile
10            86.8      3.867       87.6      4.409      -0.921
18            95.4      1.496       95.2      2.039       0.209
19           102.6      1.624       99        0.894       3.508
20           103.2      1.326      102.2      2.481       0.968
25         12400       78.383    12406.4     37.318      -0.051
30         12336       40.477    12310.4     12.8         0.207

99th percentile
10            99.2      5.418      103.4      6.887      -4.233
18           115.2      2.561      114.6      3.611       0.5208
19           126.25     4.573      120.4      3.872       4.6336
20           145.4      3.09       133        1.41        8.5281
25         12988.8     15.676    12924.8     25.6         0.4927
30         12988.8     15.676    12956.8     32.633       0.2463

99.5th percentile
10           104.4      5.161      109.8      7.909      -5.172
18           127.6      7.391      124.2      4.214       2.6645
19          2712.2   4772.883      133.6      5.571      95.074
20          3707.8   2831.954     2844.2   4708.345      23.291
25         14032    1283.834     13008       0            7.2976
30         16550.4    886.382    13840     1218.355      16.376
---------------------------------------------------------------------------
Results from uperf (higher is better)
uperf config: loopback ping-pong test with message size = 8k

                Baseline (4.14)        Baseline+pcpu
Num
Threads       Mean      stdev       Mean      stdev   Improvement (%)
1             9.056     0.02        8.966     0.083      -0.993
2            17.664     0.13       17.448     0.303      -1.222
4            32.03      0.22       31.972     0.129      -0.181
8            58.198     0.31       58.588     0.198       0.670
16          101.018     0.67      100.056     0.455      -0.952
32          148.1      15.41      164.494     2.312      11.069
64          203.66      1.16      203.042     1.348      -0.3073
128         197.12      1.04      194.722     1.174      -1.2165
---------------------------------------------------------------------------
The race-window fix seems to help uperf as well at 32 threads (the count
closest to the number of cpus).
Regards,
Atish
On 11/04/2017 07:58 PM, Joel Fernandes wrote:
> Hi Peter,
>
> On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@...radead.org> wrote:
>> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
>>> Currently, multiple tasks can wake up on the same cpu from the
>>> select_idle_sibling() path in case they wake up simultaneously
>>> and last ran on the same llc. This happens because an idle cpu
>>> is not updated until the idle task is scheduled out. Any task waking
>>> during that period may potentially select that cpu as a wakeup
>>> candidate.
>>>
>>> Introduce a per-cpu variable that is set as soon as a cpu is
>>> selected for wakeup for any task. This prevents other tasks
>>> from selecting the same cpu again. Note: this does not close the
>>> race window but minimizes it to accessing the per-cpu variable.
>>> If two wakee tasks access the per-cpu variable at the same time,
>>> they may still select the same cpu. But it narrows the race
>>> window considerably.
>> The very most important question; does it actually help? What
>> benchmarks, give what numbers?
> I collected some numbers with an Android benchmark called Jankbench.
> Most tests didn't show an improvement or degradation with the patch.
> However, one of the tests, called "list view", consistently shows an
> improvement. Particularly striking is the improvement at the mean and
> the 25th percentile.
>
> For the list_view test, Jankbench pulls up a list of text and scrolls
> the list; this exercises the display pipeline in Android to render and
> display the animation as the scroll happens. For Android, lower frame
> times are considered quite important, as that means we are less likely
> to drop frames and can give the user a good experience rather than a
> perceivably poor one.
>
> For each frame, Jankbench measures the total time a frame takes and
> stores it in a DB (the time from which the app starts drawing, to when
> the rendering completes and the frame is submitted for display).
> Following is the distribution of frame times in ms.
>
> count 16304 (@60 fps, 4.5 minutes)
>
> Without patch With patch
> mean 5.196633 4.429641 (+14.75%)
> std 2.030054 2.310025
> 25% 5.606810 1.991017 (+64.48%)
> 50% 5.824013 5.716631 (+1.84%)
> 75% 5.987102 5.932751 (+0.90%)
> 95% 6.461230 6.301318 (+2.47%)
> 99% 9.828959 9.697076 (+1.34%)
>
> Note that although Android uses energy aware scheduling patches, I
> turned those off to bring the test as close to mainline as possible. I
> also backported Vincent's and Brendan's slow path fixes to the 4.4
> kernel that the Pixel 2 uses.
>
> Personally I am in favor of this patch considering this test data,
> but also because in the past I remember our teams had to deal with
> the same race issue and used cpusets to avoid it (although they
> probably tested with "energy aware" CPU selection kept on).
>
> thanks,
>
> - Joel