lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sat, 10 Oct 2020 18:14:23 +0200 (CEST)
From:   Julia Lawall <julia.lawall@...ia.fr>
To:     Valentin Schneider <valentin.schneider@....com>
cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        linux-kernel@...r.kernel.org,
        Gilles Muller <Gilles.Muller@...ia.fr>
Subject: Re: SD_LOAD_BALANCE

Hello,

Previously, I was wondering about why starting in Linux v5.8 my unblocking
threads were moving to different sockets more frequently than in previous
releases.  Now, I think that I have found the reason.

The first issue is the change from runnable load average to load average
in computing wake_affine_weight:

---

commit 11f10e5420f6cecac7d4823638bff040c257aba9
Author: Vincent Guittot <vincent.guittot@...aro.org>
Date:   Fri Oct 18 15:26:36 2019 +0200

    sched/fair: Use load instead of runnable load in wakeup path

    Runnable load was originally introduced to take into account the case where
    blocked load biases the wake up path which may end to select an overloaded
    CPU with a large number of runnable tasks instead of an underutilized
    CPU with a huge blocked load.

    Tha wake up path now starts looking for idle CPUs before comparing
    runnable load and it's worth aligning the wake up path with the
    load_balance() logic.

---

The unfortunate case is illustrated by the following trace (*** on the
important lines):

       trace-cmd-8006  [118]   451.444751: sched_migrate_task:   comm=containerd pid=2481 prio=120 orig_cpu=114 dest_cpu=118
          ua.B.x-8007  [105]   451.444752: sched_switch:         ua.B.x:8007 [120] S ==> swapper/105:0 [120]
       trace-cmd-8006  [118]   451.444769: sched_switch:         ua.B.x:8006 [120] S ==> containerd:2481 [120] ***
      containerd-2481  [118]   451.444859: sched_switch:         containerd:2481 [120] S ==> swapper/118:0 [120] ***
          ua.B.x-8148  [016]   451.444910: sched_switch:         ua.B.x:8148 [120] S ==> swapper/16:0 [120]
          ua.B.x-8154  [127]   451.445186: sched_switch:         ua.B.x:8154 [120] S ==> swapper/127:0 [120]
          ua.B.x-8145  [047]   451.445199: sched_switch:         ua.B.x:8145 [120] S ==> swapper/47:0 [120]
          ua.B.x-8138  [147]   451.445200: sched_switch:         ua.B.x:8138 [120] S ==> swapper/147:0 [120]
          ua.B.x-8152  [032]   451.445210: sched_switch:         ua.B.x:8152 [120] S ==> swapper/32:0 [120]
          ua.B.x-8144  [067]   451.445215: sched_switch:         ua.B.x:8144 [120] S ==> swapper/67:0 [120]
          ua.B.x-8137  [000]   451.445219: sched_switch:         ua.B.x:8137 [120] S ==> swapper/0:0 [120]
          ua.B.x-8140  [075]   451.445225: sched_switch:         ua.B.x:8140 [120] S ==> swapper/75:0 [120]
          ua.B.x-8155  [084]   451.445229: sched_switch:         ua.B.x:8155 [120] S ==> swapper/84:0 [120]
          ua.B.x-8161  [155]   451.445232: sched_switch:         ua.B.x:8161 [120] S ==> swapper/155:0 [120]
          ua.B.x-8156  [095]   451.445261: sched_switch:         ua.B.x:8156 [120] S ==> swapper/95:0 [120]
          ua.B.x-8153  [134]   451.445268: sched_switch:         ua.B.x:8153 [120] S ==> swapper/134:0 [120]
          ua.B.x-8151  [154]   451.445268: sched_switch:         ua.B.x:8151 [120] S ==> swapper/154:0 [120]
          ua.B.x-8141  [107]   451.445273: sched_switch:         ua.B.x:8141 [120] S ==> swapper/107:0 [120]
          ua.B.x-8146  [131]   451.445275: sched_switch:         ua.B.x:8146 [120] S ==> swapper/131:0 [120]
          ua.B.x-8160  [035]   451.445286: sched_switch:         ua.B.x:8160 [120] S ==> swapper/35:0 [120]
          ua.B.x-8136  [088]   451.445286: sched_switch:         ua.B.x:8136 [120] S ==> swapper/88:0 [120]
          ua.B.x-8159  [056]   451.445290: sched_switch:         ua.B.x:8159 [120] S ==> swapper/56:0 [120]
          ua.B.x-8147  [036]   451.445294: sched_switch:         ua.B.x:8147 [120] S ==> swapper/36:0 [120]
          ua.B.x-8150  [150]   451.445298: sched_switch:         ua.B.x:8150 [120] S ==> swapper/150:0 [120]
          ua.B.x-8142  [159]   451.445300: sched_switch:         ua.B.x:8142 [120] S ==> swapper/159:0 [120]
          ua.B.x-8157  [058]   451.445309: sched_switch:         ua.B.x:8157 [120] S ==> swapper/58:0 [120]
          ua.B.x-8149  [123]   451.445311: sched_switch:         ua.B.x:8149 [120] S ==> swapper/123:0 [120]
          ua.B.x-8162  [156]   451.445313: sched_switch:         ua.B.x:8162 [120] S ==> swapper/156:0 [120]
          ua.B.x-8164  [019]   451.445317: sched_switch:         ua.B.x:8164 [120] S ==> swapper/19:0 [120]
          ua.B.x-8139  [068]   451.445319: sched_switch:         ua.B.x:8139 [120] S ==> swapper/68:0 [120]
          ua.B.x-8143  [126]   451.445335: sched_switch:         ua.B.x:8143 [120] S ==> swapper/126:0 [120]
          ua.B.x-8163  [062]   451.445361: sched_switch:         ua.B.x:8163 [120] S ==> swapper/62:0 [120]
          ua.B.x-8158  [129]   451.445371: sched_switch:         ua.B.x:8158 [120] S ==> swapper/129:0 [120]
          ua.B.x-8040  [043]   451.445413: sched_wake_idle_without_ipi: cpu=0
          ua.B.x-8165  [098]   451.445451: sched_switch:         ua.B.x:8165 [120] S ==> swapper/98:0 [120]
          ua.B.x-8069  [009]   451.445622: sched_waking:         comm=ua.B.x pid=8007 prio=120 target_cpu=105
          ua.B.x-8069  [009]   451.445635: sched_wake_idle_without_ipi: cpu=105
          ua.B.x-8069  [009]   451.445638: sched_wakeup:         ua.B.x:8007 [120] success=1 CPU:105
          ua.B.x-8069  [009]   451.445639: sched_waking:         comm=ua.B.x pid=8006 prio=120 target_cpu=118
          <idle>-0     [105]   451.445641: sched_switch:         swapper/105:0 [120] R ==> ua.B.x:8007 [120]
          ua.B.x-8069  [009]   451.445645: bprint:               select_task_rq_fair: wake_affine_weight2 returning this_cpu: 614400 < 2981888
          ua.B.x-8069  [009]   451.445650: sched_migrate_task:   comm=ua.B.x pid=8006 prio=120 orig_cpu=118 dest_cpu=129 ***

Basically, core 118 has run both a thread of the NAS UA benchmark and a
containerd, and so it seems to have a higher load average than this_cpu, ie
9, when it wakes up.  The commit says "The wake up path now starts looking
for idle CPUs", but this is not always the case.  Here the target and prev
are not on the same socket, and in that case cpus_share_cache(prev, target)
fails and there is no check as to whether prev is idle.  The result is that
an idle core is left idle and the thread is migrated to another socket,
perhaps impacting locality.

Prior to v5.8 on my machine this was a rare event, because there were not
many of these background processes.  But in v5.8, the default governor for
Intel machines without the HWP feature was changed from intel_pstate to
intel_cpufreq.  The use of intel_cpufreq triggers very frequent kworkers on
all cores, which makes it much more likely that cores that are currently
idle, and are overall not at all overloaded, will have a higher load
average even with the waking thread deducted, than the core managing the
wakeup of the threads.

Would it be useful to always check whether prev is idle, perhaps in
wake_affine_idle or perhaps in select_idle_sibling?

Traces from various versions are available at
https://pages.lip6.fr/Julia.Lawall/uas.pdf.  5.8 and 5.9-rc7 are toward the
end of the file.  In these versions, all the threads systematically move
around at synchronization points in the program.

thanks,
julia

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ