linux-kernel - Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3802e27a-56ed-9495-21b9-7c4277065155@linux.intel.com>
Date:   Thu, 10 Dec 2020 16:23:47 +0800
From:   "Li, Aubrey" <aubrey.li@...ux.intel.com>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, valentin.schneider@....com,
        qais.yousef@....com, dietmar.eggemann@....com, rostedt@...dmis.org,
        bsegall@...gle.com, tim.c.chen@...ux.intel.com,
        linux-kernel@...r.kernel.org, Mel Gorman <mgorman@...e.de>,
        Jiang Biao <benbjiang@...il.com>
Subject: Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for
 task wakeup

Hi Mel,

On 2020/12/9 22:36, Mel Gorman wrote:
> On Wed, Dec 09, 2020 at 02:24:04PM +0800, Aubrey Li wrote:
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql
>> and kbuild were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
> 
> I ran this patch with tbench on top of of the schedstat patches that
> track SIS efficiency. The tracking adds overhead so it's not a perfect
> performance comparison but the expectation would be that the patch reduces
> the number of runqueues that are scanned

Thanks for the measurement! I don't play with tbench so may need a while
to digest the data.

> 
> tbench4
>                           5.10.0-rc6             5.10.0-rc6
>                       schedstat-v1r1          idlemask-v7r1
> Hmean     1        504.76 (   0.00%)      500.14 *  -0.91%*
> Hmean     2       1001.22 (   0.00%)      970.37 *  -3.08%*
> Hmean     4       1930.56 (   0.00%)     1880.96 *  -2.57%*
> Hmean     8       3688.05 (   0.00%)     3537.72 *  -4.08%*
> Hmean     16      6352.71 (   0.00%)     6439.53 *   1.37%*
> Hmean     32     10066.37 (   0.00%)    10124.65 *   0.58%*
> Hmean     64     12846.32 (   0.00%)    11627.27 *  -9.49%*
> Hmean     128    22278.41 (   0.00%)    22304.33 *   0.12%*
> Hmean     256    21455.52 (   0.00%)    20900.13 *  -2.59%*
> Hmean     320    21802.38 (   0.00%)    21928.81 *   0.58%*
> 
> Not very optimistic result. The schedstats indicate;

How many client threads was the following schedstats collected?

> 
>                                 5.10.0-rc6     5.10.0-rc6
>                             schedstat-v1r1  idlemask-v7r1
> Ops TTWU Count               5599714302.00  5589495123.00
> Ops TTWU Local               2687713250.00  2563662550.00
> Ops SIS Search               5596677950.00  5586381168.00
> Ops SIS Domain Search        3268344934.00  3229088045.00
> Ops SIS Scanned             15909069113.00 16568899405.00
> Ops SIS Domain Scanned      13580736097.00 14211606282.00
> Ops SIS Failures             2944874939.00  2843113421.00
> Ops SIS Core Search           262853975.00   311781774.00
> Ops SIS Core Hit              185189656.00   216097102.00
> Ops SIS Core Miss              77664319.00    95684672.00
> Ops SIS Recent Used Hit       124265515.00   146021086.00
> Ops SIS Recent Used Miss      338142547.00   403547579.00
> Ops SIS Recent Attempts       462408062.00   549568665.00
> Ops SIS Search Efficiency            35.18          33.72
> Ops SIS Domain Search Eff            24.07          22.72
> Ops SIS Fast Success Rate            41.60          42.20
> Ops SIS Success Rate                 47.38          49.11
> Ops SIS Recent Success Rate          26.87          26.57
> 
> The field I would expect to decrease is SIS Domain Scanned -- the number
> of runqueues that were examined but it's actually worse and graphing over
> time shows it's worse for the client thread counts.  select_idle_cpu()
> is definitely being called because "Domain Search" is 10 times higher than
> "Core Search" and there "Core Miss" is non-zero.

Why SIS Domain Scanned can be decreased?

I thought SIS Scanned was supposed to be decreased but it seems not on your side.

I printed some trace log on my side by uperf workload, and it looks properly.
To make the log easy to read, I started a 4 VCPU VM to run 2-second uperf 8 threads.

stage 1: system idle, update_idle_cpumask is called from idle thread, set cpumask to 0-3
========================================================================================
          <idle>-0       [002] d..1   137.408681: update_idle_cpumask: set_idle-1, cpumask: 2
          <idle>-0       [000] d..1   137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2
          <idle>-0       [003] d..1   137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3
          <idle>-0       [001] d..1   137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3

stage 2: uperf ramp up, cpumask changes back and forth
========================================================
           uperf-561     [003] d..3   137.410620: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..5   137.411384: select_task_rq_fair: scanning: 0-3
    kworker/u8:3-110     [000] d..4   137.411436: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d.h1   137.412562: update_idle_cpumask: set_idle-0, cpumask: 1-3
           uperf-570     [002] d.h2   137.412580: update_idle_cpumask: set_idle-0, cpumask: 1,3
          <idle>-0       [002] d..1   137.412917: update_idle_cpumask: set_idle-1, cpumask: 1-3
          <idle>-0       [000] d..1   137.413004: update_idle_cpumask: set_idle-1, cpumask: 0-3
           uperf-560     [000] d..5   137.415856: select_task_rq_fair: scanning: 0-3
    kworker/u8:3-110     [001] d..4   137.415956: select_task_rq_fair: scanning: 0-3
            sshd-504     [003] d.h1   137.416562: update_idle_cpumask: set_idle-0, cpumask: 0-2
           uperf-560     [000] d.h1   137.416598: update_idle_cpumask: set_idle-0, cpumask: 1-2
          <idle>-0       [003] d..1   137.416638: update_idle_cpumask: set_idle-1, cpumask: 1-3
          <idle>-0       [000] d..1   137.417076: update_idle_cpumask: set_idle-1, cpumask: 0-3
    tmux: server-528     [001] d.h.   137.448566: update_idle_cpumask: set_idle-0, cpumask: 0,2-3
          <idle>-0       [001] d..1   137.448980: update_idle_cpumask: set_idle-1, cpumask: 0-3

stage 3: uperf running, select_idle_cpu scan all the CPUs in the scheduler domain at the beginning
===================================================================================================
           uperf-560     [000] d..3   138.418494: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..3   138.418506: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..3   138.418514: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] dN.3   138.418534: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] dN.3   138.418543: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] dN.3   138.418551: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418577: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418600: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418617: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418640: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418652: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418662: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418672: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..5   138.418676: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418693: select_task_rq_fair: scanning: 0-3
    kworker/u8:3-110     [002] d..4   138.418746: select_task_rq_fair: scanning: 0-3

stage 4: scheduler tick comes, update idle cpumask to EMPTY
============================================================
           uperf-572     [002] d.h.   139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
           uperf-574     [000] d.H2   139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
           uperf-565     [003] d.H6   139.420569: update_idle_cpumask: set_idle-0, cpumask: 1
    tmux: server-528     [001] d.h2   139.420572: update_idle_cpumask: set_idle-0, cpumask: 


stage 5: uperf continue running, select_idle_cpu does not scan idle cpu
========================================================================
<I only run 2 seconds uperf, during this two seconds, no idle cpu in cpumask to scan>

           uperf-565     [003] d.sa   139.420587: select_task_rq_fair: scanning: 
           uperf-572     [002] d.sa   139.420670: select_task_rq_fair: scanning: 
	   ............
           uperf-561     [003] d..5   141.421620: select_task_rq_fair: scanning: 
           uperf-571     [001] d.sa   141.421630: select_task_rq_fair: scanning: 

stage 6: uperf benchmark finished, idle thread switch on 
=========================================================

          <idle>-0       [002] d..1   141.421631: update_idle_cpumask: set_idle-1, cpumask: 2
          <idle>-0       [000] d..1   141.421654: update_idle_cpumask: set_idle-1, cpumask: 0,2
          <idle>-0       [001] d..1   141.421665: update_idle_cpumask: set_idle-1, cpumask: 0-2
           uperf-561     [003] d..5   141.421712: select_task_rq_fair: scanning: 0-2
          <idle>-0       [003] d..1   141.421807: update_idle_cpumask: set_idle-1, cpumask: 0-3


stage 7: uperf ramp down
==========================
           uperf-560     [000] d..5   141.423075: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..5   141.423107: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..5   141.423259: select_task_rq_fair: scanning: 0-3
    tmux: server-528     [002] d..5   141.730489: select_task_rq_fair: scanning: 1-3
    kworker/u8:1-96      [003] d..4   141.731924: select_task_rq_fair: scanning: 1-3
          <idle>-0       [000] d..1   141.734560: update_idle_cpumask: set_idle-1, cpumask: 0-3
    tmux: server-528     [002] d.h1   141.736568: update_idle_cpumask: set_idle-0, cpumask: 0-1
        uperf.sh-558     [003] d.h.   141.736570: update_idle_cpumask: set_idle-0, cpumask: 0-1
          <idle>-0       [002] d..1   141.736718: update_idle_cpumask: set_idle-1, cpumask: 0-2
          <idle>-0       [003] d..1   141.738179: update_idle_cpumask: set_idle-1, cpumask: 0-3
           pkill-578     [001] d.h1   141.740569: update_idle_cpumask: set_idle-0, cpumask: 0,2-3
          <idle>-0       [001] d..1   141.741875: update_idle_cpumask: set_idle-1, cpumask: 0-3
           pkill-578     [001] d.h.   141.744570: update_idle_cpumask: set_idle-0, cpumask: 0,2-3
           pkill-578     [001] d..6   141.770012: select_task_rq_fair: scanning: 0,2-3
          <idle>-0       [001] d..1   141.770938: update_idle_cpumask: set_idle-1, cpumask: 0-3

In this case, SIS Scanned should be decreased. I'll apply your schedstat patch to see if
the data matches.

> 
> I suspect the issue is that the mask is only marked busy from the tick
> context which is a very wide window. If select_idle_cpu() picks an idle
> CPU from the mask, it's still marked as idle in the mask.
> 
That should be fine because we still check available_idle_cpu() and sched_idle_cpu for the selected
CPU. And even if that CPU is marked as idle in the mask, that mask should not be worse(bigger) than
the default sched_domain_span(sd).

Thanks,
-Aubrey