linux-kernel - Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

Message-ID: <55DEE556.3010802@citrix.com>
Date:	Thu, 27 Aug 2015 11:24:22 +0100
From:	George Dunlap <george.dunlap@...rix.com>
To:	Dario Faggioli <dario.faggioli@...rix.com>,
	"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>
CC:	Juergen Gross <jgross@...e.com>,
	Andrew Cooper <Andrew.Cooper3@...rix.com>,
	"Luis R. Rodriguez" <mcgrof@...not-panic.com>,
	David Vrabel <david.vrabel@...rix.com>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Stefano Stabellini <stefano.stabellini@...citrix.com>
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain
 hierarchy

On 08/18/2015 04:55 PM, Dario Faggioli wrote:
> Hey everyone,
> 
> So, as a follow-up to what we were discussing in this thread:
> 
>  [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>  http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
> 
> I started looking in more detail at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while this thing I'm proposing here is completely
> independent from them both.
> 
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are built.
> 
> Fact is, unless we use 1:1 pinning that is immutable across the whole
> guest lifetime, scheduling domains should not be constructed, in
> Linux, by looking at *any* topology information, because that just does
> not make any sense, when vcpus move around.
> 
> Let me state this again (hoping to make myself as clear as possible): no
> matter how good a shape we put CPUID support in, and no matter how
> beautifully and consistently it interacts with vNUMA, licensing
> requirements and whatever else, it will always be possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew its load balancing logic toward either of those
> two situations, as neither of them can be considered correct (since
> nothing is!).
> 
> For now, this only covers the PV case. HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen in
> there as well.
> 
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> vCPUs of the guest.
> 
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
> 
> This means that when, in the future, we fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have any
> unexpected side effects on scheduling domains.
> 
> Information about how the scheduling domains are being constructed
> during boot is available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
> 
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* folder
> in /proc/sys/kernel/sched_domain/ has only one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
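
The check described above can be scripted. What follows is only a minimal sketch, not part of the patch: it assumes the /proc/sys/kernel/sched_domain layout named in the message (one cpu* directory per CPU, each holding domainN subdirectories), and the function names list_sched_domains and is_flat are made up for illustration.

```python
import os
import re

# Path taken from the message; it may not exist on every kernel or configuration.
SCHED_DOMAIN_ROOT = "/proc/sys/kernel/sched_domain"


def list_sched_domains(root=SCHED_DOMAIN_ROOT):
    """Map each cpuN directory under root to its sorted domainN subdirectories.

    Returns an empty dict if root does not exist.
    """
    layout = {}
    if not os.path.isdir(root):
        return layout
    for cpu in sorted(os.listdir(root)):
        cpu_path = os.path.join(root, cpu)
        if not re.fullmatch(r"cpu\d+", cpu) or not os.path.isdir(cpu_path):
            continue
        layout[cpu] = sorted(
            d for d in os.listdir(cpu_path)
            if re.fullmatch(r"domain\d+", d)
            and os.path.isdir(os.path.join(cpu_path, d))
        )
    return layout


def is_flat(layout):
    """True if at least one CPU was found and every CPU has only 'domain0'."""
    return bool(layout) and all(doms == ["domain0"] for doms in layout.values())


if __name__ == "__main__":
    layout = list_sched_domains()
    for cpu, domains in layout.items():
        print(cpu, domains)
    print("flat hierarchy:", is_flat(layout))
```

On a patched PV guest one would expect every CPU to list only domain0; on a vanilla kernel the same script should show the additional levels.
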
> 
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on
> hosts with 16, 24 and 48 pCPUs. I've run the benchmarks in Dom0 only, for
> now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAUI).
> 
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> numbers of build jobs (in the case of the Xen build) or of parallel copies
> of the benchmarks (in the case of UnixBench).
> 
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones. There are a couple of cases
> where we regress, but always only slightly so, especially when compared
> to the magnitude of some of the improvements that we get.
> 
> Bear also in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> we move things in DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
> 
> RESULTS
> =======
> To have a quick idea of how a benchmark went, look at the '%
> improvement' row of each table.
> 
> I'll put these results online, in a googledoc spreadsheet or something
> like that, to make them easier to read, as soon as possible.
> 
> *** Intel(R) Xeon(R) E5620 @ 2.40GHz                                                                                                                    
> *** pCPUs      16        DOM0 vCPUS  16
> *** RAM        12285 MB  DOM0 Memory 9955 MB
> *** NUMA nodes 2         
> =======================================================================================================================================
> MAKE XEN (lower == better)                                                                                                                            
> =======================================================================================================================================
> # of build jobs                     -j1                   -j6                   -j8                   -j16**                -j24                
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               153.72     152.41      35.33      34.93       30.7      30.33      26.79      25.97      26.88      26.21
>                               153.81     152.76      35.37      34.99      30.81      30.36      26.83      26.08         27      26.24
>                               153.93     152.79      35.37      35.25      30.92      30.39      26.83      26.13      27.01      26.28
>                               153.94     152.94      35.39      35.28      31.05      30.43       26.9      26.14      27.01      26.44
>                               153.98     153.06      35.45      35.31      31.17       30.5      26.95      26.18      27.02      26.55
>                               154.01     153.23       35.5      35.35       31.2      30.59      26.98       26.2      27.05      26.61
>                               154.04     153.34      35.56      35.42      31.45      30.76      27.12      26.21      27.06      26.78
>                               154.16      153.5      37.79      35.58      31.68      30.83      27.16      26.23      27.16      26.78
>                               154.18     153.71      37.98      35.61      33.73       30.9      27.49      26.32      27.16       26.8
>                               154.9      154.67      38.03      37.64      34.69      31.69      29.82      26.38       27.2      28.63
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                        154.067    153.241     36.177     35.536      31.74     30.678     27.287     26.184     27.055     26.732
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     0.325      0.631      1.215      0.771      1.352      0.410      0.914      0.116      0.095      0.704
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                            0.536                 1.772                 3.346                 4.042                 1.194
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies                            1 parallel            6 parallel            8 parallel            16 parallel**         24 parallel
> vanilla/patched                          vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables       2302.2     2302.1    13157.8    12262.4    15691.5    15860.1    18927.7    19078.5    18654.3    18855.6
> Double-Precision Whetstone                  620.2      620.2     3481.2     3566.9     4669.2     4551.5     7610.1     7614.3    11558.9    11561.3
> Execl Throughput                            184.3      186.7      884.6      905.3     1168.4     1213.6     2134.6     2210.2     2250.9       2265
> File Copy 1024 bufsize 2000 maxblocks       780.8      783.3     1243.7     1255.5     1250.6     1215.7     1080.9     1094.2     1069.8     1062.5
> File Copy 256 bufsize 500 maxblocks         479.8      482.8      781.8      803.6      806.4        781      682.9      707.7      698.2      694.6
> File Copy 4096 bufsize 8000 maxblocks      1617.6     1593.5     2739.7     2943.4     2818.3     2957.8     2389.6     2412.6     2371.6     2423.8
> Pipe Throughput                             363.9      361.6     2068.6     2065.6       2622     2633.5     4053.3     4085.9     4064.7     4076.7
> Pipe-based Context Switching                 70.6      207.2      369.1     1126.8      623.9     1431.3     1970.4     2082.9     1963.8       2077
> Process Creation                            103.1        135        503      677.6      618.7      855.4       1138     1113.7     1195.6       1199
> Shell Scripts (1 concurrent)                723.2      765.3     4406.4     4334.4     5045.4     5002.5     5861.9     5844.2     5958.8     5916.1
> Shell Scripts (8 concurrent)               2243.7     2715.3     5694.7     5663.6     5694.7     5657.8     5637.1     5600.5     5582.9     5543.6
> System Call Overhead                          330      330.1     1669.2     1672.4     2028.6     1996.6     2920.5     2947.1     2923.9     2952.5
> System Benchmarks Index Score               496.8      567.5     1861.9       2106     2220.3     2441.3     2972.5     3007.9     3103.4     3125.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                       14.231                13.110                 9.954                 1.191                 0.706
> ====================================================================================================================================================
> 
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      24        DOM0 vCPUS  16
> *** RAM        36851 MB  DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs                     -j1                   -j8                   -j12                   -j24**               -j32
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               119.49     119.47      23.37      23.29      20.12      19.85      17.99       17.9      17.82       17.8
>                               119.59     119.64      23.52      23.31      20.16      19.99      18.19      18.05      18.23      17.89
>                               119.59     119.65      23.53      23.35      20.19      20.08      18.26      18.09      18.35      17.91
>                               119.72     119.75      23.63      23.41       20.2      20.14      18.54       18.1       18.4      17.95
>                               119.95     119.86      23.68      23.42      20.24      20.19      18.57      18.15      18.44      18.03
>                               119.97      119.9      23.72      23.51      20.38      20.31      18.61      18.21      18.49      18.03
>                               119.97     119.91      25.03      23.53      20.38      20.42      18.75      18.28      18.51      18.08
>                               120.01     119.98      25.05      23.93      20.39      21.69      19.99      18.49      18.52       18.6
>                               120.24     119.99      25.12      24.19      21.67      21.76      20.08      19.74      19.73      19.62
>                               120.66     121.22      25.16      25.36      21.94      21.85      20.26       20.3      19.92      19.81
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                        119.919    119.937     24.181      23.73     20.567     20.628     18.924     18.531     18.641     18.372
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     0.351      0.481      0.789      0.642      0.663      0.802      0.851      0.811      0.658      0.741
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                           -0.015                 1.865                -0.297                 2.077                 1.443
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies                            1 parallel            8 parallel             12 parallel           24 parallel**         32 parallel
> vanilla/patched                          vanilla     patched   vanilla     patched    vanilla    patched    vanilla    patched    vanilla    patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables       2650.1     2664.6    18967.8    19060.4    27534.1    27046.8    30077.9    30110.6    30542.1    30358.7
> Double-Precision Whetstone                  713.7      713.5     5463.6     5455.1     7863.9     7923.8    12725.1    12727.8    17474.3    17463.3
> Execl Throughput                            280.9      283.8     1724.4     1866.5     2029.5     2367.6       2370     2521.3       2453     2506.8
> File Copy 1024 bufsize 2000 maxblocks       891.1      894.2       1423     1457.7     1385.6     1482.2     1226.1     1224.2     1235.9     1265.5
> File Copy 256 bufsize 500 maxblocks         546.9      555.4        949      972.1      882.8      878.6      821.9      817.7      784.7      810.8
> File Copy 4096 bufsize 8000 maxblocks      1743.4     1722.8     3406.5     3438.9     3314.3     3265.9     2801.9     2788.3     2695.2     2781.5
> Pipe Throughput                             426.8      423.4     3207.9       3234     4635.1     4708.9       7326     7335.3     7327.2     7319.7
> Pipe-based Context Switching                110.2      223.5      680.8     1602.2      998.6     2324.6     3122.1     3252.7     3128.6     3337.2
> Process Creation                            130.7      224.4     1001.3     1043.6       1209     1248.2     1337.9     1380.4     1338.6     1280.1
> Shell Scripts (1 concurrent)               1140.5     1257.5     5462.8     6146.4     6435.3     7206.1     7425.2     7636.2     7566.1     7636.6
> Shell Scripts (8 concurrent)                 3492     3586.7     7144.9       7307       7258     7320.2     7295.1     7296.7     7248.6     7252.2
> System Call Overhead                        387.7      387.5     2398.4       2367     2793.8     2752.7     3735.7     3694.2     3752.1     3709.4
> System Benchmarks Index Score               634.8      712.6     2725.8     3005.7     3232.4     3569.7     3981.3     4028.8     4085.2     4126.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                       12.256                10.269                10.435                 1.193                 1.006
> ====================================================================================================================================================
> 
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      48        DOM0 vCPUS  16
> *** RAM        393138 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs                     -j1                   -j20                   -j24                  -j48**               -j62
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               267.78     233.25      36.53      35.53      35.98      34.99      33.46      32.13      33.57      32.54
>                               268.42     233.92      36.82      35.56      36.12       35.2      34.24      32.24      33.64      32.56
>                               268.85     234.39      36.92      35.75      36.15      35.35      34.48      32.86      33.67      32.74
>                               268.98     235.11      36.96      36.01      36.25      35.46      34.73      32.89      33.97      32.83
>                               269.03     236.48      37.04      36.16      36.45      35.63      34.77      32.97      34.12      33.01
>                               269.54     237.05      40.33      36.59      36.57      36.15      34.97      33.09      34.18      33.52
>                               269.99     238.24      40.45      36.78      36.58      36.22      34.99      33.69      34.28      33.63
>                               270.11     238.48      41.13      39.98      40.22      36.24         38      33.92      34.35      33.87
>                               270.96     239.07      41.66      40.81      40.59      36.35      38.99      34.19      34.49      37.24
>                               271.84     240.89      42.07      41.24      40.63      40.06      39.07      36.04      34.69      37.59
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                         269.55    236.688     38.991     37.441     37.554     36.165      35.77     33.402     34.096     33.953
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     1.213      2.503      2.312      2.288      2.031      1.452      2.079      1.142      0.379      1.882
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                           12.191                 3.975                 3.699                 6.620                 0.419
> ========================================================================================================================================
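
For anyone re-checking the summary rows, here is a small sketch of how they appear to be derived. The run times are copied from the -j1 columns of the first (16 pCPU) table; the formulas are an assumption inferred from the reported figures (sample standard deviation, and relative change with respect to the vanilla value), not something stated in the message. The helper names are illustrative.

```python
import statistics

# "MAKE XEN" -j1 build times (seconds) on the 16-pCPU host, from the first table.
VANILLA_J1 = [153.72, 153.81, 153.93, 153.94, 153.98,
              154.01, 154.04, 154.16, 154.18, 154.9]
PATCHED_J1 = [152.41, 152.76, 152.79, 152.94, 153.06,
              153.23, 153.34, 153.5, 153.71, 154.67]


def pct_improvement(vanilla_times, patched_times):
    """Relative reduction of the mean run time, in percent (lower time is better)."""
    v = statistics.mean(vanilla_times)
    p = statistics.mean(patched_times)
    return (v - p) / v * 100.0


def pct_increase(vanilla_score, patched_score):
    """Relative increase of a benchmark score, in percent (higher score is better)."""
    return (patched_score - vanilla_score) / vanilla_score * 100.0


if __name__ == "__main__":
    print("avg vanilla:    %.3f" % statistics.mean(VANILLA_J1))
    print("stddev vanilla: %.3f" % statistics.stdev(VANILLA_J1))  # sample std. dev.
    print("%% improvement:  %.3f" % pct_improvement(VANILLA_J1, PATCHED_J1))
    # UnixBench Index Score, 1 parallel copy, 16-pCPU host.
    print("%% increase:     %.3f" % pct_increase(496.8, 567.5))
```

With these inputs the computed values agree with the 'Avg.', 'Std. Dev.', '% improvement' and '% increase' entries of the corresponding columns to the three decimals shown there.
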

I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
tests, you change the -j number (apparently) based on the number of
pcpus available to Xen.  Wouldn't it make more sense to stick with
1/6/8/16/24?  That would allow us to have actually comparable numbers.

But in any case, it seems to me that the numbers do show a uniform
improvement and no regressions -- I think this approach looks really
good, particularly as it is so small and well-contained.

 -George

