Message-ID: <55DEE556.3010802@citrix.com>
Date: Thu, 27 Aug 2015 11:24:22 +0100
From: George Dunlap <george.dunlap@...rix.com>
To: Dario Faggioli <dario.faggioli@...rix.com>,
"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>
CC: Juergen Gross <jgross@...e.com>,
Andrew Cooper <Andrew.Cooper3@...rix.com>,
"Luis R. Rodriguez" <mcgrof@...not-panic.com>,
David Vrabel <david.vrabel@...rix.com>,
Boris Ostrovsky <boris.ostrovsky@...cle.com>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Stefano Stabellini <stefano.stabellini@...citrix.com>
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

On 08/18/2015 04:55 PM, Dario Faggioli wrote:
> Hey everyone,
>
> So, as a follow-up to what we were discussing in this thread:
>
> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
> I started looking in more detail at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while the thing I'm proposing here is completely
> independent of them both.
>
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are built.
>
> The fact is that, unless we use 1:1 pinning that stays immutable across
> the whole guest lifetime, scheduling domains should not be constructed,
> in Linux, by looking at *any* topology information, because that just
> does not make any sense when vCPUs move around.
>
> Let me state this again (hoping to make myself as clear as possible): no
> matter how good a shape we put CPUID support in, and no matter how
> beautifully and consistently it interacts with vNUMA, licensing
> requirements and whatever else, it will always be possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew its load balancing logic toward either of those
> two situations, as neither of them can be considered correct (since
> nothing is!).
>
> For now, this only covers the PV case. The HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen
> there as well.
>
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> vCPUs of the guest.
>
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
> This means that when, in the future, we fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have any
> unexpected side effects on scheduling domains.
>
> Information about how the scheduling domains are being constructed
> during boot is available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* folder
> in /proc/sys/kernel/sched_domain/ has only one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
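
For anyone wanting to check this on a running guest, here is a minimal Python sketch, not part of the patch. It assumes the /proc/sys/kernel/sched_domain/cpu*/domain* layout described above is exposed by the running kernel; the function names and the `root` parameter are hypothetical and only there for illustration.

```python
from pathlib import Path


def sched_domains_per_cpu(root="/proc/sys/kernel/sched_domain"):
    """Map each cpu* directory under `root` to the sorted names of its domain* subdirectories."""
    result = {}
    for cpu_dir in sorted(Path(root).glob("cpu*")):
        if cpu_dir.is_dir():
            result[cpu_dir.name] = sorted(
                d.name for d in cpu_dir.glob("domain*") if d.is_dir()
            )
    return result


def is_flat(domains):
    """True if at least one CPU was found and every CPU has exactly one domain, 'domain0'."""
    return bool(domains) and all(names == ["domain0"] for names in domains.values())
```

Calling is_flat(sched_domains_per_cpu()) inside the guest should return True if the hierarchy really has been flattened to a single level. It returns False if any CPU has more than one domain, or if the directory is not there at all.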
>
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on
> hosts with 16, 24 and 48 pCPUs. I've run the benchmarks in Dom0 only,
> for now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAUI).
>
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> numbers of build jobs (in the case of the Xen build) or of parallel
> copies of the benchmarks (in the case of UnixBench).
>
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones. There are a couple of cases
> where we regress, but always only slightly so, especially if compared
> to the magnitude of some of the improvements that we get.
>
> Bear also in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
> we move things to DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
>
> RESULTS
> =======
> To have a quick idea of how a benchmark went, look at the '%
> improvement' row of each table.
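
As a sanity check on how the summary rows relate to the raw numbers, here is a small Python sketch. It is my own reconstruction, not the script used to produce the tables. The formulas are an assumption: sample standard deviation, and percentage change relative to vanilla. They do appear to reproduce the reported figures. The function names are hypothetical; the data are the -j24 columns of the first MAKE XEN table (16-pCPU host) and the "1 parallel" Index Score of the first UnixBench table.

```python
from statistics import mean, stdev

# Xen build times in seconds, -j24, 16-pCPU host (copied from the first MAKE XEN table).
vanilla = [26.88, 27.0, 27.01, 27.01, 27.02, 27.05, 27.06, 27.16, 27.16, 27.2]
patched = [26.21, 26.24, 26.28, 26.44, 26.55, 26.61, 26.78, 26.78, 26.8, 28.63]


def improvement_lower_is_better(base, new):
    """Percentage by which `new` is lower than `base` (e.g. build times)."""
    return (base - new) / base * 100.0


def increase_higher_is_better(base, new):
    """Percentage by which `new` is higher than `base` (e.g. benchmark scores)."""
    return (new - base) / base * 100.0


avg_vanilla = mean(vanilla)
avg_patched = mean(patched)

print(f"Avg.          {avg_vanilla:.3f} {avg_patched:.3f}")
print(f"Std. Dev.     {stdev(vanilla):.3f} {stdev(patched):.3f}")
print(f"% improvement {improvement_lower_is_better(avg_vanilla, avg_patched):.3f}")

# UnixBench Index Score, 1 parallel copy, 16-pCPU host: vanilla 496.8, patched 567.5.
print(f"% increase    {increase_higher_is_better(496.8, 567.5):.3f}")
```

With these inputs the averages come out as 27.055 and 26.732, the improvement as 1.194, and the Index Score increase as 14.231, matching the corresponding table entries.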
>
> I'll put these results online, in a googledoc spreadsheet or something
> like that, to make them easier to read, as soon as possible.
>
> *** Intel(R) Xeon(R) E5620 @ 2.40GHz
> *** pCPUs 16 DOM0 vCPUS 16
> *** RAM 12285 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =========================================================================================================
> MAKE XEN (lower == better)
> =========================================================================================================
> # of build jobs         -j1               -j6               -j8            -j16**              -j24
> vanilla/patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> ---------------------------------------------------------------------------------------------------------
>                   153.72   152.41    35.33    34.93     30.7    30.33    26.79    25.97    26.88    26.21
>                   153.81   152.76    35.37    34.99    30.81    30.36    26.83    26.08       27    26.24
>                   153.93   152.79    35.37    35.25    30.92    30.39    26.83    26.13    27.01    26.28
>                   153.94   152.94    35.39    35.28    31.05    30.43     26.9    26.14    27.01    26.44
>                   153.98   153.06    35.45    35.31    31.17     30.5    26.95    26.18    27.02    26.55
>                   154.01   153.23     35.5    35.35     31.2    30.59    26.98     26.2    27.05    26.61
>                   154.04   153.34    35.56    35.42    31.45    30.76    27.12    26.21    27.06    26.78
>                   154.16    153.5    37.79    35.58    31.68    30.83    27.16    26.23    27.16    26.78
>                   154.18   153.71    37.98    35.61    33.73     30.9    27.49    26.32    27.16     26.8
>                    154.9   154.67    38.03    37.64    34.69    31.69    29.82    26.38     27.2    28.63
> ---------------------------------------------------------------------------------------------------------
> Avg.             154.067  153.241   36.177   35.536    31.74   30.678   27.287   26.184   27.055   26.732
> ---------------------------------------------------------------------------------------------------------
> Std. Dev.          0.325    0.631    1.215    0.771    1.352    0.410    0.914    0.116    0.095    0.704
> ---------------------------------------------------------------------------------------------------------
> % improvement               0.536             1.772             3.346             4.042             1.194
> =========================================================================================================
> ================================================================================================================================
> UNIXBENCH
> ================================================================================================================================
> # parallel copies                         1 parallel        6 parallel        8 parallel       16 parallel**      24 parallel
> vanilla/patched                         vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> --------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables     2302.2   2302.1  13157.8  12262.4  15691.5  15860.1  18927.7  19078.5  18654.3  18855.6
> Double-Precision Whetstone                620.2    620.2   3481.2   3566.9   4669.2   4551.5   7610.1   7614.3  11558.9  11561.3
> Execl Throughput                          184.3    186.7    884.6    905.3   1168.4   1213.6   2134.6   2210.2   2250.9     2265
> File Copy 1024 bufsize 2000 maxblocks     780.8    783.3   1243.7   1255.5   1250.6   1215.7   1080.9   1094.2   1069.8   1062.5
> File Copy 256 bufsize 500 maxblocks       479.8    482.8    781.8    803.6    806.4      781    682.9    707.7    698.2    694.6
> File Copy 4096 bufsize 8000 maxblocks    1617.6   1593.5   2739.7   2943.4   2818.3   2957.8   2389.6   2412.6   2371.6   2423.8
> Pipe Throughput                           363.9    361.6   2068.6   2065.6     2622   2633.5   4053.3   4085.9   4064.7   4076.7
> Pipe-based Context Switching               70.6    207.2    369.1   1126.8    623.9   1431.3   1970.4   2082.9   1963.8     2077
> Process Creation                          103.1      135      503    677.6    618.7    855.4     1138   1113.7   1195.6     1199
> Shell Scripts (1 concurrent)              723.2    765.3   4406.4   4334.4   5045.4   5002.5   5861.9   5844.2   5958.8   5916.1
> Shell Scripts (8 concurrent)             2243.7   2715.3   5694.7   5663.6   5694.7   5657.8   5637.1   5600.5   5582.9   5543.6
> System Call Overhead                        330    330.1   1669.2   1672.4   2028.6   1996.6   2920.5   2947.1   2923.9   2952.5
> System Benchmarks Index Score             496.8    567.5   1861.9     2106   2220.3   2441.3   2972.5   3007.9   3103.4   3125.3
> --------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                   14.231            13.110             9.954             1.191             0.706
> ================================================================================================================================
>
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs 24 DOM0 vCPUS 16
> *** RAM 36851 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =========================================================================================================
> MAKE XEN (lower == better)
> =========================================================================================================
> # of build jobs         -j1               -j8              -j12            -j24**              -j32
> vanilla/patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> ---------------------------------------------------------------------------------------------------------
>                   119.49   119.47    23.37    23.29    20.12    19.85    17.99     17.9    17.82     17.8
>                   119.59   119.64    23.52    23.31    20.16    19.99    18.19    18.05    18.23    17.89
>                   119.59   119.65    23.53    23.35    20.19    20.08    18.26    18.09    18.35    17.91
>                   119.72   119.75    23.63    23.41     20.2    20.14    18.54     18.1     18.4    17.95
>                   119.95   119.86    23.68    23.42    20.24    20.19    18.57    18.15    18.44    18.03
>                   119.97    119.9    23.72    23.51    20.38    20.31    18.61    18.21    18.49    18.03
>                   119.97   119.91    25.03    23.53    20.38    20.42    18.75    18.28    18.51    18.08
>                   120.01   119.98    25.05    23.93    20.39    21.69    19.99    18.49    18.52     18.6
>                   120.24   119.99    25.12    24.19    21.67    21.76    20.08    19.74    19.73    19.62
>                   120.66   121.22    25.16    25.36    21.94    21.85    20.26     20.3    19.92    19.81
> ---------------------------------------------------------------------------------------------------------
> Avg.             119.919  119.937   24.181    23.73   20.567   20.628   18.924   18.531   18.641   18.372
> ---------------------------------------------------------------------------------------------------------
> Std. Dev.          0.351    0.481    0.789    0.642    0.663    0.802    0.851    0.811    0.658    0.741
> ---------------------------------------------------------------------------------------------------------
> % improvement              -0.015             1.865            -0.297             2.077             1.443
> =========================================================================================================
> ================================================================================================================================
> UNIXBENCH
> ================================================================================================================================
> # parallel copies                         1 parallel        8 parallel        12 parallel      24 parallel**      32 parallel
> vanilla/patched                         vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> --------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables     2650.1   2664.6  18967.8  19060.4  27534.1  27046.8  30077.9  30110.6  30542.1  30358.7
> Double-Precision Whetstone                713.7    713.5   5463.6   5455.1   7863.9   7923.8  12725.1  12727.8  17474.3  17463.3
> Execl Throughput                          280.9    283.8   1724.4   1866.5   2029.5   2367.6     2370   2521.3     2453   2506.8
> File Copy 1024 bufsize 2000 maxblocks     891.1    894.2     1423   1457.7   1385.6   1482.2   1226.1   1224.2   1235.9   1265.5
> File Copy 256 bufsize 500 maxblocks       546.9    555.4      949    972.1    882.8    878.6    821.9    817.7    784.7    810.8
> File Copy 4096 bufsize 8000 maxblocks    1743.4   1722.8   3406.5   3438.9   3314.3   3265.9   2801.9   2788.3   2695.2   2781.5
> Pipe Throughput                           426.8    423.4   3207.9     3234   4635.1   4708.9     7326   7335.3   7327.2   7319.7
> Pipe-based Context Switching              110.2    223.5    680.8   1602.2    998.6   2324.6   3122.1   3252.7   3128.6   3337.2
> Process Creation                          130.7    224.4   1001.3   1043.6     1209   1248.2   1337.9   1380.4   1338.6   1280.1
> Shell Scripts (1 concurrent)             1140.5   1257.5   5462.8   6146.4   6435.3   7206.1   7425.2   7636.2   7566.1   7636.6
> Shell Scripts (8 concurrent)               3492   3586.7   7144.9     7307     7258   7320.2   7295.1   7296.7   7248.6   7252.2
> System Call Overhead                      387.7    387.5   2398.4     2367   2793.8   2752.7   3735.7   3694.2   3752.1   3709.4
> System Benchmarks Index Score             634.8    712.6   2725.8   3005.7   3232.4   3569.7   3981.3   4028.8   4085.2   4126.3
> --------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                   12.256            10.269            10.435             1.193             1.006
> ================================================================================================================================
>
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs 48 DOM0 vCPUS 16
> *** RAM 393138 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =========================================================================================================
> MAKE XEN (lower == better)
> =========================================================================================================
> # of build jobs         -j1              -j20              -j24            -j48**              -j62
> vanilla/patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> ---------------------------------------------------------------------------------------------------------
>                   267.78   233.25    36.53    35.53    35.98    34.99    33.46    32.13    33.57    32.54
>                   268.42   233.92    36.82    35.56    36.12     35.2    34.24    32.24    33.64    32.56
>                   268.85   234.39    36.92    35.75    36.15    35.35    34.48    32.86    33.67    32.74
>                   268.98   235.11    36.96    36.01    36.25    35.46    34.73    32.89    33.97    32.83
>                   269.03   236.48    37.04    36.16    36.45    35.63    34.77    32.97    34.12    33.01
>                   269.54   237.05    40.33    36.59    36.57    36.15    34.97    33.09    34.18    33.52
>                   269.99   238.24    40.45    36.78    36.58    36.22    34.99    33.69    34.28    33.63
>                   270.11   238.48    41.13    39.98    40.22    36.24       38    33.92    34.35    33.87
>                   270.96   239.07    41.66    40.81    40.59    36.35    38.99    34.19    34.49    37.24
>                   271.84   240.89    42.07    41.24    40.63    40.06    39.07    36.04    34.69    37.59
> ---------------------------------------------------------------------------------------------------------
> Avg.              269.55  236.688   38.991   37.441   37.554   36.165    35.77   33.402   34.096   33.953
> ---------------------------------------------------------------------------------------------------------
> Std. Dev.          1.213    2.503    2.312    2.288    2.031    1.452    2.079    1.142    0.379    1.882
> ---------------------------------------------------------------------------------------------------------
> % improvement              12.191             3.975             3.699             6.620             0.419
> =========================================================================================================

I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
tests, you change the -j number (apparently) based on the number of
pcpus available to Xen. Wouldn't it make more sense to stick with
1/6/8/16/24? That would allow us to have actually comparable numbers.

But in any case, it seems to me that the numbers do show a uniform
improvement and no regressions -- I think this approach looks really
good, particularly as it is so small and well-contained.

 -George