Date:	Tue, 22 Sep 2015 06:42:51 +0200
From:	Juergen Gross <jgross@...e.com>
To:	Dario Faggioli <dario.faggioli@...rix.com>
Cc:	"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>,
	Andrew Cooper <Andrew.Cooper3@...rix.com>,
	"Luis R. Rodriguez" <mcgrof@...not-panic.com>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	George Dunlap <George.Dunlap@...rix.com>,
	David Vrabel <david.vrabel@...rix.com>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	Stefano Stabellini <stefano.stabellini@...citrix.com>
Subject: Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the
 scheduling domain hierarchy

On 09/21/2015 07:49 AM, Juergen Gross wrote:
> On 09/15/2015 06:50 PM, Dario Faggioli wrote:
>> On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>> Hey everyone,
>>>>
>>>> So, as a followup of what we were discussing in this thread:
>>>>
>>>>    [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>>
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>>
>>>>
>>>> I started looking in more detail at scheduling domains in the Linux
>>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>>>> of interacting, while this thing I'm proposing here is completely
>>>> independent of them both.
>>>>
>>>> In fact, no matter whether vNUMA is supported and enabled, and no
>>>> matter
>>>> whether CPUID is reporting accurate, random, meaningful or completely
>>>> misleading information, I think that we should do something about how
>>>> scheduling domains are built.
>>>>
>>>> Fact is, unless we use 1:1 and immutable (across the guest's whole
>>>> lifetime) pinning, scheduling domains should not be constructed, in
>>>> Linux, by looking at *any* topology information, because that just
>>>> does not make any sense when vcpus move around.
>>>>
>>>> Let me state this again (hoping to make myself as clear as possible):
>>>> no matter how good a shape we get CPUID support into, no matter how
>>>> beautifully and consistently that will interact with vNUMA, licensing
>>>> requirements and whatever else, it will always be possible for vCPU #0
>>>> and vCPU #3 to be scheduled on two SMT threads at time t1, and on two
>>>> different NUMA nodes at time t2. Hence, the Linux scheduler should
>>>> really not skew its load balancing logic toward either of those two
>>>> situations, as neither of them could be considered correct (since
>>>> nothing is!).
>>>>
>>>> For now, this only covers the PV case. The HVM case shouldn't be any
>>>> different, but I haven't looked at how to make the same thing happen
>>>> there as well.
>>>>
>>>> OVERALL DESCRIPTION
>>>> ===================
>>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>>> domains in such a way that there is only one of them, spanning all the
>>>> pCPUs of the guest.
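>>>>
>>>> Roughly speaking (this is just to give the idea, not necessarily the
>>>> exact hunk), the patch boils down to handing the scheduler a topology
>>>> table with one single level:
>>>>
>>>>   static struct sched_domain_topology_level xen_sched_domain_topology[] = {
>>>>           /* one flat level spanning everything, instead of SMT/MC/DIE */
>>>>           { cpu_cpu_mask, SD_INIT_NAME(VCPU) },
>>>>           { NULL, },
>>>>   };
>>>>
>>>>   static void __init xen_set_sched_topology(void)
>>>>   {
>>>>           set_sched_topology(xen_sched_domain_topology);
>>>>   }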
>>>>
>>>> Note that the patch deals directly with scheduling domains, and
>>>> there is
>>>> no need to alter the masks that will then be used for building and
>>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.).
>>>> That is
>>>> the main difference between it and the patch proposed by Juergen here:
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>>
>>>>
>>>> This means that when, in the future, we fix CPUID handling and make it
>>>> comply with whatever logic or requirements we want, that won't have
>>>> any unexpected side effects on scheduling domains.
>>>>
>>>> Information about how the scheduling domains are constructed during
>>>> boot is available in `dmesg', if the kernel is booted with the
>>>> 'sched_debug' parameter. It is also possible to look
>>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>>
>>>> With the patch applied, only one scheduling domain is created, called
>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>>> tell that from the fact that every cpu* folder
>>>> in /proc/sys/kernel/sched_domain/ only has one subdirectory
>>>> ('domain0'), with all the tweaks and the tunables for our scheduling
>>>> domain.
>>>>
>>>> EVALUATION
>>>> ==========
>>>> I've tested this with UnixBench, and by looking at Xen build time, on
>>>> 16-, 24- and 48-pCPU hosts. I've run the benchmarks in Dom0 only, for
>>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>>> something similar to this in DomU already, AFAUI).
>>>>
>>>> I've run the benchmarks with and without the patch applied ('patched'
>>>> and 'vanilla', respectively, in the tables below), and with different
>>>> numbers of build jobs (in the case of the Xen build) or of parallel
>>>> copies of the benchmarks (in the case of UnixBench).
>>>>
>>>> What I get from the numbers is that the patch almost always brings
>>>> benefits, in some cases even huge ones. There are a couple of cases
>>>> where we regress, but always only slightly, especially when compared
>>>> to the magnitude of some of the improvements we get.
>>>>
>>>> Bear in mind also that these results are gathered from Dom0, and
>>>> without any overcommitment at the vCPU level (i.e., nr. vCPUs == nr.
>>>> pCPUs). If we move things into DomU and overcommit at the Xen
>>>> scheduler level, I expect even better results.
>>>>
>>> ...
>>>> REQUEST FOR COMMENTS
>>>> ====================
>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>    - what you guys think of the approach,
>>>
>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>> myself) discussed this topic again.
>>>
>> Hey,
>>
>> Sorry for replying so late, I've been on vacation from right after
>> XenSummit up until yesterday. :-)
>>
>>> Regarding a possible future scenario with credit2 eventually supporting
>>> gang scheduling on hyperthreads (which is desirable for security
>>> reasons [side channel attacks] and for fairness), my patch seems to be
>>> better suited for that direction than yours.
>>>
>> Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
>> exactly around the corner (although, we can prioritize working on it if
>> we want).
>>
>> In principle, I think it's a really nice idea. I still don't have a
>> clear picture of how we would handle a couple of situations, but let's
>> leave this aside for now, and stay on topic.
>>
>>> Correct me if I'm wrong, but I
>>> think scheduling domains won't enable the guest kernel's scheduler to
>>> migrate threads more easily between hyperthreads as opposed to other
>>> vcpus, while my approach can easily be extended to do so.
>>>
>> I'm not sure I understand what you mean here. As far as the (Linux)
>> scheduler is concerned, your patch and mine do the exact same thing:
>> they arrange for the scheduling domains, when they're built during
>> boot, not to consider hyperthreads or multi-cores.
>>
>> Mine does it by removing the SMT (and the MC) level from the data
>> structure in the scheduler that is used as a base for configuring the
>> scheduling domains. Yours does it by making the topology bitmaps that
>> are used at each one of those levels all look the same. In fact, with
>> your patch applied, I get the exact same situation as with mine, as far
>> as scheduling domains are concerned: there is only one scheduling
>> domain, with a different scheduling group for each vCPU inside it.
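>>
>> For reference, the default table in kernel/sched/core.c that both our
>> patches effectively collapse down to one level is, IIRC, roughly this:
>>
>>   static struct sched_domain_topology_level default_topology[] = {
>>   #ifdef CONFIG_SCHED_SMT
>>           { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
>>   #endif
>>   #ifdef CONFIG_SCHED_MC
>>           { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
>>   #endif
>>           { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>           { NULL, },
>>   };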
>
> Uuh, nearly.
>
> Your case won't deal correctly with NUMA, as the generic NUMA code is
> using set_sched_topology() as well. Either NUMA or Xen will win and
> overwrite the other's settings.
>
> To do things correctly you will have to handle NUMA as well.
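
For context: set_sched_topology(), if I read kernel/sched/core.c
correctly, does nothing more than replace the single global pointer the
scheduling domains are built from, so whichever caller runs last wins:

    void set_sched_topology(struct sched_domain_topology_level *tl)
    {
            sched_domain_topology = tl;
    }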

One other thing I just discovered: there are other consumers of the
topology sibling masks (e.g. topology_sibling_cpumask()) as well.

I think we would want to avoid any optimizations based on those in
drivers as well, not only in the scheduler.
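
Just to illustrate the kind of pattern I mean (a made-up example, not
taken from any specific driver): a check like the following, used e.g.
when spreading queues or work items over cpus, would be based on
sibling information which doesn't reflect where the vcpus actually run
under Xen:

    #include <linux/cpumask.h>
    #include <linux/topology.h>

    /* Made-up example: report whether one of this cpu's SMT siblings
     * is already in the 'used' set, so the caller can prefer an idle
     * core. In a Xen guest with floating vcpus this optimizes for the
     * wrong thing. */
    static bool sibling_already_used(int cpu, const struct cpumask *used)
    {
            return cpumask_intersects(topology_sibling_cpumask(cpu), used);
    }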


Juergen

