Message-ID: <55E702E7.6070709@oracle.com>
Date: Wed, 02 Sep 2015 10:08:39 -0400
From: Boris Ostrovsky <boris.ostrovsky@...cle.com>
To: Juergen Gross <jgross@...e.com>,
Dario Faggioli <dario.faggioli@...rix.com>,
"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>
CC: Andrew Cooper <Andrew.Cooper3@...rix.com>,
"Luis R. Rodriguez" <mcgrof@...not-panic.com>,
David Vrabel <david.vrabel@...rix.com>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Stefano Stabellini <stefano.stabellini@...citrix.com>,
George Dunlap <George.Dunlap@...rix.com>
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
On 09/02/2015 07:58 AM, Juergen Gross wrote:
> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>>
>>
>> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>> Hey everyone,
>>>>
>>>> So, as a followup of what we were discussing in this thread:
>>>>
>>>> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>>
>>>> I started looking in more detail at scheduling domains in the Linux
>>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird
>>>> way of interacting, while what I'm proposing here is completely
>>>> independent of them both.
>>>>
>>>> In fact, no matter whether vNUMA is supported and enabled, and no
>>>> matter whether CPUID is reporting accurate, random, meaningful or
>>>> completely misleading information, I think we should do something
>>>> about how scheduling domains are built.
>>>>
>>>> The fact is, unless we use 1:1 pinning that is immutable across the
>>>> guest's entire lifetime, scheduling domains should not be constructed
>>>> in Linux by looking at *any* topology information, because that just
>>>> does not make sense when vcpus move around.
>>>>
>>>> Let me state this again (hoping to make myself as clear as possible):
>>>> no matter how good a shape we get CPUID support into, and no matter
>>>> how beautifully and consistently it will interact with vNUMA,
>>>> licensing requirements and whatever else, it will always be possible
>>>> for vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1,
>>>> and on two different NUMA nodes at time t2. Hence, the Linux scheduler
>>>> really should not skew its load balancing logic toward either of those
>>>> two situations, as neither of them can be considered correct (since
>>>> nothing is!).
>>>>
>>>> For now, this only covers the PV case. The HVM case shouldn't be any
>>>> different, but I haven't yet looked at how to make the same thing
>>>> happen there as well.
>>>>
>>>> OVERALL DESCRIPTION
>>>> ===================
>>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>>> domains in such a way that there is only one of them, spanning all of
>>>> the guest's vCPUs.
>>>>
>>>> Note that the patch deals directly with scheduling domains, and there
>>>> is no need to alter the masks that will then be used for building and
>>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That
>>>> is the main difference between it and the patch proposed by Juergen
>>>> here:
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>>
>>>> This means that when, in the future, we fix CPUID handling and make
>>>> it comply with whatever logic or requirements we want, that won't
>>>> have any unexpected side effects on scheduling domains.
>>>>
>>>> Information about how the scheduling domains are constructed during
>>>> boot is available in `dmesg', if the kernel is booted with the
>>>> 'sched_debug' parameter. It is also possible to look at
>>>> /proc/sys/kernel/sched_domain/cpu* and at /proc/schedstat.
>>>>
>>>> With the patch applied, only one scheduling domain is created, called
>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>>> tell that from the fact that every cpu* folder in
>>>> /proc/sys/kernel/sched_domain/ has only one subdirectory ('domain0'),
>>>> with all the tweaks and tunables for our scheduling domain.
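
For readers following along: based on the description above, a
single-level topology can be installed via set_sched_topology(). The
sketch below is illustrative only and is not the actual RFC patch; the
names xen_vcpu_span_mask, xen_flat_topology and
xen_pv_flatten_sched_domains are made up here.

    #include <linux/cpumask.h>
    #include <linux/init.h>
    #include <linux/sched.h>

    /* Illustrative sketch, not the RFC patch itself. */
    static const struct cpumask *xen_vcpu_span_mask(int cpu)
    {
            /* Every vCPU falls into one guest-wide span. */
            return cpu_online_mask;
    }

    static struct sched_domain_topology_level xen_flat_topology[] = {
            { xen_vcpu_span_mask, SD_INIT_NAME(VCPU) },
            { NULL, },
    };

    static void __init xen_pv_flatten_sched_domains(void)
    {
            /* Replace the default SMT/MC/DIE hierarchy with one level. */
            set_sched_topology(xen_flat_topology);
    }

With something like this in place, each /proc/sys/kernel/sched_domain/cpu*/
directory should contain only the single 'domain0' subdirectory mentioned
above.
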
>>>>
>>>> EVALUATION
>>>> ==========
>>>> I've tested this with UnixBench, and by looking at Xen build times,
>>>> on 16-, 24- and 48-pCPU hosts. I've run the benchmarks in Dom0 only,
>>>> for now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>>> something similar to this in DomU already, AFAUI).
>>>>
>>>> I've run the benchmarks with and without the patch applied ('patched'
>>>> and 'vanilla', respectively, in the tables below), and with different
>>>> numbers of build jobs (in the case of the Xen build) or of parallel
>>>> copies of the benchmarks (in the case of UnixBench).
>>>>
>>>> What I get from the numbers is that the patch almost always brings
>>>> benefits, in some cases even huge ones. There are a couple of cases
>>>> where we regress, but always only slightly, especially when compared
>>>> to the magnitude of some of the improvements we get.
>>>>
>>>> Bear in mind also that these results were gathered in Dom0, and
>>>> without any overcommitment at the vCPU level (i.e., nr. vCPUs == nr.
>>>> pCPUs). If we move things into DomUs and overcommit at the Xen
>>>> scheduler level, I expect even better results.
>>>>
>>> ...
>>>> REQUEST FOR COMMENTS
>>>> ====================
>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>> - what you guys think of the approach,
>>>
>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>> myself) discussed this topic again.
>>>
>>> Regarding a possible future scenario with credit2 eventually supporting
>>> gang scheduling on hyperthreads (which is desirable for security
>>> reasons [side channel attacks] and for fairness), my patch seems better
>>> suited to that direction than yours. Correct me if I'm wrong, but I
>>> think scheduling domains won't enable the guest kernel's scheduler to
>>> migrate threads more easily between hyperthreads as opposed to other
>>> vcpus, while my approach can easily be extended to do so.
>>>
>>>> - whether you think, looking at this preliminary set of numbers, that
>>>> this is something worth investigating further,
>>>
>>> I believe that, as both approaches lead to the same topology
>>> information being used by the scheduler (all vcpus are regarded as
>>> equal), your numbers should apply to my patch as well. Would you mind
>>> verifying this?
>>
>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
>> both of your patches?
>
> Hmm, sort of.
>
> OTOH this would make it hard to make use of some of the topology
> information in the case of, e.g., pinned vcpus (as George pointed out).
I didn't mean to just set has_mp to false unconditionally (for Xen, or
any other, guest). We'd need some logic as to when to set it to false.
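Roughly along these lines (untested sketch; xen_want_flat_topology() is
a made-up placeholder for whatever condition we settle on, e.g. "Xen
guest whose vcpus are not 1:1 pinned"):

    /* In arch/x86/kernel/smpboot.c:set_cpu_sibling_map(), approximately: */
    bool has_smt = smp_num_siblings > 1;
    bool has_mp = has_smt || boot_cpu_data.x86_max_cores > 1;

    /*
     * Untested sketch: xen_want_flat_topology() is a placeholder for
     * "vcpus may move around, so topology cannot be trusted".
     */
    if (xen_domain() && xen_want_flat_topology())
            has_mp = false;
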
-boris
>
>> Also, it seems to me that Xen guests would not be the only ones having
>> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
>> guests, for example, have the same problem? And if so, perhaps we
>> should try solving it in a non-Xen-specific way (especially given that
>> both of those patches look pretty simple and thus presumably easy to
>> integrate into common code).
>
> Indeed. I'll have a try.
>
>> And, as George already pointed out, this should be an optional feature
>> --- if a guest spans physical nodes and VCPUs are pinned then we don't
>> always want flat topology/domains.
>
> Yes, it might be a good idea to be able to keep some of the topology
> levels. I'll modify my patch to make this selectable on the command line.
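
As a purely illustrative sketch of such a switch (the parameter name
"xen_topology_levels" and its semantics are made up here, not
necessarily what the patch will end up doing), a boot parameter could
be wired up roughly like this:

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/kernel.h>

    /*
     * Illustrative only: a hypothetical "xen_topology_levels=N" boot
     * parameter selecting how many topology levels the guest keeps
     * (0 == fully flat).
     */
    static unsigned int xen_topology_levels;

    static int __init parse_xen_topology_levels(char *arg)
    {
            if (!arg)
                    return -EINVAL;
            return kstrtouint(arg, 0, &xen_topology_levels);
    }
    early_param("xen_topology_levels", parse_xen_topology_levels);
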
>
>
> Juergen