Message-ID: <55E707E9.6000805@suse.com>
Date:	Wed, 2 Sep 2015 16:30:01 +0200
From:	Juergen Gross <jgross@...e.com>
To:	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	Dario Faggioli <dario.faggioli@...rix.com>,
	"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>
Cc:	Andrew Cooper <Andrew.Cooper3@...rix.com>,
	"Luis R. Rodriguez" <mcgrof@...not-panic.com>,
	David Vrabel <david.vrabel@...rix.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Stefano Stabellini <stefano.stabellini@...citrix.com>,
	George Dunlap <George.Dunlap@...rix.com>
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain
 hierarchy

On 09/02/2015 04:08 PM, Boris Ostrovsky wrote:
> On 09/02/2015 07:58 AM, Juergen Gross wrote:
>> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>>>
>>>
>>> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>>> Hey everyone,
>>>>>
>>>>> So, as a followup of what we were discussing in this thread:
>>>>>
>>>>>   [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>>>
>>>>> I started looking in more detail at scheduling domains in the Linux
>>>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>>>>> of interacting, while this thing I'm proposing here is completely
>>>>> independent from them both.
>>>>>
>>>>> In fact, no matter whether vNUMA is supported and enabled, and no matter
>>>>> whether CPUID is reporting accurate, random, meaningful or completely
>>>>> misleading information, I think that we should do something about how
>>>>> scheduling domains are built.
>>>>>
>>>>> Fact is, unless we use 1:1, and immutable (across all the guest
>>>>> lifetime) pinning, scheduling domains should not be constructed, in
>>>>> Linux, by looking at *any* topology information, because that just does
>>>>> not make any sense, when vcpus move around.
>>>>>
>>>>> Let me state this again (hoping to make myself as clear as possible): no
>>>>> matter how good a shape we put CPUID support in, no matter how
>>>>> beautifully and consistently that will interact with vNUMA,
>>>>> licensing requirements and whatever else, it will always be possible for
>>>>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>>>>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>>>>> should really not skew its load balancing logic toward either of those
>>>>> two situations, as neither of them could be considered correct (since
>>>>> nothing is!).
>>>>>
>>>>> For now, this only covers the PV case. The HVM case shouldn't be any
>>>>> different, but I haven't looked at how to make the same thing happen
>>>>> there as well.
>>>>>
>>>>> OVERALL DESCRIPTION
>>>>> ===================
>>>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>>>> domains in such a way that there is only one of them, spanning all the
>>>>> vCPUs of the guest.
>>>>>
>>>>> Note that the patch deals directly with scheduling domains, and there is
>>>>> no need to alter the masks that will then be used for building and
>>>>> reporting the topology (via CPUID, /proc/cpuinfo, sysfs, etc.). That is
>>>>> the main difference between it and the patch proposed by Juergen here:
>>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>>>
>>>>> This means that when, in the future, we fix CPUID handling and make it
>>>>> comply with whatever logic or requirements we want, that won't have any
>>>>> unexpected side effects on scheduling domains.
>>>>>
>>>>> Information about how the scheduling domains are being constructed
>>>>> during boot is available in `dmesg', if the kernel is booted with the
>>>>> 'sched_debug' parameter. It is also possible to look at
>>>>> /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>>>
>>>>> With the patch applied, only one scheduling domain is created, called
>>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>>>> tell that from the fact that every cpu* directory in
>>>>> /proc/sys/kernel/sched_domain/ has only one subdirectory
>>>>> ('domain0'), with all the tweaks and the tunables for our scheduling
>>>>> domain.
>>>>>
>>>>> EVALUATION
>>>>> ==========
>>>>> I've tested this with UnixBench, and by looking at Xen build time, on
>>>>> hosts with 16, 24 and 48 pCPUs. I've run the benchmarks in Dom0 only, for
>>>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>>>> something similar to this in DomU already, AFAUI).
>>>>>
>>>>> I've run the benchmarks with and without the patch applied ('patched'
>>>>> and 'vanilla', respectively, in the tables below), and with different
>>>>> numbers of build jobs (in the case of the Xen build) or of parallel
>>>>> copies of the benchmarks (in the case of UnixBench).
>>>>>
>>>>> What I get from the numbers is that the patch almost always brings
>>>>> benefits, in some cases even huge ones. There are a couple of cases
>>>>> where we regress, but always only slightly so, especially when compared
>>>>> to the magnitude of some of the improvements that we get.
>>>>>
>>>>> Bear also in mind that these results are gathered from Dom0, and without
>>>>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
>>>>> we move things in DomU and do overcommit at the Xen scheduler level, I
>>>>> am expecting even better results.
>>>>>
>>>> ...
>>>>> REQUEST FOR COMMENTS
>>>>> ====================
>>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>>   - what you guys think of the approach,
>>>>
>>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>>> myself) discussed this topic again.
>>>>
>>>> Regarding a possible future scenario with credit2 eventually supporting
>>>> gang scheduling on hyperthreads (which is desirable for security
>>>> reasons [side channel attack] and fairness) my patch seems to be more
>>>> suited for that direction than yours. Correct me if I'm wrong, but I
>>>> think scheduling domains won't enable the guest kernel's scheduler to
>>>> migrate threads more easily between hyperthreads than between other
>>>> vcpus, while my approach can easily be extended to do so.
>>>>
>>>>>   - whether you think, looking at this preliminary set of numbers, that
>>>>>     this is something worth continuing investigating,
>>>>
>>>> I believe that, as both approaches lead to the same topology information
>>>> used by the scheduler (all vcpus are regarded as being equal), your
>>>> numbers should apply to my patch as well. Would you mind verifying this?
>>>
>>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
>>> both of your patches?
>>
>> Hmm, sort of.
>>
>> OTOH this would make it hard to make use of some of the topology
>> information in case of e.g. pinned vcpus (as George pointed out).
>
>
> I didn't mean to just set has_mp to zero unconditionally (for Xen, or
> any other, guest). We'd need to have some logic as to when to set it to
> false.

If we want to be able to use some of the topology information, this
would mean we'd have two different mechanisms: one to disable all
topology usage and one to disable only parts of it. I'd rather have a
single way to specify which levels of the topology information (NUMA
nodes, cache siblings, core siblings) are to be used. Using none is then
just one possibility, with all levels disabled.
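
Purely as an illustration of the per-level selection idea (this is not
code from either patch), a minimal user-space sketch in C could look like
the one below. The flag names, the comma-separated syntax and the
functions topo_parse_levels() and topo_level_enabled() are all
hypothetical; nothing like them exists in the kernel under these names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bit flags, one per topology level mentioned above. */
#define TOPO_LEVEL_NUMA   (1u << 0)  /* NUMA nodes */
#define TOPO_LEVEL_CACHE  (1u << 1)  /* cache siblings */
#define TOPO_LEVEL_CORE   (1u << 2)  /* core siblings */
#define TOPO_LEVEL_NONE   0u
#define TOPO_LEVEL_ALL    (TOPO_LEVEL_NUMA | TOPO_LEVEL_CACHE | TOPO_LEVEL_CORE)

struct topo_name {
	const char *name;
	unsigned int bits;
};

/* Hypothetical keywords accepted in the option string. */
static const struct topo_name topo_names[] = {
	{ "numa",  TOPO_LEVEL_NUMA  },
	{ "cache", TOPO_LEVEL_CACHE },
	{ "core",  TOPO_LEVEL_CORE  },
	{ "all",   TOPO_LEVEL_ALL   },
	{ "none",  TOPO_LEVEL_NONE  },
};

/* Look up one token of length len; return 0 and set *bits if known. */
static int topo_lookup(const char *tok, size_t len, unsigned int *bits)
{
	size_t i;

	for (i = 0; i < sizeof(topo_names) / sizeof(topo_names[0]); i++) {
		if (strlen(topo_names[i].name) == len &&
		    strncmp(tok, topo_names[i].name, len) == 0) {
			*bits = topo_names[i].bits;
			return 0;
		}
	}
	return -1;
}

/*
 * Parse a comma-separated list such as "numa,core" into a bitmask.
 * Returns 0 on success, -1 on an unknown keyword (*mask is then left
 * untouched). An empty string selects no level at all.
 */
int topo_parse_levels(const char *arg, unsigned int *mask)
{
	unsigned int result = TOPO_LEVEL_NONE;

	while (*arg) {
		size_t len = strcspn(arg, ",");
		unsigned int bits;

		if (len > 0) {
			if (topo_lookup(arg, len, &bits) != 0)
				return -1;
			result |= bits;
		}
		arg += len;
		if (*arg == ',')
			arg++;
	}
	*mask = result;
	return 0;
}

/* Non-zero if the given level should be used when building domains. */
int topo_level_enabled(unsigned int mask, unsigned int level)
{
	return (mask & level) != 0;
}
```

With something along these lines, the fully flat case is just the mask
TOPO_LEVEL_NONE, and a guest with pinned vcpus spanning nodes could keep
only the NUMA level by passing "numa".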


Juergen

>>
>>> Also, it seems to me that Xen guests would not be the only ones having
>>> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
>>> guests, for example, have the same problem? And if yes, perhaps we
>>> should try solving it in non-Xen-specific way (especially given that
>>> both of those patches look pretty simple and thus are presumably easy to
>>> integrate into common code).
>>
>> Indeed. I'll have a try.
>>
>>> And, as George already pointed out, this should be an optional feature
>>> --- if a guest spans physical nodes and VCPUs are pinned then we don't
>>> always want flat topology/domains.
>>
>> Yes, it might be a good idea to be able to keep some of the topology
>> levels. I'll modify my patch to make this command line selectable.
