Date:	Mon, 21 Sep 2015 07:49:04 +0200
From:	Juergen Gross <jgross@...e.com>
To:	Dario Faggioli <dario.faggioli@...rix.com>
Cc:	"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>,
	Andrew Cooper <Andrew.Cooper3@...rix.com>,
	"Luis R. Rodriguez" <mcgrof@...not-panic.com>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	George Dunlap <George.Dunlap@...rix.com>,
	David Vrabel <david.vrabel@...rix.com>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	Stefano Stabellini <stefano.stabellini@...citrix.com>
Subject: Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling
 domain hierarchy

On 09/15/2015 06:50 PM, Dario Faggioli wrote:
> On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>> Hey everyone,
>>>
>>> So, as a followup of what we were discussing in this thread:
>>>
>>>    [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>    http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>
>>> I started looking in more details at scheduling domains in the Linux
>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>>> of interacting, while this thing I'm proposing here is completely
>>> independent from them both.
>>>
>>> In fact, no matter whether vNUMA is supported and enabled, and no matter
>>> whether CPUID is reporting accurate, random, meaningful or completely
>>> misleading information, I think that we should do something about how
>>> scheduling domains are built.
>>>
>>> Fact is, unless we use 1:1 pinning that is immutable across the whole
>>> guest lifetime, scheduling domains should not be constructed, in
>>> Linux, by looking at *any* topology information, because that just
>>> does not make any sense when vCPUs move around.
>>>
>>> Let me state this again (hoping to make myself as clear as possible):
>>> no matter how good a shape we put CPUID support in, no matter how
>>> beautifully and consistently it will interact with vNUMA, licensing
>>> requirements and whatever else, it will always be possible for
>>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>>> really should not skew its load balancing logic toward either of those
>>> two situations, as neither of them can be considered correct (since
>>> nothing is!).
>>>
>>> For now, this only covers the PV case. The HVM case shouldn't be any
>>> different, but I haven't looked at how to make the same thing happen
>>> there as well.
>>>
>>> OVERALL DESCRIPTION
>>> ===================
>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>> domains in such a way that there is only one of them, spanning all the
>>> pCPUs of the guest.
>>>
>>> Note that the patch deals directly with scheduling domains, and there is
>>> no need to alter the masks that will then be used for building and
>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
>>> the main difference between it and the patch proposed by Juergen here:
>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>
>>> This means that when, in the future, we fix CPUID handling and make it
>>> comply with whatever logic or requirements we want, that won't have any
>>> unexpected side effects on scheduling domains.
>>>
>>> Information about how the scheduling domains are constructed during
>>> boot is available in `dmesg', if the kernel is booted with the
>>> 'sched_debug' parameter. It is also possible to look
>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>
>>> With the patch applied, only one scheduling domain is created, called
>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>> tell that from the fact that every cpu* folder
>>> in /proc/sys/kernel/sched_domain/ only has one subdirectory
>>> ('domain0'), with all the tweaks and the tunables for our scheduling
>>> domain.
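
For concreteness, registering such a single 'VCPU' level could look
roughly like the sketch below; the names are illustrative, not
necessarily the ones used in the actual RFC patch.

/*
 * Sketch of a "flat" scheduling domain topology for a Xen PV guest:
 * one level spanning every vCPU, named "VCPU" as it would show up in
 * dmesg and /proc/sys/kernel/sched_domain/.  Hypothetical identifiers.
 */
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/cpumask.h>

/* Every vCPU is treated as equidistant from every other vCPU. */
static const struct cpumask *xen_pv_cpu_mask(int cpu)
{
	return cpu_possible_mask;
}

/* A single topology level: no SMT, MC or DIE distinction is left. */
static struct sched_domain_topology_level xen_pv_topology[] = {
	{ xen_pv_cpu_mask, SD_INIT_NAME(VCPU) },
	{ NULL, },
};

/* Called early during Xen PV setup, before the domains are built. */
void __init xen_pv_setup_sched_topology(void)
{
	set_sched_topology(xen_pv_topology);
}

The key point is that the table handed to set_sched_topology() contains
only one level, so the scheduler has nothing SMT- or MC-shaped left to
build domains from.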
>>>
>>> EVALUATION
>>> ==========
>>> I've tested this with UnixBench, and by looking at Xen build time, on
>>> 16-, 24- and 48-pCPU hosts. I've run the benchmarks in Dom0 only, for
>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>> something similar to this in DomU already, AFAUI).
>>>
>>> I've run the benchmarks with and without the patch applied ('patched'
>>> and 'vanilla', respectively, in the tables below), and with different
>>> numbers of build jobs (in the case of the Xen build) or of parallel
>>> copies of the benchmarks (in the case of UnixBench).
>>>
>>> What I get from the numbers is that the patch almost always brings
>>> benefits, in some cases even huge ones. There are a couple of cases
>>> where we regress, but always only slightly, especially when compared
>>> to the magnitude of some of the improvements that we get.
>>>
>>> Also bear in mind that these results were gathered in Dom0, without
>>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
>>> we move things into DomU and do overcommit at the Xen scheduler level,
>>> I expect even better results.
>>>
>> ...
>>> REQUEST FOR COMMENTS
>>> ====================
>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>    - what you guys think of the approach,
>>
>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>> myself) discussed this topic again.
>>
> Hey,
>
> Sorry for replying so late, I've been on vacation from right after
> XenSummit up until yesterday. :-)
>
>> Regarding a possible future scenario with credit2 eventually supporting
>> gang scheduling on hyperthreads (which is desirable for security
>> reasons [side channel attacks] and fairness), my patch seems to be
>> better suited to that direction than yours.
>>
> Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
> exactly around the corner (although we can prioritize working on it if
> we want).
>
> In principle, I think it's a really nice idea. I still don't have a
> clear picture of how we would handle a couple of situations, but let's
> leave that aside for now and stay on topic.
>
>> Correct me if I'm wrong, but I
>> think scheduling domains won't enable the guest kernel's scheduler to
>> migrate threads more easily between hyperthreads as opposed to other
>> vCPUs, while my approach can easily be extended to do so.
>>
> I'm not sure I understand what you mean here. As far as the (Linux)
> scheduler is concerned, your patch and mine do the exact same thing:
> they arrange for the scheduling domains, when they're built during
> boot, not to consider hyperthreads or multi-cores.
>
> Mine does it by removing the SMT (and the MC) level from the data
> structure in the scheduler that is used as a base for configuring the
> scheduling domains. Yours does it by making the topology bitmaps that
> are used at each one of those levels all look the same. In fact, with
> your patch applied, I get the exact same situation as with mine, as far
> as scheduling domains are concerned: there is only one scheduling
> domain, with a different scheduling group for each vCPU inside it.

Uuh, nearly.

Your case won't deal correctly with NUMA, as the generic NUMA code is
using set_sched_topology() as well. Either the NUMA code or the Xen code
will win and overwrite the other's settings.
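
Roughly speaking (a simplified sketch, not the verbatim kernel code),
the mechanism behind the conflict is that the active topology table is
a single global pointer, so whichever caller installs its table last
wins:

/* Simplified sketch of the clash, not a copy of the kernel sources. */
struct sched_domain_topology_level;	/* defined in <linux/sched.h> */

static struct sched_domain_topology_level *sched_domain_topology;

void set_sched_topology(struct sched_domain_topology_level *tl)
{
	sched_domain_topology = tl;	/* last writer wins */
}

/*
 * The NUMA initialisation installs a table extended with its NUMA
 * levels, while the Xen PV code would install the flat single-level
 * table; depending on ordering, one caller's levels simply replace
 * the other's.
 */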

To do things correctly you will have to handle NUMA as well.


Juergen

