Date:	Wed, 23 Sep 2015 09:35:51 +0200
From:	Juergen Gross <jgross@...e.com>
To:	Dario Faggioli <dario.faggioli@...rix.com>
Cc:	"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>,
	Andrew Cooper <Andrew.Cooper3@...rix.com>,
	"Luis R. Rodriguez" <mcgrof@...not-panic.com>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	George Dunlap <George.Dunlap@...rix.com>,
	David Vrabel <david.vrabel@...rix.com>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	Stefano Stabellini <stefano.stabellini@...citrix.com>
Subject: Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling
 domain hierarchy

On 09/23/2015 09:24 AM, Dario Faggioli wrote:
> On Mon, 2015-09-21 at 07:49 +0200, Juergen Gross wrote:
>> On 09/15/2015 06:50 PM, Dario Faggioli wrote:
>>> On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
>>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>>> Hey everyone,
>>>>>
>>>>> So, as a follow-up to what we were discussing in this thread:
>>>>>
>>>>>     [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>>>     http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>>>
>>>>> I started looking in more detail at scheduling domains in the Linux
>>>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird
>>>>> way of interacting, while what I'm proposing here is completely
>>>>> independent of them both.
>>>>>
>>>>> In fact, no matter whether vNUMA is supported and enabled, and no
>>>>> matter whether CPUID is reporting accurate, random, meaningful or
>>>>> completely misleading information, I think we should do something
>>>>> about how scheduling domains are built.
>>>>>
>>>>> Fact is, unless we use 1:1 pinning that is immutable across the
>>>>> guest's whole lifetime, scheduling domains should not be constructed,
>>>>> in Linux, by looking at *any* topology information, because that just
>>>>> does not make any sense when vcpus move around.
>>>>>
>>>>> Let me state this again (hoping to make myself as clear as possible):
>>>>> no matter how good a shape we put CPUID support in, no matter how
>>>>> beautifully and consistently that will interact with vNUMA, licensing
>>>>> requirements and whatever else, it will always be possible for vCPU #0
>>>>> and vCPU #3 to be scheduled on two SMT threads at time t1, and on two
>>>>> different NUMA nodes at time t2. Hence, the Linux scheduler should
>>>>> really not skew its load balancing logic toward either of those two
>>>>> situations, as neither of them can be considered correct (since
>>>>> nothing is!).
>>>>>
>>>>> For now, this only covers the PV case. The HVM case shouldn't be any
>>>>> different, but I haven't looked at how to make the same thing happen
>>>>> there as well.
>>>>>
>>>>> OVERALL DESCRIPTION
>>>>> ===================
>>>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>>>> domains in such a way that there is only one of them, spanning all
>>>>> the pCPUs of the guest.
>>>>>
>>>>> Note that the patch deals directly with scheduling domains, and there
>>>>> is no need to alter the masks that will then be used for building and
>>>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That
>>>>> is the main difference between it and the patch proposed by Juergen
>>>>> here:
>>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>>>
>>>>> This means that when, in the future, we fix CPUID handling and make
>>>>> it comply with whatever logic or requirements we want, that won't
>>>>> have any unexpected side effects on scheduling domains.
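For illustration, here is a minimal sketch of the kind of change being
described above: registering a single-level scheduling-domain topology
table from the Xen PV setup code. This is not the actual RFC patch; the
names xen_pv_sched_topology and xen_pv_flatten_sched_domains are made
up, and it assumes a ~v4.2-era kernel where set_sched_topology(),
SD_INIT_NAME() and cpu_cpu_mask() are available via <linux/sched.h>.

#include <linux/init.h>
#include <linux/sched.h>

/* Illustrative only: a single topology level, so only one scheduling
 * domain (named 'VCPU') is built, with one scheduling group per vCPU. */
static struct sched_domain_topology_level xen_pv_sched_topology[] = {
	/* same mask as the default DIE level: all CPUs of the node,
	 * i.e. all vCPUs when no (v)NUMA is exposed to the guest */
	{ cpu_cpu_mask, SD_INIT_NAME(VCPU) },
	{ NULL, },
};

static void __init xen_pv_flatten_sched_domains(void)
{
	/* must run before the scheduler builds its domains at boot */
	set_sched_topology(xen_pv_sched_topology);
}

The point relative to the alternative discussed further down is that
the masks used for CPUID//proc/cpuinfo topology reporting are left
completely untouched here.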
>>>>>
>>>>> Information about how the scheduling domains are being constructed
>>>>> during boot is available in `dmesg', if the kernel is booted with the
>>>>> 'sched_debug' parameter. It is also possible to look at
>>>>> /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>>>
>>>>> With the patch applied, only one scheduling domain is created, called
>>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You
>>>>> can tell that from the fact that every cpu* folder in
>>>>> /proc/sys/kernel/sched_domain/ only has one subdirectory ('domain0'),
>>>>> with all the tweaks and the tunables for our scheduling domain.
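As a quick way of checking that from user space, here is a hypothetical
helper (not from the thread) that prints the name of each
scheduling-domain level cpu0 ends up with; it assumes a kernel built
with CONFIG_SCHED_DEBUG, which is what exposes the per-domain 'name'
files under /proc/sys/kernel/sched_domain/.

#include <glob.h>
#include <stdio.h>

int main(void)
{
	glob_t g;
	size_t i;

	/* one "name" file per scheduling-domain level of cpu0 */
	if (glob("/proc/sys/kernel/sched_domain/cpu0/domain*/name",
		 0, NULL, &g) != 0) {
		fprintf(stderr, "no sched_domain entries found\n");
		return 1;
	}
	for (i = 0; i < g.gl_pathc; i++) {
		char name[64] = "";
		FILE *f = fopen(g.gl_pathv[i], "r");

		if (f && fgets(name, sizeof(name), f))
			printf("%s: %s", g.gl_pathv[i], name);
		if (f)
			fclose(f);
	}
	globfree(&g);

	/* with the patch applied, a single 'VCPU' entry should show up */
	return 0;
}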
>>>>>
>>>>> EVALUATION
>>>>> ==========
>>>>> I've tested this with UnixBench, and by looking at Xen build time, on
>>>>> 16-, 24- and 48-pCPU hosts. I've run the benchmarks in Dom0 only, for
>>>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>>>> something similar to this in DomU already, AFAUI).
>>>>>
>>>>> I've run the benchmarks with and without the patch applied ('patched'
>>>>> and 'vanilla', respectively, in the tables below), and with different
>>>>> numbers of build jobs (in the case of the Xen build) or of parallel
>>>>> copies of the benchmarks (in the case of UnixBench).
>>>>>
>>>>> What I get from the numbers is that the patch almost always brings
>>>>> benefits, in some cases even huge ones. There are a couple of cases
>>>>> where we regress, but always only slightly so, especially when
>>>>> compared to the magnitude of some of the improvements we get.
>>>>>
>>>>> Also bear in mind that these results are gathered from Dom0, and
>>>>> without any overcommitment at the vCPU level (i.e., nr. vCPUs == nr.
>>>>> pCPUs). If we move things into DomU and do overcommit at the Xen
>>>>> scheduler level, I am expecting even better results.
>>>>>
>>>> ...
>>>>> REQUEST FOR COMMENTS
>>>>> ====================
>>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>>     - what you guys think of the approach,
>>>>
>>>> Yesterday, at the end of the developer meeting, we (Andrew, Elena and
>>>> I) discussed this topic again.
>>>>
>>> Hey,
>>>
>>> Sorry for replying so late, I've been on vacation from right after
>>> XenSummit up until yesterday. :-)
>>>
>>>> Regarding a possible future scenario with credit2 eventually
>>>> supporting gang scheduling on hyperthreads (which is desirable for
>>>> security reasons [side channel attacks] and for fairness), my patch
>>>> seems to be better suited for that direction than yours.
>>>>
>>> Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
>>> exactly around the corner (although we can prioritize working on it if
>>> we want).
>>>
>>> In principle, I think it's a really nice idea. I still don't have a
>>> clear idea of how we would handle a couple of situations, but let's
>>> leave this aside for now and stay on-topic.
>>>
>>>> Correct me if I'm wrong, but I think scheduling domains won't enable
>>>> the guest kernel's scheduler to migrate threads more easily between
>>>> hyperthreads as opposed to between other vcpus, while my approach can
>>>> easily be extended to do so.
>>>>
>>> I'm not sure I understand what you mean here. As far as the (Linux)
>>> scheduler is concerned, your patch and mine do the exact same thing:
>>> they arrange for the scheduling domains, when they're built during
>>> boot, not to consider hyperthreads or multi-cores.
>>>
>>> Mine does it by removing the SMT (and the MC) levels from the data
>>> structure in the scheduler that is used as a base for configuring the
>>> scheduling domains. Yours does it by making the topology bitmaps that
>>> are used at each one of those levels all look the same. In fact, with
>>> your patch applied, I get the exact same situation as with mine, as far
>>> as scheduling domains are concerned: there is only one scheduling
>>> domain, with a different scheduling group for each vCPU inside it.
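For contrast, a sketch of what the "make the bitmaps all look the same"
direction could look like (illustrative only, with a made-up function
name; Juergen's actual patch is at the URL quoted earlier): the per-CPU
sibling and core masks that feed both the SMT/MC scheduling-domain
levels and the reported topology are reduced to the CPU itself, so
every level degenerates into single-CPU groups. This assumes the x86
definitions of these accessors, where the masks are per-CPU variables
that the bring-up code is allowed to (re)fill.

#include <linux/cpumask.h>
#include <linux/topology.h>

static void xen_pv_flatten_topology_masks(unsigned int cpu)
{
	/* report the vCPU as its own and only hyperthread sibling... */
	cpumask_clear(topology_sibling_cpumask(cpu));
	cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));

	/* ...and as the only CPU in its core group */
	cpumask_clear(topology_core_cpumask(cpu));
	cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
}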
>>
>> Uuh, nearly.
>>
>> Your case won't deal correctly with NUMA, as the generic NUMA code is
>> using set_sched_topology() as well.
>>
> Mmm... have you tried and seen something like this? AFAICT, the
> NUMA-related setup steps of scheduling domains happen after the basic
> (as in "without taking NUMAness into account") topology has already
> been set, and build on top of it.
>
> It uses set_sched_topology() only in a special case which I'm not sure
> we'd be hitting.

Depends on the hardware. On some AMD processors one socket covers
multiple NUMA nodes; this is the critical case. On those machines,
set_sched_topology() may be called multiple times while bringing up
additional cpus.
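The risk that one side overwrites the other's settings follows from how
set_sched_topology() worked in kernels of that era: it simply replaces
a single global table pointer, so whichever caller runs last (the Xen
setup code or the NUMA/AMD topology code) decides which levels are used
when the domains are built. Roughly, as a simplified sketch rather than
an exact copy of kernel/sched/core.c:

#include <linux/sched.h>

/* one global table; the last set_sched_topology() caller wins */
static struct sched_domain_topology_level *sched_domain_topology;

void set_sched_topology(struct sched_domain_topology_level *tl)
{
	sched_domain_topology = tl;
}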

> I'm asking because trying this out, right now, is not straightforward,
> as PV vNUMA, even with Wei's Linux patches and with either your patch
> or mine, still runs into the CPUID issue... I'll try that ASAP, but
> there are a couple of things I've got to finish in the next few days.
>
>> Either NUMA or Xen will win and overwrite the other's settings.
>>
> Not sure what this means, but as I said, I'll try.

Make sure to use the correct hardware (I'm pretty sure this should be
the AMD "Magny-Cours" [1]).


Juergen

[1]: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/introduction-to-magny-cours/

