linux-kernel - Re: [PATCH 2/2] x86/idle: use dynamic halt poll

Date:   Mon, 3 Jul 2017 17:28:42 +0800
From:   Yang Zhang <yang.zhang.wz@...il.com>
To:     Radim Krčmář <rkrcmar@...hat.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Cc:     Wanpeng Li <kernellwp@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        the arch/x86 maintainers <x86@...nel.org>,
        Jonathan Corbet <corbet@....net>, tony.luck@...el.com,
        Borislav Petkov <bp@...en8.de>,
        Peter Zijlstra <peterz@...radead.org>, mchehab@...nel.org,
        Andrew Morton <akpm@...ux-foundation.org>, krzk@...nel.org,
        jpoimboe@...hat.com, Andy Lutomirski <luto@...nel.org>,
        Christian Borntraeger <borntraeger@...ibm.com>,
        Thomas Garnier <thgarnie@...gle.com>,
        Robert Gerst <rgerst@...il.com>,
        Mathias Krause <minipli@...glemail.com>,
        douly.fnst@...fujitsu.com, Nicolai Stange <nicstange@...il.com>,
        Frederic Weisbecker <fweisbec@...il.com>, dvlasenk@...hat.com,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        yamada.masahiro@...ionext.com, mika.westerberg@...ux.intel.com,
        Chen Yu <yu.c.chen@...el.com>, aaron.lu@...el.com,
        Steven Rostedt <rostedt@...dmis.org>,
        Kyle Huey <me@...ehuey.com>, Len Brown <len.brown@...el.com>,
        Prarit Bhargava <prarit@...hat.com>,
        hidehiro.kawai.ez@...achi.com, fengtiantian@...wei.com,
        pmladek@...e.com, jeyu@...hat.com, Larry.Finger@...inger.net,
        zijun_hu@....com, luisbg@....samsung.com, johannes.berg@...el.com,
        niklas.soderlund+renesas@...natech.se, zlpnobody@...il.com,
        Alexey Dobriyan <adobriyan@...il.com>, fgao@...ai8.com,
        ebiederm@...ssion.com,
        Subash Abhinov Kasiviswanathan <subashab@...eaurora.org>,
        Arnd Bergmann <arnd@...db.de>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        Mel Gorman <mgorman@...hsingularity.net>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        linux-doc@...r.kernel.org, linux-edac@...r.kernel.org,
        kvm <kvm@...r.kernel.org>
Subject: Re: [PATCH 2/2] x86/idle: use dynamic halt poll

On 2017/6/27 22:22, Radim Krčmář wrote:
> 2017-06-27 15:56+0200, Paolo Bonzini:
>> On 27/06/2017 15:40, Radim Krčmář wrote:
>>>> ... which is not necessarily _wrong_.  It's just a different heuristic.
>>> Right, it's just harder to use than host's single_task_running() -- the
>>> VCPU calling vcpu_is_preempted() is never preempted, so we have to look
>>> at other VCPUs that are not halted, but still preempted.
>>>
>>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>>> yield to the host.  Working under the assumption that there is work for
>>> this PCPU if other VCPUs have stuff to do.  The downside is that it
>>> misses information about host's topology, so it would be hard to make it
>>> work well.
>>
>> I would just use vcpu_is_preempted on the current CPU.  From guest POV
>> this option is really a "f*** everyone else" setting just like
>> idle=poll, only a little more polite.
>
> vcpu_is_preempted() on current cpu cannot return true, AFAIK.
>
>> If we've been preempted and we were polling, there are two cases.  If an
>> interrupt was queued while the guest was preempted, the poll will be
>> treated as successful anyway.
>
> I think the poll should be treated as invalid if the window has expired
> while the VCPU was preempted -- the guest can't tell whether the
> interrupt arrived still within the poll window (unless we added paravirt
> for that), so it shouldn't be wasting time waiting for it.
>
>>                                If it hasn't, let others run---but really
>> that's not because the guest wants to be polite, it's to avoid that the
>> scheduler penalizes it excessively.
>
> This sounds like a VM entry just to do an immediate VM exit, so paravirt
> seems better here as well ... (the guest telling the host about its
> window -- which could also be used to rule it out as a target in the
> pause loop random kick.)
>
>> So until it's preempted, I think it's okay if the guest doesn't care
>> about others.  You wouldn't use this option anyway in overcommitted
>> situations.
>>
>> (I'm still not very convinced about the idea).
>
> Me neither.  (The same mechanism is applicable to bare-metal, but was
> never used there, so I would rather bring the guest behavior closer to
> bare-metal.)
>

The background is that we (Alibaba Cloud) are getting more and more
complaints from our customers about both KVM and Xen compared to bare
metal. After investigation, the root cause is known to us: the high cost
of message-passing workloads (David showed it at KVM Forum 2015).

A typical message-passing workload looks like this:
vcpu 0                             vcpu 1
1. send ipi                     2.  doing hlt
3. go into idle                 4.  receive ipi and wake up from hlt
5. write APIC timer twice       6.  write APIC timer twice to
    to stop sched timer              reprogram sched timer
7. doing hlt                    8.  handle task and send ipi to
                                     vcpu 0
9. same as 4.                   10. same as 3.

One transaction introduces about 12 vmexits (2 hlt and 10 MSR writes).
The cost of these vmexits degrades performance severely. The Linux
kernel already provides idle=poll to mitigate this, but it only
eliminates the IPI and hlt vmexits; it does nothing about starting and
stopping the sched timer. A compromise would be to turn off NOHZ in the
kernel, but that is not the default config for new distributions. The
same goes for halt-poll in KVM: it only removes the cost of scheduling
in and out on the host and cannot help such workloads much.

The purpose of this patch is to improve the current idle=poll mechanism
to use dynamic polling, and to poll before touching the sched timer. It
should not be a virtualization-specific feature, but bare metal seems to
have a low cost for accessing the MSR, so I want to enable it only in
VMs. Though the idea behind the patch may not be perfect for all
conditions, it looks no worse than what we have now.
How about we keep the current implementation and I integrate the patch
into the paravirtualized part, as Paolo suggested? We can continue to
discuss it, and I will keep refining it if anyone has better suggestions.
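
For illustration only, here is a minimal user-space sketch of what an
adaptive ("dynamic") poll window could look like. It is not the code
from the patch, which is not shown in this message. The struct, the
field names and the grow/shrink policy are all assumptions, loosely
modeled on the grow/shrink idea used by host-side halt polling in KVM.

```c
#include <stdint.h>

/* Hypothetical tunables, named for this sketch only. */
struct poll_cfg {
    uint64_t max_ns;    /* upper bound on the poll window */
    uint64_t start_ns;  /* window used when growing from zero */
    unsigned grow;      /* multiplicative grow factor */
    unsigned shrink;    /* divisor on shrink; 0 means reset to zero */
};

/*
 * Return the poll window to use for the next idle period.
 * window_ns: window used for the idle period that just ended
 * idle_ns:   how long the CPU really was idle before work arrived
 */
static uint64_t poll_window_update(const struct poll_cfg *cfg,
                                   uint64_t window_ns, uint64_t idle_ns)
{
    /* Work arrived inside the window: the poll paid off, keep it. */
    if (idle_ns <= window_ns)
        return window_ns;

    /* A somewhat longer poll would have caught the wakeup: grow. */
    if (idle_ns <= cfg->max_ns) {
        uint64_t next = window_ns ? window_ns * cfg->grow : cfg->start_ns;
        return next > cfg->max_ns ? cfg->max_ns : next;
    }

    /* Idle for longer than we would ever poll: shrink. */
    return cfg->shrink ? window_ns / cfg->shrink : 0;
}
```

The idle loop would poll for at most the current window before stopping
the sched timer and executing hlt, so short idle periods such as steps
3 to 9 above would avoid both the hlt and the timer MSR vmexits.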


-- 
Yang
Alibaba Cloud Computing