linux-kernel - Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support

Date:   Fri, 1 Sep 2017 14:21:53 +0800
From:   Yang Zhang <yang.zhang.wz@...il.com>
To:     Alexander Graf <agraf@...e.de>, linux-kernel@...r.kernel.org
Cc:     kvm@...r.kernel.org, wanpeng.li@...mail.com, mst@...hat.com,
        pbonzini@...hat.com, tglx@...utronix.de, rkrcmar@...hat.com,
        dmatlack@...gle.com, peterz@...radead.org,
        linux-doc@...r.kernel.org
Subject: Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support

On 2017/8/29 19:58, Alexander Graf wrote:
> On 08/29/2017 01:46 PM, Yang Zhang wrote:
>> Some latency-intensive workloads see an obvious performance
>> drop when running inside a VM. The main reason is that the overhead
>> is amplified when running inside a VM. The largest cost I have seen
>> is in the idle path.
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering the idle state. If a reschedule is needed during the poll,
>> then we don't need to go through the heavy overhead path.
>>
>> Here is the data we get when running the contextswitch benchmark to
>> measure latency (lower is better):
>>
>>     1. w/o patch:
>>        2493.14 ns/ctxsw -- 200.3 %CPU
>>     2. w/ patch:
>>        halt_poll_threshold=10000 -- 1485.96 ns/ctxsw -- 201.0 %CPU
>>        halt_poll_threshold=20000 -- 1391.26 ns/ctxsw -- 200.7 %CPU
>>        halt_poll_threshold=30000 -- 1488.55 ns/ctxsw -- 200.1 %CPU
>>        halt_poll_threshold=500000 -- 1159.14 ns/ctxsw -- 201.5 %CPU
>>     3. kvm dynamic poll
>>        halt_poll_ns=10000 -- 2296.11 ns/ctxsw -- 201.2 %CPU
>>        halt_poll_ns=20000 -- 2599.7 ns/ctxsw -- 201.7 %CPU
>>        halt_poll_ns=30000 -- 2588.68 ns/ctxsw -- 211.6 %CPU
>>        halt_poll_ns=500000 -- 2423.20 ns/ctxsw -- 229.2 %CPU
>>     4. idle=poll
>>        2050.1 ns/ctxsw -- 1003 %CPU
>>     5. idle=mwait
>>        2188.06 ns/ctxsw -- 206.3 %CPU
> 
> Could you please try to create another metric for guest initiated, host 
> aborted mwait?
> 
> For a quick benchmark, reserve 4 registers for a magic value, set them 
> to the magic value before you enter MWAIT in the guest. Then allow 
> native MWAIT execution on the host. If you see the guest wants to enter 

I guess you mean allowing native MWAIT execution in the guest, not on the host?

> with the 4 registers containing the magic contents and no events are 
> pending, directly go into the vcpu block function on the host.

Hmm, this is not very clear to me. If the guest executes MWAIT without a
vmexit, how do we check the registers?

> 
> That way any time a guest gets naturally aborted while in mwait, it will 
> only reenter mwait when an event actually occurred. While the guest is 
> normally running (and nobody else wants to run on the host), we just 
> stay in guest context, but with a sleeping CPU.
> 
> Overall, that might give us even better performance, as it allows for 
> turbo boost and HT to work properly.

In our testing, we have enough cores (32 cores) but only 10 VCPUs, so in
the best case we may see the same performance as poll.

-- 
Yang
Alibaba Cloud Computing
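
The poll-before-idle idea described in the quoted cover letter can be illustrated with a small userspace sketch. This is not the code from the patch series: poll_before_halt(), work_pending and threshold_ns are hypothetical names, the atomic flag stands in for whatever "schedule is needed" check the kernel performs, and treating the threshold as nanoseconds (by analogy with the halt_poll_threshold values in the data above) is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Current time in nanoseconds, via C11 timespec_get (adequate for a sketch). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: spin for at most threshold_ns, checking whether work
 * has become pending. Returns true if work showed up while polling (the
 * expensive halt path can be skipped), false if the caller should go on to
 * halt. The flag is checked at least once, even with a zero threshold.
 */
static bool poll_before_halt(atomic_bool *work_pending, uint64_t threshold_ns)
{
    assert(work_pending != NULL);

    uint64_t start = now_ns();
    do {
        if (atomic_load(work_pending))
            return true;   /* work arrived during the poll window */
    } while (now_ns() - start < threshold_ns);

    return false;          /* poll window expired with nothing to do */
}
```

A caller would try poll_before_halt(&flag, 10000) first and fall through to the real halt only when it returns false. The trade-off visible in the quoted numbers is that a longer window burns more CPU while spinning but avoids the heavy idle path more often.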