linux-kernel - Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

Message-ID: <abcfc109-296d-ec8e-2f4a-4f55f6a1b632@suse.de>
Date:   Tue, 4 Apr 2017 14:51:06 +0200
From:   Alexander Graf <agraf@...e.de>
To:     Radim Krčmář <rkrcmar@...hat.com>
Cc:     Jim Mattson <jmattson@...gle.com>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        "Gabriel L. Somlo" <gsomlo@...il.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        the arch/x86 maintainers <x86@...nel.org>,
        Joerg Roedel <joro@...tes.org>, kvm list <kvm@...r.kernel.org>,
        linux-doc@...r.kernel.org
Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On 04/04/2017 02:39 PM, Radim Krčmář wrote:
> 2017-04-03 12:04+0200, Alexander Graf:
>> On 03/29/2017 02:11 PM, Radim Krčmář wrote:
>>> 2017-03-28 13:35-0700, Jim Mattson:
>>>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrcmar@...hat.com> wrote:
>>>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>>>> Guests running Mac OS X 10.5, 10.6, and 10.7 (Leopard through Lion) have a problem:
>>>>>>> unless explicitly provided with kernel command line argument
>>>>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>>>>> without checking CPUID.
>>>>>>>
>>>>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>>>>> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>>>>>>> but that isn't any worse than a NOP emulation.
>>>>>>>
>>>>>>> Note that mwait within guests is not the same as on real hardware
>>>>>>> because halt causes an exit while mwait doesn't.  For this reason it
>>>>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>>>>> signal this capability.  Add a flag in the hypervisor leaf instead.
>>>>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>>>>> In that case, we could do something as fancy as
>>>>>>
>>>>>> Treat MWAIT as pass-through by default
>>>>>>
>>>>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>>>>> checks which instruction we're in
>>>>>>
>>>>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>>>>> if $IP was in non-mwait within that time, reset counter.
>>>>> Or we could reuse external interrupts for sampling.  Exits triggered by
>>>>> them would check for current instruction (probably would be best to
>>>>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>>>>> would imply that MWAIT is not used.
>>>>>
>>>>>> Or instead maybe just reuse the adaptive hlt logic?
>>>>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>>>>> makes sense.  We would just add new wakeup methods.
>>>>>
>>>>>> Either way, with that we should be able to get super low latency IPIs
>>>>>> running while still maintaining some sanity on systems which don't have
>>>>>> dedicated CPUs for workloads.
>>>>>>
>>>>>> And we wouldn't need guest modifications, which is a great plus. So older
>>>>>> guests (and Windows?) could benefit from mwait as well.
>>>>> There is no need for guest modifications -- it could be exposed as standard
>>>>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>>>>> on the user.
>>>>>
>>>>> I think that the page-fault based MWAIT would require paravirt if it
>>>>> should be enabled by default, because of performance concerns:
>>>>> Enabling write protection on a page needs a VM exit on all other VCPUs
>>>>> when beginning monitoring (to reload page permissions and prevent missed
>>>>> writes).
>>>>> We'd want to keep trapping writes to the page all the time because
>>>>> toggling is slow, but this could regress performance for an OS that has
>>>>> other data accessed by other VCPUs in that page.
>>>>> No current interface can tell the guest that it should reserve the whole
>>>>> page instead of what CPUID[5] says and that writes to the monitored page
>>>>> are not "cheap", but can trigger a VM exit ...
>>>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>>>> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>>>> when running Mac OS X guests. Per Intel's SDM volume 3, section
>>>> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>>>> pad the data structure used to monitor writes. Software must make sure
>>>> that beyond the data structure, no unrelated data variable exists in
>>>> the triggering area for MWAIT. A pad may be needed to avoid this
>>>> situation." Unfortunately, most operating systems do not follow this
>>>> advice.
>>> Right, EBX provides what we need to expose that the whole page is
>>> monitored, thanks!
>> So coming back to the original patch, is there anything that should keep us
>> from exposing MWAIT straight into the guest at all times?
> Just minor issues:
>   * OS X on Core 2 fails for unknown reason if we disable the instruction
>     trapping, which is an argument against doing it by default

So for that we should try and see whether changing the exposed CPUID MWAIT
leaf helps. Currently we return 0/0, which is pretty bogus and might be
the reason OS X fails.
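
To illustrate why 0/0 looks bogus, here is a minimal sketch of how a guest
might decode the leaf. It assumes the SDM layout (smallest monitor-line size
in EAX[15:0], largest in EBX[15:0], both in bytes). The struct and function
names are made up for this example and are not KVM code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: decoded view of CPUID.05H (not a KVM structure). */
struct mwait_leaf {
    uint16_t smallest_line; /* EAX[15:0], in bytes */
    uint16_t largest_line;  /* EBX[15:0], in bytes */
};

/* Extract the monitor-line sizes from the raw register values. */
static struct mwait_leaf decode_mwait_leaf(uint32_t eax, uint32_t ebx)
{
    struct mwait_leaf l = {
        .smallest_line = (uint16_t)(eax & 0xffff),
        .largest_line  = (uint16_t)(ebx & 0xffff),
    };
    return l;
}

/*
 * Assumed sanity rule for this sketch: both sizes are non-zero and the
 * largest size is not below the smallest one.
 */
static bool mwait_leaf_plausible(struct mwait_leaf l)
{
    return l.smallest_line != 0 && l.largest_line >= l.smallest_line;
}
```

With this rule, the 0/0 we return today is rejected, while the 64/4096
combination Jim mentioned passes and tells the guest to pad the monitored
structure to a whole page.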

>   * idling guests would consume host CPU, which is a significant change
>     in behavior and shouldn't be done without userspace's involvement

That's the same as today: an idling guest that uses MWAIT already ends up
spinning in a NOP-emulated loop.
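
The sampling idea from earlier in the thread would bound that cost. A rough
sketch follows, assuming 10 samples per second and a 1 second threshold as
suggested there. All names are hypothetical, and falling back to pass-through
as soon as a sample sees a non-MWAIT instruction is my assumption, not
something we settled on.

```c
#include <stdbool.h>

#define SAMPLE_HZ       10 /* samples per second, as suggested in the thread */
#define MWAIT_IDLE_SECS 1  /* "say" 1 second of continuous MWAIT */

/* Hypothetical per-vcpu state; none of these names exist in KVM. */
struct mwait_sampler {
    unsigned int idle_samples; /* consecutive samples that found the vcpu in MWAIT */
    bool trap_mwait;           /* true: emulate MWAIT, false: pass it through */
};

/* Called on every sample tick with whether the guest IP sat on MWAIT. */
static void mwait_sample(struct mwait_sampler *s, bool in_mwait)
{
    if (!in_mwait) {
        /* Guest did real work: reset the counter and go back to pass-through. */
        s->idle_samples = 0;
        s->trap_mwait = false;
        return;
    }

    /* Saturate the counter at the threshold so it cannot overflow. */
    if (s->idle_samples < SAMPLE_HZ * MWAIT_IDLE_SECS)
        s->idle_samples++;

    /* A full second of MWAIT samples: switch to emulated (trapping) MWAIT. */
    if (s->idle_samples >= SAMPLE_HZ * MWAIT_IDLE_SECS)
        s->trap_mwait = true;
}
```

A vcpu that stays in MWAIT for ten consecutive samples gets switched to the
trapping path, and a single sample outside MWAIT puts it back on the low
latency path.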

Please bear in mind that I do not advocate exposing the MWAIT CPUID
flag. This is only about the instruction trap.

> I think the best compromise is to add a capability for the MWAIT VM-exit
> controls and let userspace expose MWAIT if it wishes to.
> Will send a patch.

Please see my patch to force-enable CPUID bits ;).

Alex