linux-kernel - Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170404123915.GA9525@potion>
Date:   Tue, 4 Apr 2017 14:39:16 +0200
From:   Radim Krčmář <rkrcmar@...hat.com>
To:     Alexander Graf <agraf@...e.de>
Cc:     Jim Mattson <jmattson@...gle.com>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        "Gabriel L. Somlo" <gsomlo@...il.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        the arch/x86 maintainers <x86@...nel.org>,
        Joerg Roedel <joro@...tes.org>, kvm list <kvm@...r.kernel.org>,
        linux-doc@...r.kernel.org
Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-04-03 12:04+0200, Alexander Graf:
> On 03/29/2017 02:11 PM, Radim Krčmář wrote:
>> 2017-03-28 13:35-0700, Jim Mattson:
>> > On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrcmar@...hat.com> wrote:
>> > > 2017-03-27 15:34+0200, Alexander Graf:
>> > > > On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>> > > > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>> > > > > unless explicitly provided with kernel command line argument
>> > > > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>> > > > > without checking CPUID.
>> > > > > 
>> > > > > We currently emulate that as a NOP but on VMX we can do better: let
>> > > > > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>> > > > > but that isn't any worse than a NOP emulation.
>> > > > > 
>> > > > > Note that mwait within guests is not the same as on real hardware
>> > > > > because halt causes an exit while mwait doesn't.  For this reason it
>> > > > > might not be a good idea to use the regular MWAIT flag in CPUID to
>> > > > > signal this capability.  Add a flag in the hypervisor leaf instead.
>> > > > So imagine we had proper MWAIT emulation capabilities based on page faults.
>> > > > In that case, we could do something as fancy as
>> > > > 
>> > > > Treat MWAIT as pass-through by default
>> > > > 
>> > > > Have a per-vcpu monitor timer 10 times a second in the background that
>> > > > checks which instruction we're in
>> > > > 
>> > > > If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>> > > > if $IP was in non-mwait within that time, reset counter.
>> > > Or we could reuse external interrupts for sampling.  Exits trigerred by
>> > > them would check for current instruction (probably would be best to
>> > > limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>> > > would imply that MWAIT is not used.
>> > > 
>> > > > Or instead maybe just reuse the adapter hlt logic?
>> > > Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>> > > makes sense.  We would just add new wakeup methods.
>> > > 
>> > > > Either way, with that we should be able to get super low latency IPIs
>> > > > running while still maintaining some sanity on systems which don't have
>> > > > dedicated CPUs for workloads.
>> > > > 
>> > > > And we wouldn't need guest modifications, which is a great plus. So older
>> > > > guests (and Windows?) could benefit from mwait as well.
>> > > There is no need guest modifications -- it could be exposed as standard
>> > > MWAIT feature to the guest, with responsibilities for guest/host-impact
>> > > on the user.
>> > > 
>> > > I think that the page-fault based MWAIT would require paravirt if it
>> > > should be enabled by default, because of performance concerns:
>> > > Enabling write protection on a page needs a VM exit on all other VCPUs
>> > > when beginning monitoring (to reload page permissions and prevent missed
>> > > writes).
>> > > We'd want to keep trapping writes to the page all the time because
>> > > toggling is slow, but this could regress performance for an OS that has
>> > > other data accessed by other VCPUs in that page.
>> > > No current interface can tell the guest that it should reserve the whole
>> > > page instead of what CPUID[5] says and that writes to the monitored page
>> > > are not "cheap", but can trigger a VM exit ...
>> > CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>> > VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>> > when running Mac OS X guests. Per Intel's SDM volume 3, section
>> > 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>> > pad the data structure used to monitor writes. Software must make sure
>> > that beyond the data structure, no unrelated data variable exists in
>> > the triggering area for MWAIT. A pad may be needed to avoid this
>> > situation." Unfortunately, most operating systems do not follow this
>> > advice.
>> Right, EBX provides what we need to expose that the whole page is
>> monitored, thanks!
> 
> So coming back to the original patch, is there anything that should keep us
> from exposing MWAIT straight into the guest at all times?

Just minor issues:
 * OS X on Core 2 fails for unknown reason if we disable the instruction
   trapping, which is an argument against doing it by default
 * idling guests would consume host CPU, which is a significant change
   in behavior and shouldn't be done without userspace's involvement

I think the best compromise is to add a capability for the MWAIT VM-exit
controls and let userspace expose MWAIT if it wishes to.
Will send a patch.