linux-kernel - Re: [RFC] KVM: x86: Allow userspace exit on HLT and MWAIT, else yield on MWAIT

Message-ID: <63b382bf-d1fb-464f-ab06-4185f796a85f@amazon.de>
Date:   Mon, 18 Sep 2023 11:41:06 +0200
From:   Alexander Graf <graf@...zon.de>
To:     David Woodhouse <dwmw2@...radead.org>, <kvm@...r.kernel.org>,
        "Peter Zijlstra" <peterz@...radead.org>
CC:     Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, <x86@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>, <linux-kernel@...r.kernel.org>,
        Nicolas Saenz Julienne <nsaenz@...zon.es>,
        "Griffoul, Fred" <fgriffo@...zon.com>
Subject: Re: [RFC] KVM: x86: Allow userspace exit on HLT and MWAIT, else yield
 on MWAIT


On 18.09.23 11:06, David Woodhouse wrote:
> From: David Woodhouse <dwmw@...zon.co.uk>
>
> The VMM may have work to do on behalf of the guest, and it's often
> desirable to use the cycles when the vCPUs are idle.
>
> When the vCPU uses HLT this works out OK because the VMM can run its
> tasks in a separate thread which gets scheduled when the in-kernel
> emulation of HLT schedules away. It isn't perfect, because it doesn't
> easily allow for handling both low-priority maintenance tasks when the
> VMM wants to wait until the vCPU is idle, and also for higher priority
> tasks where the VMM does want to preempt the vCPU. It can also lead to
> noisy neighbour effects, when a host isn't necessarily sized to
> expect any given VMM to suddenly be contending for many *more* pCPUs
> than it has vCPUs.
>
> In addition, there are times when we need to expose MWAIT to a guest
> for compatibility with a previous environment. And MWAIT is much harder
> because it's very hard to emulate properly.
>
> There were attempts at doing so based on marking the target page read-
> only in MONITOR and triggering the wake when it takes a minor fault,
> but so far they haven't led to a working solution:
> https://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/mwait.html
>
> So when a guest executes MWAIT, either we've disabled exit-on-mwait and
> the guest actually sits in non-root mode hogging the pCPU, or if we do
> enable exit-on-mwait the kernel just treats it as a NOP and bounces
> right back into the guest to busy-wait round its idle loop.
>
> For a start, we can stick a yield() into that busy-loop. The yield()
> has fairly poorly defined semantics, but it's better than *nothing* and
> does allow a VMM's thread-based I/O and maintenance tasks to run a
> *little* better.
>
> Better still, we can bounce all the way out to *userspace* on an MWAIT
> exit, and let the VMM perform some of its pending work right there and
> then in the vCPU thread before re-entering the vCPU. That's much nicer
> than yield(). The vCPU is still runnable, since we still don't have a
> *real* emulation of MWAIT, so the vCPU thread can do a *little* bit of
> work and then go back into the vCPU for another turn around the loop.
>
> And if we're going to do that kind of task processing for MWAIT-idle
> guests directly from the vCPU thread, it's neater to do it for HLT-idle
> guests that way too.
>
> For HLT, the vCPU *isn't* runnable; it'll be in KVM_MP_STATE_HALTED.
> The VMM can poll the mp_state and know when the vCPU should be run
> again. But not poll(), although we might want to hook up something like
> that (or just a signal or eventfd) for other reasons for VSM anyway.
> The VMM can also just do some work and then re-enter the vCPU without
> the corresponding bit set in the kvm_run struct.
>
> So, er, what does this patch do? Add a capability, define two bits for
> exiting to userspace on HLT or MWAIT — in the kvm_run struct rather
> than needing a separate ioctl to turn them on or off, so that the VMM
> can make the decision each time it enters the vCPU. Hook it up to
> (ab?)use the existing KVM_EXIT_HLT which was previously only used when
> the local APIC was emulated in userspace, and add a new KVM_EXIT_MWAIT.
>
> Fairly much untested.
>
> If this approach seems reasonable, of course I'll add test cases and
> proper documentation before posting it for real. This is the proof of
> concept before we even put it through testing to see what performance
> we get out of it especially for those obnoxious MWAIT-enabled guests.
>
> Signed-off-by: David Woodhouse <dwmw@...zon.co.uk>


IIUC you want to do work in a user space vCPU thread when the guest vCPU 
is idle. As you pointed out above, KVM cannot actually do much about 
MWAIT: it basically busy loops and hogs the CPU.

The typical flow I would expect for "work in a vCPU thread" is:

0) vCPU runs. HLT/MWAIT is directly exposed to guest.
1) vCPU exits. Creates deferred work. Enables HLT/MWAIT trapping.
2) vCPU runs again
3) vCPU calls HLT/MWAIT. We exit to user space to finish work from 1
4) vCPU runs again without HLT/MWAIT trapping
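
The steps above can be sketched as a small state machine on the VMM side. 
This is only an illustrative model, not KVM code: struct vmm_vcpu, 
enum vcpu_event and vmm_handle_event() are hypothetical names, and the 
trap_idle flag stands in for whatever per-entry bit would end up 
controlling HLT/MWAIT trapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not the KVM UAPI: exits seen by the vCPU thread. */
enum vcpu_event {
	EV_OTHER_EXIT,	/* any exit that leaves work to finish later */
	EV_IDLE_EXIT,	/* guest executed HLT/MWAIT while trapping was on */
};

struct vmm_vcpu {
	bool trap_idle;		/* request HLT/MWAIT trapping on next entry */
	unsigned int pending;	/* deferred work items queued */
	unsigned int completed;	/* deferred work items finished */
};

static void vmm_handle_event(struct vmm_vcpu *v, enum vcpu_event ev)
{
	assert(v);

	switch (ev) {
	case EV_OTHER_EXIT:
		/* Step 1: create deferred work, enable trapping. */
		v->pending++;
		v->trap_idle = true;
		break;
	case EV_IDLE_EXIT:
		/* Step 3: guest is idle, finish the work from step 1. */
		v->completed += v->pending;
		v->pending = 0;
		/* Step 4: next entry exposes HLT/MWAIT directly again. */
		v->trap_idle = false;
		break;
	}
}
```

A vCPU loop built on this would copy trap_idle into the entry controls 
before each run, so trapping stays off whenever nothing is queued.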

That means on top of (or instead of?) the bits you have below that indicate 
"Should I exit to user space?", what you really need are bits that act as a 
lightweight version of enable_cap(KVM_CAP_X86_DISABLE_EXITS): they disable 
HLT/MWAIT trapping temporarily.

Also, please keep in mind that you still would need a fallback mechanism 
to run your "deferred work" even when the guest does not call HLT/MWAIT, 
like a regular timer in your main thread.
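
A rough sketch of such a fallback, assuming the deferred work carries a 
deadline: the idle exit path runs it opportunistically, and a periodic 
timer in the main thread runs it once the deadline has passed. All names 
here (struct deferred_work, run_on_idle_exit(), run_on_timer()) are 
made up for illustration, and locking between the two threads is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one piece of deferred work. */
struct deferred_work {
	bool pending;
	uint64_t deadline_ns;	/* latest time the work may start */
};

/* vCPU thread, on a HLT/MWAIT exit: run the work if any is queued. */
static bool run_on_idle_exit(struct deferred_work *w)
{
	assert(w);
	if (!w->pending)
		return false;
	w->pending = false;	/* the work itself would run here */
	return true;
}

/* Main thread timer: run the work only if the guest never went idle. */
static bool run_on_timer(struct deferred_work *w, uint64_t now_ns)
{
	assert(w);
	if (!w->pending || now_ns < w->deadline_ns)
		return false;
	w->pending = false;	/* the work itself would run here */
	return true;
}
```

Whichever path fires first clears the pending flag, so the work runs 
once even if the guest never executes HLT/MWAIT.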

On top of all this, I'm not sure it's more efficient to trap to 
the vCPU thread than to just create a separate real thread. Your 
main problem is that MWAIT can't be emulated properly, which leaves "no time" 
to do deferred work. But then again, if your deferred work is so complex 
that it needs more than a few ms (which you can always steal from the 
vCPU thread, especially with yield()), you'll need to start implementing 
time slicing of that work in user space next, and basically rebuild 
your own scheduler there. Ugh.

IMHO the real core value of this idea would be in a kvm_run bit that on 
KVM_RUN can toggle the HLT/MWAIT intercept on and off. As for the actual 
trap to user space, you're most likely better off with a separate thread.


Alex


>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a6582c1fd8b9..8f931539114a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2128,9 +2128,23 @@ static int kvm_emulate_monitor_mwait(struct kvm_vcpu *vcpu, const char *insn)
>   	pr_warn_once("%s instruction emulated as NOP!\n", insn);
>   	return kvm_emulate_as_nop(vcpu);
>   }
> +
>   int kvm_emulate_mwait(struct kvm_vcpu *vcpu)
>   {
> -	return kvm_emulate_monitor_mwait(vcpu, "MWAIT");
> +	int ret = kvm_emulate_monitor_mwait(vcpu, "MWAIT");
> +
> +	if (ret && kvm_userspace_exit(vcpu, KVM_EXIT_MWAIT)) {
> +		vcpu->run->exit_reason = KVM_EXIT_MWAIT;
> +		ret = 0;
> +	} else {
> +		/*
> +		 * Calling yield() has poorly defined semantics, but the
> +		 * guest is in a busy loop and it's the best we can do
> +		 * without a full emulation of MONITOR/MWAIT.
> +		 */
> +		yield();
> +	}
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(kvm_emulate_mwait);
>   
> @@ -4554,6 +4568,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   				r |= KVM_X86_DISABLE_EXITS_MWAIT;
>   		}
>   		break;
> +	case KVM_CAP_X86_USERSPACE_EXITS:
> +		r = KVM_X86_USERSPACE_VALID_EXITS;
> +		break;
>   	case KVM_CAP_X86_SMM:
>   		if (!IS_ENABLED(CONFIG_KVM_SMM))
>   			break;
> @@ -9643,11 +9660,11 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
>   	++vcpu->stat.halt_exits;
>   	if (lapic_in_kernel(vcpu)) {
>   		vcpu->arch.mp_state = state;
> -		return 1;
> -	} else {
> -		vcpu->run->exit_reason = reason;
> -		return 0;
> +		if (!kvm_userspace_exit(vcpu, reason))
> +			return 1;
>   	}
> +	vcpu->run->exit_reason = reason;
> +	return 0;
>   }
>   
>   int kvm_emulate_halt_noskip(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1e7be1f6ab29..ce10a809151c 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -430,6 +430,19 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
>   	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
>   }
>   
> +static inline bool kvm_userspace_exit(struct kvm_vcpu *vcpu, int reason)
> +{
> +	if (reason == KVM_EXIT_HLT &&
> +	    (vcpu->run->userspace_exits & KVM_X86_USERSPACE_EXIT_HLT))
> +		return true;
> +
> +	if (reason == KVM_EXIT_MWAIT &&
> +	    (vcpu->run->userspace_exits & KVM_X86_USERSPACE_EXIT_MWAIT))
> +		return true;
> +
> +	return false;
> +}
> +
>   enum kvm_intr_type {
>   	/* Values are arbitrary, but must be non-zero. */
>   	KVM_HANDLING_IRQ = 1,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 13065dd96132..43d94d49fc24 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_RISCV_SBI        35
>   #define KVM_EXIT_RISCV_CSR        36
>   #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MWAIT            38
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -283,7 +284,8 @@ struct kvm_run {
>   	/* in */
>   	__u8 request_interrupt_window;
>   	__u8 immediate_exit;
> -	__u8 padding1[6];
> +	__u8 userspace_exits;
> +	__u8 padding1[5];
>   
>   	/* out */
>   	__u32 exit_reason;
> @@ -841,6 +843,11 @@ struct kvm_ioeventfd {
>                                                 KVM_X86_DISABLE_EXITS_PAUSE | \
>                                                 KVM_X86_DISABLE_EXITS_CSTATE)
>   
> +#define KVM_X86_USERSPACE_EXIT_MWAIT	     (1 << 0)
> +#define KVM_X86_USERSPACE_EXIT_HLT	     (1 << 1)
> +#define KVM_X86_USERSPACE_VALID_EXITS        (KVM_X86_USERSPACE_EXIT_MWAIT | \
> +                                              KVM_X86_USERSPACE_EXIT_HLT)
> +
>   /* for KVM_ENABLE_CAP */
>   struct kvm_enable_cap {
>   	/* in */
> @@ -1192,6 +1199,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_COUNTER_OFFSET 227
>   #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>   #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
> +#define KVM_CAP_X86_USERSPACE_EXITS 230
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
>


