Message-ID: <db756c13-eee5-414a-a28d-2ce08e7b77d9@amazon.de>
Date:   Mon, 18 Sep 2023 13:59:50 +0200
From:   Alexander Graf <graf@...zon.de>
To:     David Woodhouse <dwmw2@...radead.org>, <kvm@...r.kernel.org>,
        "Peter Zijlstra" <peterz@...radead.org>
CC:     Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, <x86@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>, <linux-kernel@...r.kernel.org>,
        Nicolas Saenz Julienne <nsaenz@...zon.es>,
        "Griffoul, Fred" <fgriffo@...zon.com>
Subject: Re: [RFC] KVM: x86: Allow userspace exit on HLT and MWAIT,
 else yield on MWAIT


On 18.09.23 13:10, David Woodhouse wrote:
> On Mon, 2023-09-18 at 11:41 +0200, Alexander Graf wrote:
>> IIUC you want to do work in a user space vCPU thread when the guest vCPU
>> is idle. As you pointed out above, KVM cannot actually do much about
>> MWAIT: it basically busy-loops and hogs the CPU.
> Well.. I suspect what I *really* want is a decent way to emulate MWAIT
> properly and let it actually sleep. Or failing that, to declare that we
> can actually change the guest-visible experience when those guests are
> migrated to KVM, and take away MWAIT completely.
>
>> The typical flow I would expect for "work in a vCPU thread" is:
>>
>> 0) vCPU runs. HLT/MWAIT is directly exposed to guest.
>> 1) vCPU exits. Creates deferred work. Enables HLT/MWAIT trapping.
> That can happen, but it may also be a separate I/O thread which
> receives an eventfd notification and finds that there is now work to be
> done. If that work can be fairly much instantaneous, it can be done
> immediately. Else it gets deferred to what we Linux hackers might think
> of as a workqueue.
>
> If all the vCPUs are in HLT when the work queue becomes non-empty, we'd
> need to prod them *all* to change their exit-on-{HLT,MWAIT} status when
> work becomes available, just in case one of them becomes idle and can
> process the work "for free" using idle cycles.
>
>> 2) vCPU runs again
>> 3) vCPU calls HLT/MWAIT. We exit to user space to finish work from 1
>> 4) vCPU runs again without HLT/MWAIT trapping
>>
>> That means on top of (or instead of?) the bits you have below that indicate
>> "Should I exit to user space?", what you really need are bits that do
>> what enable_cap(KVM_CAP_X86_DISABLE_EXITS) does, but light-weight:
>> temporarily disable HLT/MWAIT trapping.
> If I do it that way, yes. A lightweight way to enable/disable the exits
> even to kernel would be a nice to have. But it's a trade-off. For HLT
> you'd get lower latency re-entering the vCPU at a cost of much higher
> latency processing work if the vCPU was *already* in HLT.
>
> We probably would want to stop burning power in the MWAIT loop though,
> and let the pCPU sit in the guest in MWAIT if there really is nothing
> else to do.
>
> We're experimenting with various permutations.
>
>> Also, please keep in mind that you still would need a fallback mechanism
>> to run your "deferred work" even when the guest does not call HLT/MWAIT,
>> like a regular timer in your main thread.
> Yeah. In that case I think the ideal answer is that we let the kernel
> scheduler sort it out. I was thinking of a model where we have I/O (or
> workqueue) threads in *addition* to the userspace exits on idle. The
> separate threads own the work (and a number of them are woken according
> to the queue depth), and idle vCPUs *opportunistically* process work
> items on top of that.
>
> That approach alone would work fine with the existing HLT scheduling;
> it's just MWAIT which is a pain because yield() doesn't really do much
> (but as noted, it's better than *nothing*).
>
>> On top of all this, I'm not sure it's more efficient to do the trap to
>> the vCPU thread compared to just creating a separate real thread. Your
>> main problem is the emulatability of MWAIT because that leaves "no time"
>> to do deferred work. But then again, if your deferred work is so complex
>> that it needs more than a few ms (which you can always steal from the
>> vCPU thread, especially with yield()), you'll need to start implementing
>> time slicing of that work in user space next - and basically rebuild
>> your own scheduler there. Ugh.
>>
>> IMHO the real core value of this idea would be a vcpu_run bit that can
>> toggle the HLT/MWAIT intercepts on and off at VCPU_RUN time. For the actual
>> trap to user space, you're most likely better off with a separate thread.
> No, that's very much not the point. The problem is that yield() doesn't
> work well enough — and isn't designed or guaranteed to do anything in
> particular for most cases. It's better than *nothing* but we want the
> opportunity to do the actual work right there in the *loop* of the
> guest bouncing through MWAIT.


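Just so we're talking about the same loop, I read that as roughly the 
following on the VMM side. Only a sketch: KVM_EXIT_MWAIT_USER is a made-up 
name for whatever exit reason your RFC ends up defining, and the work 
queue handling is hand-waved.

/* Sketch of "do the work right in the loop of the guest bouncing through
 * MWAIT": drain queued work on every MWAIT exit, yield when there is
 * nothing to do, re-enter. KVM_EXIT_MWAIT_USER is a made-up name for
 * whatever your RFC's exit reason ends up being; try_one_work_item() is
 * hand-waved.
 */
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sched.h>
#include <stdbool.h>

extern bool try_one_work_item(void);    /* placeholder: one queued item */
#define KVM_EXIT_MWAIT_USER  64         /* made-up exit reason */

static void mwait_bounce_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                ioctl(vcpu_fd, KVM_RUN, 0);

                if (run->exit_reason == KVM_EXIT_MWAIT_USER) {
                        /* Guest is "idle" in MWAIT: make progress on
                         * deferred work, else give the pCPU away for a
                         * bit before re-entering the busy MWAIT loop. */
                        if (!try_one_work_item())
                                sched_yield();
                        continue;
                }
                /* ... handle the other exit reasons here ... */
        }
}
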
The problem with MWAIT is that you don't really know when it's done.

You could find out by making the MONITOR'ed pages(!) read-only, so that 
any write to them faults and lets you wake up the target vCPU that's in 
MWAIT, but that gets quite expensive if you want to do it well.
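
If you want to experiment with that, userfaultfd's write-protect mode is 
probably the least painful way to try it from the VMM: arm the MONITOR'ed 
page as WP and, when the write fault comes in, kick the MWAITing vCPU and 
drop the protection again. Untested sketch; kick_vcpu_mwait() is a 
stand-in for however you force the vCPU thread out of KVM_RUN.

/* Rough sketch: write-protect a MONITOR'ed guest page with userfaultfd-WP
 * and wake the MWAITing vCPU when someone writes to it. Untested;
 * kick_vcpu_mwait() is a placeholder for however you force the vCPU
 * thread out of KVM_RUN (signal + kvm_run->immediate_exit, ...).
 */
#include <linux/userfaultfd.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>

extern void kick_vcpu_mwait(uint64_t hva);      /* placeholder */

static int uffd;

static int wp_monitor_page(uint64_t hva, uint64_t len)
{
        struct uffdio_api api = { .api = UFFD_API,
                                  .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
        struct uffdio_register reg = {
                .range = { .start = hva, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        struct uffdio_writeprotect wp = {
                .range = { .start = hva, .len = len },
                .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };

        uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
            ioctl(uffd, UFFDIO_REGISTER, &reg) ||
            ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                return -1;
        return 0;
}

static void *wp_fault_thread(void *arg)
{
        for (;;) {
                struct uffd_msg msg;

                if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                        continue;
                if (msg.event != UFFD_EVENT_PAGEFAULT ||
                    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
                        continue;

                /* Someone wrote the armed line: wake the vCPU ... */
                kick_vcpu_mwait(msg.arg.pagefault.address);

                /* ... and let the writer make progress again. */
                struct uffdio_writeprotect unwp = {
                        .range = { .start = msg.arg.pagefault.address & ~0xfffULL,
                                   .len   = 0x1000 },
                        .mode  = 0,
                };
                ioctl(uffd, UFFDIO_WRITEPROTECT, &unwp);
        }
        return NULL;
}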

You could also burn one VM- or system-wide CPU that does nothing but wait 
for changes to any of the MONITOR'ed cache lines. Doable with less power 
consumption if you use TSX, I guess. But probably not what you want either.
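
For completeness, the TSX flavour would be a dedicated watcher thread that 
pulls all the armed lines into one transaction's read set and sits there 
until a conflicting write aborts it. Untested sketch: it needs RTM (which 
also aborts for plenty of unrelated reasons), and armed_line[] and 
kick_vcpu() are placeholders.

/* Sketch of "one CPU watches all MONITOR'ed lines via TSX": any remote
 * write to a line in the transaction's read set aborts the transaction
 * and drops us into the path below the if(), where we figure out which
 * vCPU to wake. Untested; armed_line[] and kick_vcpu() are placeholders.
 * Build with -mrtm.
 */
#include <immintrin.h>
#include <stdint.h>

#define NR_VCPUS 8

extern volatile uint64_t *armed_line[NR_VCPUS];  /* placeholder: MONITOR targets */
extern void kick_vcpu(int idx);                  /* placeholder */

static void tsx_watcher(void)
{
        uint64_t snap[NR_VCPUS];

        for (;;) {
                for (int i = 0; i < NR_VCPUS; i++)
                        snap[i] = armed_line[i] ? *armed_line[i] : 0;

                if (_xbegin() == _XBEGIN_STARTED) {
                        /* Pull every armed line into the read set. */
                        for (int i = 0; i < NR_VCPUS; i++)
                                if (armed_line[i])
                                        (void)*armed_line[i];
                        /* Spin inside the transaction: a conflicting write
                         * (or an interrupt, timer tick, ...) aborts us and
                         * execution resumes after the if(). We never commit.
                         * PAUSE would abort immediately, hence a plain spin. */
                        for (volatile unsigned long spin = 0; ; spin++)
                                ;
                }

                /* Aborted: see whose line changed and wake that vCPU. */
                for (int i = 0; i < NR_VCPUS; i++)
                        if (armed_line[i] && *armed_line[i] != snap[i])
                                kick_vcpu(i);
        }
}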

Another alternative would be to make guests PV-aware, so they know you 
don't actually implement MWAIT and issue a hypercall every time they 
modify anything someone might want to monitor (such as 
thread_info->flags). But that requires new guest kernels. I don't think 
you want to wait for that :).
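
The guest side of that would only be a handful of lines, roughly like 
below. The hypercall number and the hook placement are completely made up; 
it just shows what the guest would do instead of relying on MWAIT semantics.

/* Illustration only: a PV guest that knows MWAIT is not really backed by
 * hardware could tell the hypervisor whenever it writes a location some
 * sleeping vCPU may be monitoring. Hypercall number and hook placement
 * are made up.
 */
#include <linux/kvm_para.h>

#define KVM_HC_MONITOR_WRITE	42	/* made-up hypercall number */

static inline void pv_notify_monitor_write(unsigned long addr)
{
	/* E.g. called right after setting TIF_NEED_RESCHED in the target
	 * task's thread_info->flags, instead of expecting the write itself
	 * to break the remote vCPU out of MWAIT. */
	if (kvm_para_available())
		kvm_hypercall1(KVM_HC_MONITOR_WRITE, addr);
}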

So in a nutshell, emulating MWAIT properly is just super difficult. If 
you have even the remotest chance of getting away with HLT instead, I'd 
take that. In that model, an I/O thread that gets scheduled whenever the 
vCPU threads are idle becomes natural.
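
Roughly along these lines; again only a sketch, assuming your RFC's 
exit-to-userspace on HLT is enabled, with the work queue plumbing and 
re-entry details hand-waved.

/* Sketch of the HLT-based model: dedicated I/O threads own the work
 * queue, and a vCPU that exits to userspace on HLT opportunistically
 * drains it before blocking again. Assumes HLT exits to userspace as in
 * the RFC; queueing/allocation, error handling and the other exit
 * reasons are omitted.
 */
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <pthread.h>
#include <stddef.h>

struct work_item {
        struct work_item *next;
        void (*fn)(void *);
        void *arg;
};

static pthread_mutex_t wq_lock = PTHREAD_MUTEX_INITIALIZER;
static struct work_item *wq_head;       /* filled by the I/O threads */

static struct work_item *wq_pop(void)
{
        pthread_mutex_lock(&wq_lock);
        struct work_item *w = wq_head;
        if (w)
                wq_head = w->next;
        pthread_mutex_unlock(&wq_lock);
        return w;
}

static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                ioctl(vcpu_fd, KVM_RUN, 0);

                switch (run->exit_reason) {
                case KVM_EXIT_HLT:
                        /* Guest is idle: steal the idle cycles for queued
                         * work before the thread goes back to blocking in
                         * the halted state. */
                        for (struct work_item *w; (w = wq_pop()); )
                                w->fn(w->arg);
                        break;
                /* ... MMIO, IO, etc. ... */
                }
        }
}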


Alex





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879

