Message-ID: <YzKZExaU2k7qfcS9@gao-cwp>
Date: Tue, 27 Sep 2022 14:32:51 +0800
From: Chao Gao <chao.gao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<jon@...anix.com>, <kevin.tian@...el.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, <x86@...nel.org>,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [RFC v2] KVM: x86/vmx: Suppress posted interrupt notification
when CPU is in host
On Mon, Sep 26, 2022 at 04:19:52PM +0000, Sean Christopherson wrote:
>On Fri, Sep 23, 2022, Chao Gao wrote:
>> Set PID.SN right after VM exits and clear it before VM entry to minimize
>> the chance of hardware issuing PINs to a CPU when it's in host.
>
>Toggling PID.SN as close to the world switch as possible is undesirable. If KVM
>re-enters the guest without enabling IRQs, i.e. handles the VM-Exit in the fastpath,
>then the notification IRQ will be delivered in the guest.
>
>The natural location to do the toggling is when KVM "toggles" software, i.e. when
>KVM sets IN_GUEST_MODE (clear SN) and OUTSIDE_GUEST_MODE (set SN).
This makes sense to me.
>
>I believe that would also obviate the need to manually send a PI Notification IRQ,
>as the existing ->sync_pir_to_irr() call that exists to handle exactly this case
>(notification not sent or handled in host) would ensure any outstanding posted IRQ
>gets moved to the IRR and processed accordingly.
>
>> Opportunistically clean up vmx_vcpu_pi_put(); when a vCPU is preempted,
>
>Uh uh, this patch is already way, way too subtle and complex to tack on clean up.
>"Opportunistic" clean up is for cases where the clean up is a pure refactoring
>and/or has zero impact on functionality.
Got it. Will move this cleanup to a separate patch if it is still needed.
>
>> it is pointless to update PID.NV to wakeup vector since notification is
>> anyway suppressed. And since PID.SN should be already set for running
>> vCPUs, so, don't set it again for preempted vCPUs.
>
>I'm pretty sure this is wrong. If the vCPU is preempted between prepare_to_rcuwait()
>and schedule(), then skipping pi_enable_wakeup_handler() will hang the guest if
>the wakeup event is a posted IRQ and the event arrives while the vCPU is preempted.
Thanks for pointing out this subtle case.
My understanding is that eventually there will be a call of vmx_vcpu_pi_put()
with preempted=false (I assume that preempted vCPUs will be scheduled
at some later point). In that case, pi_enable_wakeup_handler() can wake
up the vCPU by sending a self-IPI. Plus, this patch checks PIR instead of
the ON bit, so I don't get why the guest would hang.
>
>> When IPI virtualization is enabled, this patch increases "perf bench" [*]
>> by 6.56%, and PIN count in 1 second drops from tens of thousands to
>> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
>> VM-exit round-trip latency.
>
>The overhead is more than likely due to pi_is_pir_empty() in the VM-Entry path,
>i.e. should in theory go away if PID.SN is clear/set at IN_GUEST_MODE and
>OUTSIDE_GUEST_MODE
I will collect perf data after implementing what you suggested.
>
>> Also honour PID.SN bit in vmx_deliver_posted_interrupt().
>
>Why?
VT-d hardware doesn't set the ON bit if the SN bit is set.
Enforcing the same rule in KVM lets us skip unnecessary work, such as the
subsequent pi_test_and_set_on() and kvm_vcpu_trigger_posted_interrupt().
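To make the intended ordering concrete, here is a toy user-space sketch of a
delivery path that honours SN. It is not the kernel function: struct demo_pid,
demo_deliver_posted_interrupt() and the notifications counter are made up for
illustration, the order of the checks is my assumption, and the atomic
operations the real descriptor needs are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a posted-interrupt descriptor (hypothetical layout). */
struct demo_pid {
	uint64_t pir[4];            /* posted requests, one bit per vector */
	bool on;                    /* outstanding notification */
	bool sn;                    /* suppress notification */
	unsigned int notifications; /* counts simulated notification IPIs */
};

/* Post @vec; return true only if a notification would be sent. */
static bool demo_deliver_posted_interrupt(struct demo_pid *pid,
					  unsigned int vec)
{
	assert(vec < 256);
	uint64_t bit = UINT64_C(1) << (vec % 64);

	/* Vector already pending: nothing more to do. */
	if (pid->pir[vec / 64] & bit)
		return false;
	pid->pir[vec / 64] |= bit;

	/* SN set: mirror the hardware rule, leave ON alone, skip the IPI. */
	if (pid->sn)
		return false;

	/* ON already set: a notification is already outstanding. */
	if (pid->on)
		return false;

	pid->on = true;
	pid->notifications++;
	return true;
}
```

With SN set, the vector is still recorded in PIR, so whoever clears SN later
remains responsible for noticing it.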
>
>> [*] test cmd: perf bench sched pipe -T. Note that we change the source
>> code to pin two threads to two different vCPUs so that it can reproduce
>> stable results.
>>
>> Signed-off-by: Chao Gao <chao.gao@...el.com>
>> ---
>> RFC: I am not sure whether the benefits outweighs the extra VM-exit cost.
>>
>> Changes in v2 (addressed comments from Kevin):
>> - measure/estimate the impact to non-IPC-intensive cases
>> - don't tie PID.SN to vcpu->mode. Instead, clear PID.SN
>> right before VM-entry and set it after VM-exit.
>
>Ah, sorry, missed v1. Rather than key off of IN_GUEST_MODE in the sync path, add
>an explicit kvm_x86_ops hook to perform the transition. I.e. make it explicit.
It is OK to add a separate hook. But the question is how to coordinate clearing
SN with ->sync_pir_to_irr(). Clearing the SN bit may leave the descriptor in a
state where ON and SN are both clear but some outstanding IRQs are still left
in PIR. The current ->sync_pir_to_irr() doesn't sync those IRQs to IRR in this
case. Here are two options to fix the problem:
1) clear SN with the new hook, then set ON bit if there is any outstanding IRQ.
No change to ->sync_pir_to_irr() is needed.
2) clear SN with the new hook, add a force mode to ->sync_pir_to_irr() where
PIR is synced to IRR regardless of ON/SN bits, and invoke ->sync_pir_to_irr()
on the VM-entry path with force_mode=true.
Both may lead to an extra check of PIR.
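The two options can be compared with a toy user-space model. This is only a
sketch under simplifying assumptions: struct toy_pid and every toy_* helper are
invented for illustration (they are not the kernel's pi_desc helpers), the
model is single-threaded, and it ignores the atomic updates the real
descriptor requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_VECTORS 256
#define TOY_PIR_WORDS  (TOY_NR_VECTORS / 64)

/* Toy model of a posted-interrupt descriptor; not struct pi_desc. */
struct toy_pid {
	uint64_t pir[TOY_PIR_WORDS]; /* posted requests, one bit per vector */
	bool on;                     /* outstanding notification */
	bool sn;                     /* suppress notification */
};

static bool toy_pir_empty(const struct toy_pid *pid)
{
	for (int i = 0; i < TOY_PIR_WORDS; i++)
		if (pid->pir[i])
			return false;
	return true;
}

/* A device posts a vector: PIR is always updated, ON only if SN is clear. */
static void toy_post(struct toy_pid *pid, unsigned int vec)
{
	assert(vec < TOY_NR_VECTORS);
	pid->pir[vec / 64] |= UINT64_C(1) << (vec % 64);
	if (!pid->sn)
		pid->on = true;
}

/*
 * Move PIR into @irr and return true if any vector was moved.
 * Without @force the sync is skipped when ON is clear (current behaviour
 * as described above); with @force PIR is always scanned (option 2).
 */
static bool toy_sync_pir_to_irr(struct toy_pid *pid, uint64_t *irr, bool force)
{
	bool moved = false;

	if (!force && !pid->on)
		return false;

	pid->on = false;
	for (int i = 0; i < TOY_PIR_WORDS; i++) {
		if (pid->pir[i])
			moved = true;
		irr[i] |= pid->pir[i];
		pid->pir[i] = 0;
	}
	return moved;
}

/* Option 1: clear SN, then set ON if anything was posted while suppressed. */
static void toy_clear_sn_and_fixup_on(struct toy_pid *pid)
{
	pid->sn = false;
	if (!toy_pir_empty(pid))
		pid->on = true;
}
```

In this model, a vector posted while SN is set is missed by a non-forced sync
after a bare "sn = false", but is picked up either after
toy_clear_sn_and_fixup_on() (option 1) or by a forced sync (option 2). Both
paths scan PIR once more, which is the extra check mentioned above.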
>> @@ -101,11 +95,16 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>> new.control = old.control;
>>
>> /*
>> - * Clear SN (as above) and refresh the destination APIC ID to
>> - * handle task migration (@cpu != vcpu->cpu).
>> + * Set SN and refresh the destination APIC ID to handle
>> + * task migration (@cpu != vcpu->cpu).
>> + *
>> + * SN is cleared when a vCPU goes to blocked state so that
>> + * the blocked vCPU can be waken up on receiving a
>> + * notification. For a running/runnable vCPU, such
>> + * notifications are useless. Set SN bit to suppress them.
>> */
>> new.ndst = dest;
>> - new.sn = 0;
>> + new.sn = 1;
>
>To handle the preempted case, I believe the correct behavior is to leave SN
>as-is, although that would require setting SN=1 during vCPU creation. Arguably
>KVM should do that anyways when APICv is enabled.
>
>Hmm, or alternatively this should do the same?
>
> new.sn = !kvm_vcpu_is_blocking();
I don't get this. Probably I am misunderstanding something about vCPU preemption.
>
>> @@ -172,8 +160,10 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>> * enabled until it is safe to call try_to_wake_up() on the task being
>> * scheduled out).
>> */
>> - if (pi_test_on(&new))
>> + if (!pi_is_pir_empty(pi_desc)) {
>> + pi_set_on(pi_desc);
>
>As much as I wish we could get rid of kvm_arch_vcpu_blocking(), I actually think
>this would be a good application of that hook. If PID.SN is cleared during
>kvm_arch_vcpu_blocking() and set during kvm_arch_vcpu_unblocking(), then I believe
>there's no need to manually check the PIR here, as any IRQ that isn't detected by
>kvm_vcpu_check_block() is guaranteed to set PID.ON=1.
Using kvm_arch_vcpu_blocking() has the same problem as using a new hook
for the VM-entry path: we need either a force mode for ->sync_pir_to_irr() or
to set the ON bit if there is any outstanding IRQ right after clearing SN.
The former may help performance a little, but since the call of
->sync_pir_to_irr() in kvm_vcpu_check_block() is so far away from the
place where SN is cleared, I think this would be a source of bugs.
The latter has no benefit compared to what this patch does here.