linux-kernel - Re: [RFC v2] KVM: x86/vmx: Suppress posted interrupt notification when CPU is in host

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YzKZExaU2k7qfcS9@gao-cwp>
Date:   Tue, 27 Sep 2022 14:32:51 +0800
From:   Chao Gao <chao.gao@...el.com>
To:     Sean Christopherson <seanjc@...gle.com>
CC:     <kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <jon@...anix.com>, <kevin.tian@...el.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, <x86@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [RFC v2] KVM: x86/vmx: Suppress posted interrupt notification
 when CPU is in host

On Mon, Sep 26, 2022 at 04:19:52PM +0000, Sean Christopherson wrote:
>On Fri, Sep 23, 2022, Chao Gao wrote:
>> Set PID.SN right after VM exits and clear it before VM entry to minimize
>> the chance of hardware issuing PINs to a CPU when it's in host.
>
>Toggling PID.SN as close to the world switch as possible is undesirable.  If KVM
>re-enters the guest without enabling IRQs, i.e. handles the VM-Exit in the fastpath,
>then the notification IRQ will be delivered in the guest.
>
>The natural location to do the toggling is when KVM "toggles" software, i.e. when
>KVM sets IN_GUEST_MODE (clear SN) and OUTSIDE_GUEST_MODE (set SN).

This makes sense to me.

>
>I believe that would also obviate the need to manually send a PI Notification IRQ,
>as the existing ->sync_pir_to_irr() call that exists to handle exactly this case
>(notification not sent or handled in host) would ensure any outstanding posted IRQ
>gets moved to the IRR and processed accordingly.
>
>> Opportunistically clean up vmx_vcpu_pi_put(); when a vCPU is preempted,
>
>Uh uh, this patch is already way, way too subtle and complex to tack on clean up.
>"Opportunistic" clean up is for cases where the clean up is a pure refactoring
>and/or has zero impact on functionality.

Got it. Will move this cleanup to a separate patch if it is still needed.

>
>> it is pointless to update PID.NV to wakeup vector since notification is
>> anyway suppressed. And since PID.SN should be already set for running
>> vCPUs, so, don't set it again for preempted vCPUs.
>
>I'm pretty sure this is wrong.  If the vCPU is preempted between prepare_to_rcuwait()
>and schedule(), then skipping pi_enable_wakeup_handler() will hang the guest if
>the wakeup event is a posted IRQ and the event arrives while the vCPU is preempted.

Thanks for pointing out this subtle case.

My understanding is finally there will be a call of vmx_vcpu_pi_put()
with preempted=false (I assume that preempted vCPUs will be scheduled
at some later point). In that case, pi_enable_wakeup_handler() can wake
up the vCPU by sending a self-ipi. Plus this patch checks PIR instead of
ON bit, I don't get why the guest will hang.

>
>> When IPI virtualization is enabled, this patch increases "perf bench" [*]
>> by 6.56%, and PIN count in 1 second drops from tens of thousands to
>> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
>> VM-exit round-trip latency.
>
>The overhead is more than likely due to pi_is_pir_empty() in the VM-Entry path,
>i.e. should in theory go away if PID.SN is clear/set at IN_GUEST_MODE and
>OUTSIDE_GUEST_MODE

I will collect perf data after implementing what you suggested.

>
>> Also honour PID.SN bit in vmx_deliver_posted_interrupt().
>
>Why?

VT-d hardware doesn't set ON bit if SN bit is set.

Enforce the same rule in KVM can skip unnecessary work, like the
following pi_test_and_set_on() and kvm_vcpu_trigger_posted_interrupt().

>
>> When IPI virtualization is enabled, this patch increases "perf bench" [*]
>> by 6.56%, and PIN count in 1 second drops from tens of thousands to
>> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
>> VM-exit round-trip latency.
>> 
>> [*] test cmd: perf bench sched pipe -T. Note that we change the source
>> code to pin two threads to two different vCPUs so that it can reproduce
>> stable results.
>> 
>> Signed-off-by: Chao Gao <chao.gao@...el.com>
>> ---
>> RFC: I am not sure whether the benefits outweighs the extra VM-exit cost.
>> 
>> Changes in v2 (addressed comments from Kevin):
>> - measure/estimate the impact to non-IPC-intensive cases
>> - don't tie PID.SN to vcpu->mode. Instead, clear PID.SN
>>   right before VM-entry and set it after VM-exit.
>
>Ah, sorry, missed v1.  Rather than key off of IN_GUEST_MODE in the sync path, add
>an explicit kvm_x86_ops hook to perform the transition.  I.e. make it explict.

It is ok to add a separate hook. But the question is how to coordinate clearing
SN with ->sync_pir_to_irr(). Clearing SN bit may put PIR in a state where ON/SN
are cleared but some outstanding IRQs left there. Current ->sync_pir_to_irr()
doesn't sync those IRQs to IRR in this case. Here are two options to fix the
problem:

1) clear SN with the new hook, then set ON bit if there is any outstanding IRQ.
No change to ->sync_pir_to_irr() is needed.

2) clear SN with the new hook, add a force mode to ->sync_pir_to_irr() where
PIR is synced to IRR regardless of ON/SN bits, inovke ->sync_pir_to_irr()
on VM-entry path with force_mode=true.

Both may lead to an extra check of PIR.

>> @@ -101,11 +95,16 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>>  		new.control = old.control;
>>  
>>  		/*
>> -		 * Clear SN (as above) and refresh the destination APIC ID to
>> -		 * handle task migration (@cpu != vcpu->cpu).
>> +		 * Set SN and refresh the destination APIC ID to handle
>> +		 * task migration (@cpu != vcpu->cpu).
>> +		 *
>> +		 * SN is cleared when a vCPU goes to blocked state so that
>> +		 * the blocked vCPU can be waken up on receiving a
>> +		 * notification. For a running/runnable vCPU, such
>> +		 * notifications are useless. Set SN bit to suppress them.
>>  		 */
>>  		new.ndst = dest;
>> -		new.sn = 0;
>> +		new.sn = 1;
>
>To handle the preempted case, I believe the correct behavior is to leave SN
>as-is, although that would require setting SN=1 during vCPU creation.  Arguably
>KVM should do that anyways when APICv is enabled.
>
>Hmm, or alternatively this should do the same?
>
>		new.sn = !kvm_vcpu_is_blocking();

I don't get this. Probably I am misunderstanding something about vCPU preemption.

>
>> @@ -172,8 +160,10 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>>  	 * enabled until it is safe to call try_to_wake_up() on the task being
>>  	 * scheduled out).
>>  	 */
>> -	if (pi_test_on(&new))
>> +	if (!pi_is_pir_empty(pi_desc)) {
>> +		pi_set_on(pi_desc);
>
>As much as I wish we could get rid of kvm_arch_vcpu_blocking(), I actually think
>this would be a good application of that hook.  If PID.SN is cleared during
>kvm_arch_vcpu_blocking() and set during kvm_arch_vcpu_unblocking(), then I believe
>there's no need to manually check the PIR here, as any IRQ that isn't detected by
>kvm_vcpu_check_block() is guaranteed to set PID.ON=1.

Using kvm_arch_vcpu_blocking() has the same problem as using a new hook
for the VM-entry path: we need a force mode for ->sync_pir_to_irr() or
set ON bit if there is any outstanding IRQ right after clearing SN

The former may help performance a little but since the call of
->sync_pir_to_irr() in kvm_vcpu_check_block() is so far away from the
place where SN is cleared, I think this would be a source of bugs.

The latter has no benefit compared to what this patch does here.