linux-kernel - Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM


Message-ID: <4274f9be-1c3d-4246-abe9-69c4d8ca8964@oracle.com>
Date: Tue, 24 Sep 2024 17:59:39 -0400
From: boris.ostrovsky@...cle.com
To: Igor Mammedov <imammedo@...hat.com>
Cc: Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Eric Mackay <eric.mackay@...cle.com>
Subject: Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM



On 9/24/24 5:40 AM, Igor Mammedov wrote:
> On Fri, 19 Apr 2024 12:17:01 -0400
> boris.ostrovsky@...cle.com wrote:
> 
>> On 4/17/24 9:58 AM, boris.ostrovsky@...cle.com wrote:
>>>
>>> I noticed that I was using a few months old qemu bits and now I am
>>> having trouble reproducing this on latest bits. Let me see if I can get
>>> this to fail with latest first and then try to trace why the processor
>>> is in this unexpected state.
>>
>> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
>> under if (!dev) in qmp_device_add()" is what makes the test stop failing.
>>
>> I need to understand whether lack of failures is a side effect of timing
>> changes that simply make hotplug fail less likely or if this is an
>> actual (but seemingly unintentional) fix.
> 
> Agreed, we should find out culprit of the problem.


Unfortunately I haven't been able to spend much time on this. Eric is
now starting to look at it again.

One of my theories was that ich9_apm_ctrl_changed() sends SMIs to the
vcpus serially, whereas on HW my understanding is that this is done as
a broadcast, so I thought this could cause a race. I ran a quick test
that paused and resumed all vcpus around the loop, but that didn't help.
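
To make the suspected window concrete, here is a toy model of serial
delivery with and without pausing around the loop. This is not QEMU
code: struct toy_vcpu, toy_deliver_smis() and the "exposure" count are
all made up for illustration, and real vcpus run concurrently rather
than being inspected between loop iterations. It only sketches why
pausing would close the window if serial delivery were the cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy vCPU state; not a QEMU or KVM structure. */
struct toy_vcpu {
    bool paused;       /* vCPU is stopped and cannot observe anything */
    bool smi_pending;  /* an SMI has been queued for this vCPU */
};

/* Count vCPUs that are still running without an SMI queued. */
static size_t count_exposed(const struct toy_vcpu *v, size_t n)
{
    size_t exposed = 0;

    for (size_t i = 0; i < n; i++) {
        if (!v[i].paused && !v[i].smi_pending)
            exposed++;
    }
    return exposed;
}

/*
 * Queue an SMI on each vCPU one at a time. After each step, add up how
 * many vCPUs are running while other vCPUs already have an SMI queued.
 * A non-zero result means some vCPU could run during partial delivery.
 */
size_t toy_deliver_smis(struct toy_vcpu *v, size_t n, bool pause_around_loop)
{
    size_t exposure = 0;

    assert(v != NULL);

    if (pause_around_loop) {
        for (size_t i = 0; i < n; i++)
            v[i].paused = true;
    }

    for (size_t i = 0; i < n; i++) {
        v[i].smi_pending = true;
        exposure += count_exposed(v, n);
    }

    if (pause_around_loop) {
        for (size_t i = 0; i < n; i++)
            v[i].paused = false;
    }
    return exposure;
}
```

In this model, plain serial delivery leaves a non-zero exposure, while
pausing around the loop brings it to zero. Since the real test with
pausing still failed, the model only describes the theory, not the
actual cause.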


> 
> PS:
> also if you are using AMD host, there was a regression in OVMF
> where a vCPU that OSPM was already online-ing was yanked
> from under OSPM's feet by OVMF (which, depending on timing, could
> manifest as a lost SIPI).
> 
> edk2 commit that should fix it is:
>      https://github.com/tianocore/edk2/commit/1c19ccd5103b
> 
> Switching to Intel host should rule that out at least.
> (or use fixed edk2-ovmf-20240524-5.el10.noarch package from centos,
> if you are forced to use AMD host)

I just tried with the latest bits, which include this commit, and was
still able to reproduce the problem.


-boris