Message-ID: <d183c3f2-d94d-5f22-184d-eab80f9d0fe8@amd.com>
Date: Wed, 23 Aug 2023 11:18:05 -0500
From: Tom Lendacky <thomas.lendacky@....com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, Wu Zongyo <wuzongyo@...l.ustc.edu.cn>
Subject: Re: [PATCH 0/2] KVM: SVM: Fix unexpected #UD on INT3 in SEV guests
On 8/22/23 10:14, Sean Christopherson wrote:
> On Tue, Aug 22, 2023, Tom Lendacky wrote:
>> On 8/10/23 18:49, Sean Christopherson wrote:
>>> Fix a bug where KVM injects a bogus #UD for SEV guests when trying to skip
>>> an INT3 as part of re-injecting the associated #BP that got kinda sorta
>>> intercepted due to a #NPF occurring while vectoring/delivering the #BP.
>>>
>>> I haven't actually confirmed that patch 1 fixes the bug, as it's a
>>> different change than what I originally proposed. I'm 99% certain it will
>>> work, but I definitely need verification that it fixes the problem.
>>>
>>> Patch 2 is a tangentially related cleanup to make NRIPS a requirement for
>>> enabling SEV, e.g. so that we don't ever get "bug" reports of SEV guests
>>> not working when NRIPS is disabled.
>>>
>>> Sean Christopherson (2):
>>> KVM: SVM: Don't inject #UD if KVM attempts emulation of SEV guest w/o
>>> insn
>>> KVM: SVM: Require nrips support for SEV guests (and beyond)
>>>
>>> arch/x86/kvm/svm/sev.c | 2 +-
>>> arch/x86/kvm/svm/svm.c | 37 ++++++++++++++++++++-----------------
>>> arch/x86/kvm/svm/svm.h | 1 +
>>> 3 files changed, 22 insertions(+), 18 deletions(-)
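(For anyone skimming the thread: the check that patch 1 adjusts is, very
roughly, shaped like the toy sketch below. The names are invented for
illustration and this is not the actual svm.c code; the point is just that
an SEV guest's instruction bytes are encrypted, so with no insn buffer
there is nothing to decode, and the old behavior queued a #UD for the
guest anyway.)

  /* Illustration only -- invented names, not the real KVM/SVM code. */
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  struct toy_vcpu {
          bool sev_guest;         /* guest memory is encrypted */
          bool ud_queued;         /* a #UD is pending injection */
  };

  /* Returns true if emulation can proceed, false if KVM must bail out. */
  static bool toy_can_emulate(struct toy_vcpu *vcpu, const void *insn,
                              bool old_behavior)
  {
          if (vcpu->sev_guest && !insn) {
                  if (old_behavior)
                          vcpu->ud_queued = true; /* the bogus #UD */
                  return false;   /* can't decode encrypted guest text */
          }
          return true;
  }

  int main(void)
  {
          struct toy_vcpu vcpu = { .sev_guest = true, .ud_queued = false };

          /* Old behavior: emulation fails AND a bogus #UD is queued. */
          toy_can_emulate(&vcpu, NULL, true);
          printf("old: #UD queued = %d\n", vcpu.ud_queued);

          /* Described patch 1 behavior: still fails, but no #UD. */
          vcpu.ud_queued = false;
          toy_can_emulate(&vcpu, NULL, false);
          printf("new: #UD queued = %d\n", vcpu.ud_queued);
          return 0;
  }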
>>
>> We ran some stress tests against a version of the kernel without this fix
>> and we're able to reproduce the issue, but not reliably, after a few hours.
>> With this patch, it has not reproduced after running for a week.
>>
>> Not as reliable a scenario as the original reporter, but this looks like it
>> resolves the issue.
>
> Thanks Tom! I'll apply this for v6.6, that'll give us plenty of time to change
> course if necessary.
I may have spoken too soon... When the #UD was triggered, it was here:
[ 0.118524] Spectre V2 : Enabling Restricted Speculation for firmware calls
[ 0.118524] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[ 0.118524] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
[ 0.118524] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 0.118524] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.2-amdsos-build50-ubuntu-20.04+ #1
[ 0.118524] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 0.118524] RIP: 0010:int3_selftest_ip+0x0/0x60
[ 0.118524] Code: b9 25 05 00 00 48 c7 c2 e8 7c 80 b0 48 c7 c6 fe 1c d3 b0 48 c7 c7 f0 7d da b0 e8 4c 2c 0b ff e8 75 da 15 ff 0f 0b 48 8d 7d f4 <cc> 90 90 90 90 83 7d f4 01 74 2f 80 3d 39 7f a8 00 00 74 24 b9 34
Now (after about a week) we've encountered a hang here:
[ 0.106216] Spectre V2 : Enabling Restricted Speculation for firmware calls
[ 0.106216] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[ 0.106216] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
It is in the very same spot, so I wonder if the return false (without
queuing a #UD) is causing an infinite loop that appears as a guest hang.
By contrast, some systems running the first patch that you created have
not hit this hang.
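To make the loop I'm imagining concrete, here is a toy user-space model of
the flow (all names invented; this is not the actual KVM code): if
declining to emulate just re-enters the guest, and the re-entered guest
immediately takes the same #NPF while delivering the #BP, nothing ever
makes forward progress, which from the outside looks exactly like the hang
above.

  /* Toy model only -- invented names, not the real KVM code paths. */
  #include <stdbool.h>
  #include <stdio.h>

  enum emul_result { EMUL_DONE, EMUL_RETRY };

  /* Patch 1 behavior as I understand it: no insn bytes for an SEV guest
   * means "can't emulate, retry the instruction", with no #UD queued. */
  static enum emul_result toy_skip_int3(bool sev_guest, bool have_insn_bytes)
  {
          if (sev_guest && !have_insn_bytes)
                  return EMUL_RETRY;
          return EMUL_DONE;
  }

  int main(void)
  {
          int exits = 0;

          /* Suppose every re-entry hits the same #NPF while vectoring the
           * #BP, so KVM comes right back here with no instruction bytes. */
          while (toy_skip_int3(true, false) == EMUL_RETRY) {
                  if (++exits == 3) {
                          printf("no forward progress after %d exits\n", exits);
                          break;  /* the real guest has no escape hatch */
                  }
          }
          return 0;
  }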
But I'm not sure why or how this patch could cause the guest hang. I
would think that the retry of the instruction would resolve everything
and the guest would continue. Unfortunately, the guest was killed, so I'll
try to reproduce and capture a dump or tracepoint data from the VM to see
what is going on.
Thanks,
Tom