Message-ID: <20250807165950.14953-1-kim.phillips@amd.com>
Date: Thu, 7 Aug 2025 11:59:49 -0500
From: Kim Phillips <kim.phillips@....com>
To: <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
<linux-coco@...ts.linux.dev>, <x86@...nel.org>
CC: Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dave Hansen
<dave.hansen@...ux.intel.com>, Sean Christopherson <seanjc@...gle.com>,
"Paolo Bonzini" <pbonzini@...hat.com>, Ingo Molnar <mingo@...hat.com>, "H.
Peter Anvin" <hpa@...or.com>, Thomas Gleixner <tglx@...utronix.de>, K Prateek
Nayak <kprateek.nayak@....com>, "Nikunj A . Dadhania" <nikunj@....com>, "Tom
Lendacky" <thomas.lendacky@....com>, Michael Roth <michael.roth@....com>,
Ashish Kalra <ashish.kalra@....com>, Borislav Petkov
<borislav.petkov@....com>, Borislav Petkov <bp@...en8.de>, Nathan Fontenot
<nathan.fontenot@....com>, Dhaval Giani <Dhaval.Giani@....com>, "Santosh
Shukla" <santosh.shukla@....com>, Naveen Rao <naveen.rao@....com>, "Gautham R
. Shenoy" <gautham.shenoy@....com>, Ananth Narayan <ananth.narayan@....com>,
Pankaj Gupta <pankaj.gupta@....com>, David Kaplan <david.kaplan@....com>,
"Jon Grimm" <Jon.Grimm@....com>, Kim Phillips <kim.phillips@....com>
Subject: [RFC PATCH 0/1] KVM: SEV: Add support for SMT Protection

On an SMT-enabled system, the SMT Protection feature allows an
SNP guest to demand that each of its vCPU threads run alone on its
physical core.  A guest opts in to this to protect itself against
possible side-channel attacks via shared core resources.

Hardware supports this by requiring the sibling of the vCPU thread
to be in the idle state while the vCPU is running: if hardware
detects that the sibling has not entered the idle state, or has
exited it, the vCPU's VMRUN exits with a new "IDLE_REQUIRED"
status, and the hypervisor is expected to schedule the idle task on
the sibling thread while it resumes the vCPU's VMRUN.
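
Purely to illustrate that flow (the names below are made up for
this sketch only; the real exit code and handler wiring are in the
attached patch and may differ), the hypervisor side amounts to
roughly:

    static int idle_required_interception(struct kvm_vcpu *vcpu)
    {
            /*
             * Hardware refused to run the vCPU because its SMT sibling
             * was not idle.  Nothing to emulate: return to the
             * vcpu_run() loop and retry VMRUN, relying on the scheduler
             * (core scheduling, in this RFC) to idle the sibling again.
             */
            return 1;
    }

    /* ...plus a matching entry in svm_exit_handlers[]: */
            [SVM_EXIT_IDLE_REQUIRED]        = idle_required_interception,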

There is a new HLT_WAKEUP_ICR MSR that the hypervisor programs on
each SMT thread in the system: if an idle sibling of an SMT
Protected guest vCPU receives an interrupt, hardware writes the
HLT_WAKEUP_ICR value to the APIC ICR to 'kick' the vCPU thread out
of VMRUN, and then allows the sibling to exit the idle state and
service its interrupt.
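
Again for illustration only (the MSR index, the x2APIC-style ICR
layout and the vector below are placeholders, not definitions from
the APM), the per-thread setup could look something like:

    static void set_hlt_wakeup_icr(void *unused)
    {
            unsigned int cpu = smp_processor_id();
            unsigned int sibling =
                    cpumask_any_but(topology_sibling_cpumask(cpu), cpu);
            u64 icr;

            if (sibling >= nr_cpu_ids)
                    return;         /* no SMT sibling on this core */

            /*
             * The ICR value hardware will use to kick a vCPU running on
             * our sibling: fixed IPI, physical destination = sibling's
             * APIC ID, delivered on a host-owned wakeup vector.
             */
            icr = ((u64)per_cpu(x86_cpu_to_apicid, sibling) << 32) |
                  APIC_DM_FIXED | WAKEUP_VECTOR;
            wrmsrl(MSR_AMD64_HLT_WAKEUP_ICR, icr);
    }

    /* e.g. at feature setup: on_each_cpu(set_hlt_wakeup_icr, NULL, 1); */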

The feature is supported on AMD EPYC Zen 4 and later CPUs.

For more information, see "15.36.17 Side-Channel Protection",
"SMT Protection", in:
"AMD64 Architecture Programmer's Manual Volume 2: System Programming Part 2,
Pub. 24593 Rev. 3.42 - March 2024"
available here:
https://bugzilla.kernel.org/attachment.cgi?id=306250

See the end of this message for the QEMU hack that calls the Linux
Core Scheduling prctl() to create a unique per-vCPU cookie, which
ensures the vCPU process will not be scheduled if anything else is
running on the sibling thread of the core.
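
A cookie created this way can be checked from the host with the
existing PR_SCHED_CORE_GET prctl: with the hack applied, each vCPU
thread (TIDs are under /proc/<qemu-pid>/task/) should report a
distinct, non-zero cookie.  A trivial standalone checker, included
here only as an example, not part of the patches:

    /* check-cookie.c: print the core scheduling cookie of a TID. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/prctl.h>      /* PR_SCHED_CORE*, needs v5.14+ headers */

    int main(int argc, char **argv)
    {
            unsigned long long cookie = 0;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <tid>\n", argv[0]);
                    return 1;
            }

            if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, atoi(argv[1]),
                      PR_SCHED_CORE_SCOPE_THREAD, (unsigned long)&cookie)) {
                    perror("PR_SCHED_CORE_GET");
                    return 1;
            }

            printf("tid %s: cookie %#llx\n", argv[1], cookie);
            return 0;
    }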

As it turns out, this approach is insufficient: existing Core
Scheduling semantics only prevent other userspace tasks from
running on the sibling thread that hardware requires to be in the
idle state.

Because of this, the vCPU's VMRUN frequently exits with
"IDLE_REQUIRED" whenever the scheduler runs its "OS noise" (softirq
work, etc.) on the sibling, instead of the sibling staying in the
hardware idle state for the duration of VMRUN.

Light testing yields eventual RCU CPU stalls in the guest (minutes
after boot):

[ C0] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ C0] rcu: 1-...!: (0 ticks this GP) idle=8d58/0/0x0 softirq=12830/12830 fqs=0 (false positive?)
[ C0] rcu: (detected by 0, t=16253 jiffies, g=12377, q=12 ncpus=2)
[ C0] rcu: rcu_preempt kthread timer wakeup didn't happen for 16252 jiffies! g12377 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ C0] rcu: Possible timer handling issue on cpu=1 timer-softirq=15006
[ C0] rcu: rcu_preempt kthread starved for 16253 jiffies! g12377 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ C0] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.

...along with the occasional "NOHZ tick-stop error: local softirq
work is pending, handler #200!!!" on the host.

However, this RFC represents only one of three approaches attempted:

- Another brute-force approach simply called remove_cpu() on the
  sibling before, and add_cpu() after, __svm_sev_es_vcpu_run() in
  svm_vcpu_enter_exit() (rough sketch after this list).  The effort
  was quickly abandoned because of the locking and atomicity issues
  it immediately runs into:

BUG: scheduling while atomic: qemu-system-x86/6743/0x00000002
4 locks held by qemu-system-x86/6743:
#0: ff160079b2dd80b8 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x94/0xa40 [kvm]
#1: ffffffffba3c5410 (device_hotplug_lock){....}-{3:3}, at: lock_device_hotplug+0x1b/0x30
#2: ff16009838ff5398 (&dev->mutex){....}-{3:3}, at: device_offline+0x9c/0x120
#3: ffffffffb9e7e6b0 (cpu_add_remove_lock){....}-{3:3}, at: cpu_device_down+0x24/0x50

- The third approach attempted to forward-port vCPU Core Scheduling
  from the original 4.18-based work by Peter Z.:

  https://github.com/pdxChen/gang/commits/sched_1.23-base

  K. Prateek Nayak provided enough guidance to get me past host
  lockups from "kvm,sched: Track VCPU threads", but the subsequent
  "sched: Add VCPU aware SMT scheduling" commit proved insurmountable
  to forward-port, given the complex changes to scheduler internals
  since then.
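
For reference, that hotplug variant amounted to roughly the
following (heavily simplified; sibling lookup and error handling
elided), which is what produced the "scheduling while atomic" splat
above, since remove_cpu()/add_cpu() sleep and take the hotplug
locks from within a non-preemptible guest-entry path:

    /* Sketch of the abandoned approach, not being proposed: */
    static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu,
                                            bool spec_ctrl_intercepted)
    {
            struct vcpu_svm *svm = to_svm(vcpu);
            unsigned int cpu = smp_processor_id();
            unsigned int sibling =
                    cpumask_any_but(topology_sibling_cpumask(cpu), cpu);

            remove_cpu(sibling);   /* sleeps, takes device_hotplug_lock... */

            __svm_sev_es_vcpu_run(svm, spec_ctrl_intercepted);

            add_cpu(sibling);      /* ...all from a non-preemptible path */
    }
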
Comments welcome:
- Are any of these three approaches even close to an
upstream-acceptable solution to support SMT Protection?
- Given the feature's strict sibling idle state constraints,
should SMT Protection even be supported at all?
This RFC applies to kvm-x86/next kvm-x86-next-2025.07.21 (33f843444e28).

QEMU hack:

From 0278a4078933d9bce16a8e80f415466b44244a59 Mon Sep 17 00:00:00 2001
From: Kim Phillips <kim.phillips@....com>
Date: Wed, 2 Apr 2025 16:02:50 -0500
Subject: [RFC PATCH] system/cpus: Affine and Core-Schedule vCPUs onto pCPUs
DO NOT MERGE.
Hack to experiment with supporting the SEV-SNP "SMT Protection" feature. It:
1. Affines vCPUs to individual core pCPUs (as cpu_index increments
over single-core threads 1, 2, etc.),
2. Calls the Linux Core Scheduler prctl syscall to create a per-vCPU
unique cookie to ensure the vCPU process will not be scheduled
if there is anything else on the sibling thread of the pCPU core.
Note: It contains POSIX-specific code that really belongs in
util/qemu-thread-posix.c, and other hackery.
Signed-off-by: Kim Phillips <kim.phillips@....com>
---
accel/kvm/kvm-accel-ops.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/accel/kvm/kvm-accel-ops.c b/accel/kvm/kvm-accel-ops.c
index c239dfc87a..4b853d3024 100644
--- a/accel/kvm/kvm-accel-ops.c
+++ b/accel/kvm/kvm-accel-ops.c
@@ -26,9 +26,12 @@
#include <linux/kvm.h>
#include "kvm-cpus.h"
+#include <sys/prctl.h> /* PR_SCHED_CORE_CREATE */
+
static void *kvm_vcpu_thread_fn(void *arg)
{
CPUState *cpu = arg;
+ cpu_set_t cpuset;
int r;
rcu_register_thread();
@@ -38,6 +41,16 @@ static void *kvm_vcpu_thread_fn(void *arg)
cpu->thread_id = qemu_get_thread_id();
current_cpu = cpu;
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpu->cpu_index, &cpuset);
+ pthread_setaffinity_np(cpu->thread->thread, sizeof(cpu_set_t), &cpuset);
+
+ r = prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, 0, 0);
+ if (r) {
+ printf("%s %d: CORE CREATE ret %d \r\n", __func__, __LINE__, r);
+ exit(1);
+ }
+
r = kvm_init_vcpu(cpu, &error_fatal);
kvm_init_cpu_signals(cpu);
--
2.43.0
Kim Phillips (1):
KVM: SEV: Add support for SMT Protection
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/svm.h | 1 +
arch/x86/include/uapi/asm/svm.h | 1 +
arch/x86/kvm/svm/sev.c | 17 +++++++++++++++++
arch/x86/kvm/svm/svm.c | 3 +++
6 files changed, 24 insertions(+)
base-commit: 33f843444e28920d6e624c6c24637b4bb5d3c8de
--
2.43.0