Message-ID: <Z7-qtIjS0bdc5J0r@google.com>
Date: Wed, 26 Feb 2025 23:58:44 +0000
From: Sean Christopherson <seanjc@...gle.com>
To: Zheyun Shen <szy0127@...u.edu.cn>
Cc: Tom Lendacky <thomas.lendacky@....com>, pbonzini@...hat.com, tglx@...utronix.de,
Kevin Loughlin <kevinloughlin@...gle.com>, mingo@...hat.com, bp@...en8.de,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v7 3/3] KVM: SVM: Flush cache only on CPUs running SEV guest
On Wed, Feb 26, 2025, Zheyun Shen wrote:
> I'm very sorry that the formatting of my previous email was messed up due to
> an issue with the email client. I am sending a new email with the same
> content.
Something is still not quite right, as your mails aren't hitting lore.kernel.org,
i.e. are getting dropped by the lists.
> > Sean Christopherson <seanjc@...gle.com> wrote:
> >
> > On Tue, Jan 28, 2025, Zheyun Shen wrote:
> > Changed the time of recording the CPUs from pre_sev_run() to vcpu_load().
> >
> > Why? If there's a good reason, then that absolutely, positively belongs in the
> > changelog and in the code as a comment. If there's no good reason, then...
> >
> The reason I moved the recording of CPUs from pre_sev_run() to vcpu_load()
> is that vcpu_load() is always present in the call path of
> kvm_arch_vcpu_ioctl_run(). Moreover, whenever a vCPU migration occurs,
> control flow reaches vcpu_load() again, which keeps the CPU recording
> correct. On the other hand, recording the information in pre_sev_run()
> would record the CPU number every time before entering the guest. Without
> vCPU migration, only the first record is effective; the subsequent records
> are redundant and waste time. This would make each VM exit take longer
> (although the additional time may be very short).
So long as KVM performs a lockless test to see if the CPU is already set in
the mask, the cost should be minimal. This:
	if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
		cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
Generates a bt+jb to guard the locked bts, i.e. *should* add only 1-2 uops to
the entry flow once the CPU's predictors have warmed up enough to not prefetch
the RMW and bounce the cache line.
0x0000000000016603 <+67>: bt %r8,0xa428(%rdx)
0x000000000001660b <+75>: jb 0x16616 <pre_sev_run+86>
0x000000000001660d <+77>: lock bts %r8,0xa428(%rdx)
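
For reference, here's roughly where that recording could land in pre_sev_run().
Only the mask update is from this thread; the surrounding function body is
elided and the placement is a sketch, not the final patch:

	static void pre_sev_run(struct vcpu_svm *svm, int cpu)
	{
		struct kvm *kvm = svm->vcpu.kvm;

		/*
		 * Record that this pCPU has entered the guest with one of
		 * the VM's ASIDs.  Test before setting so the common case
		 * stays a read-only check and the locked RMW only happens
		 * the first time a given pCPU runs this VM.
		 */
		if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
			cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);

		/* ... existing ASID assignment and TLB flush handling ... */
	}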
If it turns out that marking a CPU as having run that ASID becomes a hot spot,
then I'm definitely open to moving things around. But for a first go at this,
and especially without evidence that it's problematic, I want to go with the
approach that's most obviously correct.
> > Unless I hear otherwise, my plan is to move this back to pre_sev_run().
> >
>
> Another issue in the v3 version is that I incorrectly cleared the recorded
> mask after each cache flush. The mask bits should be cleared and updated
> at the time of vCPU migration rather than after a cache flush.
The bits can't be cleared at vCPU migration, they can only be cleared when a
flush is issued. A vCPU that did VMRUN with an ASID in the past could still
have dirty data cached for that ASID.
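
To tie that together, a minimal sketch of the flush side under the sticky-mask
scheme. The sev_do_wbinvd() and sev_flush_caches() names are hypothetical;
only have_run_cpus and to_kvm_sev_info() come from the patch:

	/* IPI callback: writeback-invalidate this CPU's caches. */
	static void sev_do_wbinvd(void *unused)
	{
		wbinvd();
	}

	static void sev_flush_caches(struct kvm *kvm)
	{
		/*
		 * Flush only the CPUs that have ever done VMRUN with one of
		 * this VM's ASIDs.  The mask is sticky: a CPU that ran the
		 * VM in the past may still hold dirty cache lines for the
		 * ASID, so bits can be cleared only after a flush has
		 * actually been issued (and the simplest first cut never
		 * clears them at all).
		 */
		on_each_cpu_mask(to_kvm_sev_info(kvm)->have_run_cpus,
				 sev_do_wbinvd, NULL, 1);
	}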
KVM could send an IPI to the previous CPU to do a cache flush on migration,
similar to what KVM already does on VMX to VMCLEAR the VMCS. But the math and
usage for WB(NO)INVD are wildly different. For VMCLEAR, the IPI approach is the
lazy approach; KVM *must* VMCLEAR the VMCS if the CPU changes, using an IPI to
defer VMCLEAR allows KVM to skip doing VMCLEAR whenever a vCPU is scheduled out.
For WB(NO)INVD, the IPI approach would be an eager approach. Deferring WB(NO)INVD
until it's necessary allows KVM to skip the WB(NO)INVD when a vCPU is migrated.
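
Purely for contrast, the eager variant would look something like this in
svm_vcpu_load() (hypothetical, not from the patch):

	/*
	 * Hypothetical eager flush: writeback the previous pCPU's caches as
	 * soon as the vCPU moves, mirroring how VMX sends an IPI to VMCLEAR
	 * a VMCS that's loaded on a different pCPU.
	 */
	if (vcpu->cpu != -1 && vcpu->cpu != cpu)
		wbinvd_on_cpu(vcpu->cpu);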
In short, no optimization is obviously better than another at this point. E.g.
if a VM is not hard pinned 1:1, but vCPUs are affined to certain cores, then
flushing on migration is a terrible idea because the total set of CPUs that need
to flush caches is more or less static.
Anyways, as above, I definitely want to go with the simplest implementation
(make the bitmap "sticky"), and then iterate, optimize, and add complexity if and
only if it's truly justified.