Message-ID: <Z7-3K-CXnoqHhmgC@google.com>
Date: Wed, 26 Feb 2025 16:51:55 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Maxim Levitsky <mlevitsk@...hat.com>
Cc: James Houghton <jthoughton@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>,
David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>, Marc Zyngier <maz@...nel.org>,
Oliver Upton <oliver.upton@...ux.dev>, Wei Xu <weixugc@...gle.com>, Yu Zhao <yuzhao@...gle.com>,
Axel Rasmussen <axelrasmussen@...gle.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
On Wed, Feb 26, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-25 at 16:50 -0800, Sean Christopherson wrote:
> > On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> > What if we make the assertion user controllable? I.e. let the user opt-out (or
> > off-by-default and opt-in) via command line? We did something similar for the
> > rseq test, because the test would run far fewer iterations than expected if the
> > vCPU task was migrated to CPU(s) in deep sleep states.
> >
> > TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2),
> > "Only performed %d KVM_RUNs, task stalled too much?\n\n"
> > " Try disabling deep sleep states to reduce CPU wakeup latency,\n"
> > " e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n"
> > " or run with -u to disable this sanity check.", i);
> >
> > This is quite similar, because as you say, it's impractical for the test to account
> > for every possible environmental quirk.
>
> No objections in principle, especially if the sanity check is skipped by default,
> although that does sort of defeat the purpose of the check.
> I guess the check might still be useful for developers.
A middle ground would be to enable the check by default if NUMA balancing is off.
We can always revisit the default setting if it turns out there are other problematic
"features".
> > > > Aha! I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > > > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > > > all of guest memory. If that's what's happening, then a better solution for the
> > > > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > > > pin it to a single pCPU?).
> > >
> > > Nope. I pinned main thread to CPU 0 and VM thread to CPU 1 and the problem
> > > persists. On 6.13, the only way to make the test consistently work is to
> > > disable NUMA balancing.
> >
> > Well that's odd. While I'm quite curious as to what's happening,
Gah, chatting about this offline jogged my memory. NUMA balancing doesn't zap
(mark PROT_NONE/PROT_NUMA) PTEs for pages the kernel thinks are being accessed
remotely, it zaps PTEs to see if they're being accessed remotely. So yeah,
whenever NUMA balancing kicks in, the guest will see a large amount of its memory
get re-faulted.
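If anyone wants to confirm that's what is happening, the NUMA balancing
counters in /proc/vmstat (present when CONFIG_NUMA_BALANCING=y) make it easy
to see from userspace; a standalone snippet along these lines, sampled before
and after the test, should show numa_pte_updates and numa_hint_faults jumping:

#include <stdio.h>
#include <string.h>

/* Grab a single counter from /proc/vmstat, 0 if it isn't there. */
static unsigned long vmstat_read(const char *key)
{
	char name[64];
	unsigned long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %lu", name, &val) == 2) {
		if (!strcmp(name, key)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return 0;
}

int main(void)
{
	/* PTEs made PROT_NUMA vs. hinting faults taken to re-establish them. */
	printf("numa_pte_updates: %lu\n", vmstat_read("numa_pte_updates"));
	printf("numa_hint_faults: %lu\n", vmstat_read("numa_hint_faults"));
	return 0;
}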
Which is why it's such a terrible feature to pair with KVM, at least as-is. NUMA
balancing is predicated on inducing and resolving the #PF being relatively cheap,
but that doesn't hold true for secondary MMUs due to the coarse nature of mmu_notifiers.
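To illustrate the coarseness (heavily simplified, illustrative kernel-module
sketch, not KVM's actual code, and the mmu_notifier details vary a bit by
kernel version): all a secondary MMU gets is "something in [start, end)
changed", so it has to throw away its mappings for the whole range and eat its
full fault path when the guest touches those pages again.

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

static struct mmu_notifier demo_mn;

static int demo_invalidate_range_start(struct mmu_notifier *mn,
				       const struct mmu_notifier_range *range)
{
	/*
	 * A real secondary MMU (e.g. KVM) would zap its shadow/TDP mappings
	 * for the entire range here, even if only a handful of PTEs were
	 * actually made PROT_NUMA.
	 */
	pr_info("secondary MMU must invalidate %lu pages\n",
		(range->end - range->start) >> PAGE_SHIFT);
	return 0;
}

static const struct mmu_notifier_ops demo_ops = {
	.invalidate_range_start = demo_invalidate_range_start,
};

static int __init demo_init(void)
{
	demo_mn.ops = &demo_ops;
	/* Subscribe to the address space of the task loading the module. */
	return mmu_notifier_register(&demo_mn, current->mm);
}

static void __exit demo_exit(void)
{
	mmu_notifier_unregister(&demo_mn, demo_mn.mm);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");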