linux-kernel - Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <07788b85473e24627131ffe1a8d1d01856dd9cb5.camel@redhat.com>
Date: Tue, 25 Feb 2025 17:00:00 -0500
From: Maxim Levitsky <mlevitsk@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: James Houghton <jthoughton@...gle.com>, Paolo Bonzini
 <pbonzini@...hat.com>,  David Matlack <dmatlack@...gle.com>, David Rientjes
 <rientjes@...gle.com>, Marc Zyngier <maz@...nel.org>,  Oliver Upton
 <oliver.upton@...ux.dev>, Wei Xu <weixugc@...gle.com>, Yu Zhao
 <yuzhao@...gle.com>, Axel Rasmussen <axelrasmussen@...gle.com>,
 kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly

On Tue, 2025-02-18 at 17:13 -0800, Sean Christopherson wrote:
> On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > > waiting for aging. This contention reduction improves guest performance
> > > and saves a significant amount of Google Cloud's CPU usage, and it has
> > > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> > > 
> > > Please see v8[8] for some performance results using
> > > access_tracking_perf_test patched to use MGLRU.
> > > 
> > > Neither access_tracking_perf_test nor mmu_stress_test trigger any
> > > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
> > 
> > Hi, I have a question about this patch series and about the
> > access_tracking_perf_test:
> > 
> > Some time ago, I investigated a failure in access_tracking_perf_test which
> > shows up in our CI.
> > 
> > The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> > MGLRU is enabled, and overall I got the impression that MGLRU is not
> > compatible with idle page tracking.
> > 
> > I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> > MMUs participate in MM_WALK' patch series could address this but the test
> > still fails.
> > 
> > 
> > For the reference the exact problem is:
> > 
> > 1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap
> > 
> > 2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.
> > 
> > 3. A NUMA autobalance code write protects the guest memory. KVM in response
> >    evicts the SPTE mappings with A/D bit set, and while doing so tells mm
> >    that pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*) )
> >    but due to MLGRU the call doesn't clear the idle bit and thus all the traces
> >    of the guest access disappear and the kernel thinks that the page is still idle.
> > 
> > I can say that the root cause of this is that folio_mark_accessed doesn't do
> > what it supposed to do.
> > 
> > Calling 'folio_clear_idle(folio);' in MLGRU case in folio_mark_accessed()
> > will probably fix this but I don't have enough confidence to say if this is
> > all that is needed to fix this.  If this is the case I can send a patch.
> 
> My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
> effectively isn't supported by MGLRU.
> 
> [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

Hi,

Reading this mail makes me think that the page idle interface isn't really used anymore.

Maybe we should redo the access_tracking_perf_test to only use the MGLRU specific interfaces/mode,
and remove its classical page_idle mode altogher?

The point I am trying to get across is that currently access_tracking_perf_test main 
purpose is to test that page_idle works with secondary paging and the fact is that it doesn't work 
well due to more that one reason:

The mere fact that we don't flush TLB already necessitated hacks like the 90% check,
which for example doesn't work nested so another hack was needed, to skip the check
completely when hypervisor is detected, etc, etc.

And now as of 6.13, we don't propagate accessed bit when KVM zaps the SPTE at all, 
which can happen at least in theory due to other reasons than NUMA balancing.


Tomorrow there will be something else that will cause KVM to zap the SPTEs, and the test will fail again,
and again...

What do you think?


> 
> > This patch makes the test pass (but only on 6.12 kernel and below, see below):
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 59f30a981c6f..2013e1f4d572 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
> >  {
> >         if (lru_gen_enabled()) {
> >                 folio_inc_refs(folio);
> > -               return;
> > +               goto clear_idle_bit;
> >         }
> >  
> >         if (!folio_test_referenced(folio)) {
> > @@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
> >                 folio_clear_referenced(folio);
> >                 workingset_activation(folio);
> >         }
> > +clear_idle_bit:
> >         if (folio_test_idle(folio))
> >                 folio_clear_idle(folio);
> >  }
> > 
> > 
> > To always reproduce this, it is best to use a patch to make the test run in a
> > loop, like below (although the test fails without this as well).
> 
> ..
> 
> > With the above patch applied, you will notice after 4-6 iterations that the
> > number of still idle pages soars:
> > 
> > Populating memory             : 0.798882357s
> 
> ...
> 
> > vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
> > Mark memory idle              : 2.711946690s
> > Writing to idle memory        : 0.302882502s
> > 
> > ...
> > 
> > (*) Turns out that since kernel 6.13, this code that sets accessed bit in the
> > primary paging structure, when the secondary was zapped was *removed*. I
> > bisected this to commit:
> > 
> > 66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
> > 
> > So now the access_tracking_test is broken regardless of MGLRU.
> 
> Just to confirm, do you see failures on 6.13 with MGLRU disabled?  

Yes. The test always fails.

> 
> > Any ideas on how to fix all this mess?
> 
> The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
> enabled.  In a real-world scenario, if the guest is actually accessing the pages
> that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
> faulted back in.  There's a window where page_idle/bitmap could be read between
> making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
> one of the many flaws of NUMA balancing.
> 
> That said, one thing is quite odd.  In the failing case, *half* of the guest pages
> are still idle.  That's quite insane.
> 
> Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> different node, and that causes NUMA balancing to go crazy and zap pretty much
> all of guest memory.  If that's what's happening, then a better solution for the
> NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> pin it to a single pCPU?).

Nope. I pinned main thread to  CPU 0 and VM thread to  CPU 1 and the problem persists.
On 6.13, the only way to make the test consistently work is to disable NUMA balancing.


Best regards,
	Maxim Levitsky