Message-ID: <4c605b4e395a3538d9a2790918b78f4834912d72.camel@redhat.com>
Date: Wed, 26 Feb 2025 13:39:02 -0500
From: Maxim Levitsky <mlevitsk@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: James Houghton <jthoughton@...gle.com>, Paolo Bonzini
 <pbonzini@...hat.com>,  David Matlack <dmatlack@...gle.com>, David Rientjes
 <rientjes@...gle.com>, Marc Zyngier <maz@...nel.org>,  Oliver Upton
 <oliver.upton@...ux.dev>, Wei Xu <weixugc@...gle.com>, Yu Zhao
 <yuzhao@...gle.com>, Axel Rasmussen <axelrasmussen@...gle.com>,
 kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly

On Tue, 2025-02-25 at 16:50 -0800, Sean Christopherson wrote:
> On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-18 at 17:13 -0800, Sean Christopherson wrote:
> > > My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
> > > effectively isn't supported by MGLRU.
> > > 
> > > [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com
> > 
> > Hi,
> > 
> > Reading this mail makes me think that the page idle interface isn't really
> > used anymore.
> 
> I'm sure it's still used in production somewhere.  And even if it's being phased
> out in favor of MGLRU, it's still super useful for testing purposes, because it
> gives userspace much more direct control over aging.

I also think that page_idle is used somewhere in production, and it probably works
more or less correctly with regular non-VM processes, although I have no idea how well
it works when MGLRU is enabled.

My point was that using page_idle to track guest memory is probably not done in
practice, because it doesn't work that well and nobody seems to complain.
I'm not asking for it to be removed, however, although a deprecation note might
be worth adding if really nobody uses it.
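
For reference, the interface in question is /sys/kernel/mm/page_idle/bitmap, a
PFN-indexed bitmap: userspace sets bits to mark pages idle, and the kernel clears
a bit once the corresponding page is accessed. A rough sketch of the classic flow
(error handling omitted; looking up the PFN via /proc/<pid>/pagemap is assumed and
not shown):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Mark one page idle, then check later whether it is still idle.
 * Returns 1 if still idle, 0 if it was accessed in the meantime.
 */
static int page_still_idle(uint64_t pfn)
{
	int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
	uint64_t bits = 1ULL << (pfn % 64);
	off_t off = (pfn / 64) * sizeof(uint64_t);
	int idle;

	/* Writing a set bit marks the page idle (clears its Accessed bits). */
	pwrite(fd, &bits, sizeof(bits), off);

	/* ... let the workload run ... */

	/* On read, a cleared bit means the page was accessed. */
	pread(fd, &bits, sizeof(bits), off);
	idle = !!(bits & (1ULL << (pfn % 64)));
	close(fd);
	return idle;
}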

> 
> > Maybe we should redo the access_tracking_perf_test to only use the MGLRU-specific
> > interfaces/mode, and remove its classical page_idle mode altogether?
> 
> I don't want to take a hard dependency on MGLRU (unless page_idle gets fully
> deprecated/removed by the kernel), and I also don't think page_idle is the main
> problem with the test.
>    
> > The point I am trying to get across is that currently
> > access_tracking_perf_test's main purpose is to test that page_idle works with
> > secondary paging, and the fact is that it doesn't work well for more than
> > one reason:
> 
> The primary purpose of the test is to measure performance.  Asserting that 90%+
> pages were dirtied is a sanity check, not an outright goal.

From my POV, a performance test can't really be a selftest unless it actually fails
when it detects unusually low performance.

Otherwise, who is going to be alarmed when a regression happens and
things actually get slower?
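
Even something as crude as an opt-in lower bound on throughput would make it a real
test in that sense. Purely as an illustration (the -m option, min_pages_per_sec,
pages and elapsed_sec below are made up, not existing code):

	/* Hypothetical: fail if aging throughput drops below a user-given floor,
	 * e.g. one passed via a new -m command line option.
	 */
	if (min_pages_per_sec) {
		double pages_per_sec = (double)pages / elapsed_sec;

		TEST_ASSERT(pages_per_sec >= min_pages_per_sec,
			    "Aged only %.0f pages/sec, expected at least %.0f (-m)",
			    pages_per_sec, min_pages_per_sec);
	}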

> 
> > The mere fact that we don't flush the TLB already necessitated hacks like the 90%
> > check, which, for example, doesn't work nested, so another hack was needed to
> > skip the check completely when a hypervisor is detected, etc., etc.
> 
> 100% agreed here.
> 
> > And now, as of 6.13, we don't propagate the accessed bit when KVM zaps the SPTE at
> > all, which can happen, at least in theory, for reasons other than NUMA balancing.
> > 
> > Tomorrow there will be something else that will cause KVM to zap the SPTEs,
> > and the test will fail again, and again...
> > 
> > What do you think?
> 
> What if we make the assertion user controllable?  I.e. let the user opt-out (or
> off-by-default and opt-in) via command line?  We did something similar for the
> rseq test, because the test would run far fewer iterations than expected if the
> vCPU task was migrated to CPU(s) in deep sleep states.
> 
> 	TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2),
> 		    "Only performed %d KVM_RUNs, task stalled too much?\n\n"
> 		    "  Try disabling deep sleep states to reduce CPU wakeup latency,\n"
> 		    "  e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n"
> 		    "  or run with -u to disable this sanity check.", i);
> 
> This is quite similar, because as you say, it's impractical for the test to account
> for every possible environmental quirk.

No objections in principle, especially if the sanity check is skipped by default,
although that sort of defeats the purpose of the check.
I guess the check might still be useful for developers.
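
For concreteness, something along these lines for access_tracking_perf_test (untested
sketch; the -s option, skip_sanity_check, and the other variable names are
placeholders, not existing code):

	static bool skip_sanity_check;	/* set by a new -s command line option */

	/* Gate the existing "90%+ of pages were accessed" sanity check behind
	 * the opt-out, mirroring the rseq test above.
	 */
	TEST_ASSERT(skip_sanity_check || still_idle_pages < pages / 10,
		    "vCPU%d: %lu of %lu pages still idle; NUMA balancing, nested\n"
		    "  virt, etc. may be interfering -- re-run with -s to skip\n"
		    "  this sanity check.",
		    vcpu_idx, still_idle_pages, pages);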


> 
> > > Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > > all of guest memory.  If that's what's happening, then a better solution for the
> > > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > > pin it to a single pCPU?).
> > 
> > Nope. I pinned the main thread to CPU 0 and the VM thread to CPU 1, and the problem
> > persists.  On 6.13, the only way to make the test consistently work is to
> > disable NUMA balancing.
> 
> Well that's odd.  While I'm quite curious as to what's happening, my stance is
> that enabling NUMA balancing with KVM is a terrible idea, so my vote is to sweep
> it under the rug and let the user disable the sanity check.
> 

One thing is for sure: with NUMA balancing off, the test passes reliably (it shows on
average around 100-200 idle pages), and I have run it for a long time.
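
For reference, automatic NUMA balancing can be toggled at runtime via
/proc/sys/kernel/numa_balancing, e.g. from a test setup helper (just an
illustration, equivalent to "echo 0 > /proc/sys/kernel/numa_balancing"):

	int fd = open("/proc/sys/kernel/numa_balancing", O_WRONLY);

	write(fd, "0", 1);	/* 0 = disabled, 1 = enabled */
	close(fd);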


Best regards,
	Maxim Levitsky


