[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4c605b4e395a3538d9a2790918b78f4834912d72.camel@redhat.com>
Date: Wed, 26 Feb 2025 13:39:02 -0500
From: Maxim Levitsky <mlevitsk@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: James Houghton <jthoughton@...gle.com>, Paolo Bonzini
<pbonzini@...hat.com>, David Matlack <dmatlack@...gle.com>, David Rientjes
<rientjes@...gle.com>, Marc Zyngier <maz@...nel.org>, Oliver Upton
<oliver.upton@...ux.dev>, Wei Xu <weixugc@...gle.com>, Yu Zhao
<yuzhao@...gle.com>, Axel Rasmussen <axelrasmussen@...gle.com>,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
On Tue, 2025-02-25 at 16:50 -0800, Sean Christopherson wrote:
> On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-18 at 17:13 -0800, Sean Christopherson wrote:
> > > My understanding is that the behavior is deliberate. Per Yu[1], page_idle/bitmap
> > > effectively isn't supported by MGLRU.
> > >
> > > [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com
> >
> > Hi,
> >
> > Reading this mail makes me think that the page idle interface isn't really
> > used anymore.
>
> I'm sure it's still used in production somewhere. And even if it's being phased
> out in favor of MGLRU, it's still super useful for testing purposes, because it
> gives userspace much more direct control over aging.
I also think that page_idle is used somewhere in production, and it probably works
more or less correctly with regular non VM processes, although I have no idea how well it works
when MGLRU is enabled.
My point was that using page_idle to track guest memory is something that is probably
not used because it doesn't work that well, and nobody seems to complain.
However I don't ask for it to be removed, although a note of deprecation might
be worth it if really nobody uses it.
>
> > Maybe we should redo the access_tracking_perf_test to only use the MGLRU
> > specific interfaces/mode, and remove its classical page_idle mode altogher?
>
> I don't want to take a hard dependency on MGLRU (unless page_idle gets fully
> deprecated/removed by the kernel), and I also don't think page_idle is the main
> problem with the test.
>
> > The point I am trying to get across is that currently
> > access_tracking_perf_test main purpose is to test that page_idle works with
> > secondary paging and the fact is that it doesn't work well due to more that
> > one reason:
>
> The primary purpose of the test is to measure performance. Asserting that 90%+
> pages were dirtied is a sanity check, not an outright goal.
>From my POV, a performance test can't really be a selftest unless it actually fails
when it detects an unusually low performance.
Otherwise who is going to be alarmed when a regression happens and
things actually get slower?
>
> > The mere fact that we don't flush TLB already necessitated hacks like the 90%
> > check, which for example doesn't work nested so another hack was needed, to
> > skip the check completely when hypervisor is detected, etc, etc.
>
> 100% agreed here.
>
> > And now as of 6.13, we don't propagate accessed bit when KVM zaps the SPTE at
> > all, which can happen at least in theory due to other reasons than NUMA balancing.
> >
> > Tomorrow there will be something else that will cause KVM to zap the SPTEs,
> > and the test will fail again, and again...
> >
> > What do you think?
>
> What if we make the assertion user controllable? I.e. let the user opt-out (or
> off-by-default and opt-in) via command line? We did something similar for the
> rseq test, because the test would run far fewer iterations than expected if the
> vCPU task was migrated to CPU(s) in deep sleep states.
>
> TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2),
> "Only performed %d KVM_RUNs, task stalled too much?\n\n"
> " Try disabling deep sleep states to reduce CPU wakeup latency,\n"
> " e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n"
> " or run with -u to disable this sanity check.", i);
>
> This is quite similar, because as you say, it's impractical for the test to account
> for every possible environmental quirk.
No objections in principle, especially if sanity check is skipped by default,
although this does sort of defeats the purpose of the check.
I guess that the check might still be used for developers.
>
> > > Aha! I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > > all of guest memory. If that's what's happening, then a better solution for the
> > > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > > pin it to a single pCPU?).
> >
> > Nope. I pinned main thread to CPU 0 and VM thread to CPU 1 and the problem
> > persists. On 6.13, the only way to make the test consistently work is to
> > disable NUMA balancing.
>
> Well that's odd. While I'm quite curious as to what's happening, my stance is
> that enabling NUMA balancing with KVM is a terrible idea, so my vote is to sweep
> it under the rug and let the user disable the sanity check.
>
One thing for sure, with NUMA balancing off, the test passes well (shows on average
around 100-200 idle pages) and I have run it for a long time.
Best regards,
Maxim Levitsky
Powered by blists - more mailing lists