Message-ID: <CANRm+CzPE+7UVtQuT-R9kfh5NJYx5h9j=-if4fUM-9M9xHjX0Q@mail.gmail.com>
Date: Tue, 18 Nov 2025 22:19:56 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: Christian Borntraeger <borntraeger@...ux.ibm.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Ilya Leoshkevich <iii@...ux.ibm.com>, Mete Durlu <meted@...ux.ibm.com>, Axel Busch <axel.busch@....com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
oversubscribed KVM
Hi Christian,
On Tue, 18 Nov 2025 at 16:12, Christian Borntraeger
<borntraeger@...ux.ibm.com> wrote:
>
> Am 12.11.25 um 06:01 schrieb Wanpeng Li:
> > Hi Christian,
> >
> > On Mon, 10 Nov 2025 at 20:02, Christian Borntraeger
> > <borntraeger@...ux.ibm.com> wrote:
> >>
> >> Am 10.11.25 um 04:32 schrieb Wanpeng Li:
> >>> From: Wanpeng Li <wanpengli@...cent.com>
> >>>
> >>> This series addresses long-standing yield_to() inefficiencies in
> >>> virtualized environments through two complementary mechanisms: a vCPU
> >>> debooster in the scheduler and IPI-aware directed yield in KVM.
> >>>
> >>> Problem Statement
> >>> -----------------
> >>>
> >>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> >>> held by other vCPUs that are not currently running. The kernel's
> >>> paravirtual spinlock support detects these situations and calls yield_to()
> >>> to boost the lock holder, allowing it to run and release the lock.
> >>>
> >>> However, the current implementation has two critical limitations:
> >>>
> >>> 1. Scheduler-side limitation:
> >>>
> >>> yield_to_task_fair() relies solely on set_next_buddy() to provide
> >>> preference to the target vCPU. This buddy mechanism only offers
> >>> immediate, transient preference. Once the buddy hint expires (typically
> >>> after one scheduling decision), the yielding vCPU may preempt the target
> >>> again, especially in nested cgroup hierarchies where vruntime domains
> >>> differ.
> >>>
> >>> This creates a ping-pong effect: the lock holder runs briefly, gets
> >>> preempted before completing critical sections, and the yielding vCPU
> >>> spins again, triggering another futile yield_to() cycle. The overhead
> >>> accumulates rapidly in workloads with high lock contention.
> >>
> >> I can certainly confirm that on s390 we do see that yield_to does not always
> >> work as expected. Our spinlock code is lock-holder aware so our KVM always yields
> >> correctly, but often enough the hint is ignored or bounced back as you describe.
> >> So I am certainly interested in that part.
> >>
> >> I need to look more closely into the other part.
> >
> > Thanks for the confirmation and interest! It's valuable to hear that
> > s390 observes similar yield_to() behavior where the hint gets ignored
> > or bounced back despite correct lock holder identification.
> >
> > Since your spinlock code is already lock-holder-aware and KVM yields
> > to the correct target, the scheduler-side improvements (patches 1-5)
> > should directly address the ping-pong issue you're seeing. The
> > vruntime penalties are designed to sustain the preference beyond the
> > transient buddy hint, which should reduce the bouncing effect.
>
> So we will play a bit with the first patches and check for performance improvements.
>
> I am curious, I did a quick unit test with 2 CPUs ping-ponging on a counter, and I do
> see more yield hypercalls than the counter value with that testcase (as before):
> something like 40060000 yields instead of 4000000 for a perfect ping-pong. If I comment
> out your rate limit code I hit exactly the 4000000.
> Can you maybe outline a bit why the rate limit is important and needed?
Good catch! The 10× inflation is actually expected behavior. The key
insight is that the rate limit filters penalty applications, not yield
hypercalls.

In your ping-pong test with 4M counter increments, PLE hardware fires
multiple times per lock acquisition (roughly 10 times based on your
numbers), and each firing triggers kvm_vcpu_on_spin(). Without the
rate limit, every yield immediately applies a vruntime penalty. In a
tight ping-pong this causes over-penalization: the yielding vCPU (the
skip buddy) becomes so deprioritized that it effectively starves,
which paradoxically neutralizes the debooster effect. You see "exactly
4M" not because it's working optimally, but because the excessive
penalties create a pathological equilibrium where subsequent yields
are suppressed by starvation.

With a 6ms rate limit, all 40M hypercalls still occur (PLE still
fires), but only the first yield in each burst applies a penalty while
subsequent ones are filtered. This gives you roughly 4M penalties (one
per actual lock acquisition) instead of 40M, providing a sustained
advantage without over-penalization. The 6ms threshold was empirically
tuned as roughly 2× a typical timeslice, enough to filter intra-lock
PLE bursts while preserving responsiveness to legitimate contention.

Your test validates the design by showing that the rate limit prevents
penalty amplification even in the tightest ping-pong scenario.
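
To make the filtering concrete, here is a minimal, self-contained
sketch of the gate. The struct, field names, and the simulated burst
pattern are hypothetical and only model the logic described above; the
actual series applies the penalty on the scheduler side, this just
counts how many yields would pass a 6ms gate:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical 6ms window, mirroring the threshold discussed above. */
#define DEBOOST_RATE_LIMIT_NS (6ULL * 1000 * 1000)

struct vcpu_deboost {
    uint64_t last_penalty_ns;   /* time of the last penalty applied */
};

/* Return true if this yield should apply a vruntime penalty. */
static bool deboost_allowed(struct vcpu_deboost *d, uint64_t now_ns)
{
    if (now_ns - d->last_penalty_ns < DEBOOST_RATE_LIMIT_NS)
        return false;   /* inside the burst window: filter this yield */
    d->last_penalty_ns = now_ns;
    return true;
}

int main(void)
{
    struct vcpu_deboost d = { 0 };
    uint64_t yields = 0, penalties = 0;

    /* Model 4M lock acquisitions spaced ~10ms apart, each followed by
     * a burst of ~10 PLE-triggered yields within ~1ms. */
    for (uint64_t acq = 0; acq < 4000000ULL; acq++) {
        uint64_t base_ns = (acq + 1) * 10ULL * 1000 * 1000;

        for (int ple = 0; ple < 10; ple++) {
            yields++;
            if (deboost_allowed(&d, base_ns + ple * 100000ULL))
                penalties++;
        }
    }
    printf("yields=%llu penalties=%llu\n",
           (unsigned long long)yields, (unsigned long long)penalties);
    return 0;
}

Compiled standalone this prints yields=40000000 penalties=4000000,
i.e. the same ~10:1 ratio you measured, with exactly one penalty per
lock acquisition surviving the filter.
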
I'll post v2 after the merge window with code comments addressing this
and other review feedback, which should be more suitable for
performance evaluation.
Wanpeng