Open Source and information security mailing list archives
Message-ID: <CANRm+CzVtzgYYwgaqEMmsOAo7m=Esd9rd-zbB7zXzgL_p5SgxQ@mail.gmail.com>
Date: Wed, 12 Nov 2025 13:01:38 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: Christian Borntraeger <borntraeger@...ux.ibm.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>, 
	Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, 
	Ilya Leoshkevich <iii@...ux.ibm.com>, Mete Durlu <meted@...ux.ibm.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
 oversubscribed KVM

Hi Christian,

On Mon, 10 Nov 2025 at 20:02, Christian Borntraeger
<borntraeger@...ux.ibm.com> wrote:
>
> Am 10.11.25 um 04:32 schrieb Wanpeng Li:
> > From: Wanpeng Li <wanpengli@...cent.com>
> >
> > This series addresses long-standing yield_to() inefficiencies in
> > virtualized environments through two complementary mechanisms: a vCPU
> > debooster in the scheduler and IPI-aware directed yield in KVM.
> >
> > Problem Statement
> > -----------------
> >
> > In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> > held by other vCPUs that are not currently running. The kernel's
> > paravirtual spinlock support detects these situations and calls yield_to()
> > to boost the lock holder, allowing it to run and release the lock.
> >
> > However, the current implementation has two critical limitations:
> >
> > 1. Scheduler-side limitation:
> >
> >     yield_to_task_fair() relies solely on set_next_buddy() to provide
> >     preference to the target vCPU. This buddy mechanism only offers
> >     immediate, transient preference. Once the buddy hint expires (typically
> >     after one scheduling decision), the yielding vCPU may preempt the target
> >     again, especially in nested cgroup hierarchies where vruntime domains
> >     differ.
> >
> >     This creates a ping-pong effect: the lock holder runs briefly, gets
> >     preempted before completing critical sections, and the yielding vCPU
> >     spins again, triggering another futile yield_to() cycle. The overhead
> >     accumulates rapidly in workloads with high lock contention.
>
> I can certainly confirm that on s390 we do see that yield_to() does not always
> work as expected. Our spinlock code is lock-holder aware, so our KVM always yields
> to the correct target, but often enough the hint is ignored or bounced back as you describe.
> So I am certainly interested in that part.
>
> I need to look more closely into the other part.

Thanks for the confirmation and interest! It's valuable to hear that
s390 observes similar yield_to() behavior where the hint gets ignored
or bounced back despite correct lock holder identification.

Since your spinlock code is already lock-holder-aware and KVM yields
to the correct target, the scheduler-side improvements (patches 1-5)
should directly address the ping-pong issue you're seeing. The
vruntime penalties are designed to sustain the preference beyond the
transient buddy hint, which should reduce the bouncing effect.

Best regards,
Wanpeng
