linux-kernel - Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM

Message-ID: <a1e5a8db-8382-4f52-8ef2-3b62b0c031ab@linux.ibm.com>
Date: Mon, 10 Nov 2025 13:02:11 +0100
From: Christian Borntraeger <borntraeger@...ux.ibm.com>
To: Wanpeng Li <kernellwp@...il.com>, Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>
Cc: Steven Rostedt <rostedt@...dmis.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, Wanpeng Li <wanpengli@...cent.com>,
        Ilya Leoshkevich <iii@...ux.ibm.com>, Mete Durlu <meted@...ux.ibm.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
 oversubscribed KVM

On 10.11.25 at 04:32, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@...cent.com>
> 
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
> 
> Problem Statement
> -----------------
> 
> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> held by other vCPUs that are not currently running. The kernel's
> paravirtual spinlock support detects these situations and calls yield_to()
> to boost the lock holder, allowing it to run and release the lock.
> 
> However, the current implementation has two critical limitations:
> 
> 1. Scheduler-side limitation:
> 
>     yield_to_task_fair() relies solely on set_next_buddy() to provide
>     preference to the target vCPU. This buddy mechanism only offers
>     immediate, transient preference. Once the buddy hint expires (typically
>     after one scheduling decision), the yielding vCPU may preempt the target
>     again, especially in nested cgroup hierarchies where vruntime domains
>     differ.
> 
>     This creates a ping-pong effect: the lock holder runs briefly, gets
>     preempted before completing critical sections, and the yielding vCPU
>     spins again, triggering another futile yield_to() cycle. The overhead
>     accumulates rapidly in workloads with high lock contention.

I can certainly confirm that on s390 we do see that yield_to() does not always
work as expected. Our spinlock code is lock-holder aware, so our KVM always yields
correctly, but often enough the hint is ignored or bounced back as you describe.
So I am certainly interested in that part.
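
To make the ping-pong from the quoted text concrete, here is a toy user-space
model in C. It is a sketch under stated assumptions, not kernel code: the
struct, the function simulate_yields(), and the "penalty" parameter are all
hypothetical names invented for illustration. The one-shot "buddy" flag only
loosely mimics a hint that expires after one scheduling decision, and the
penalty merely stands in for the idea of a more persistent preference; it does
not describe how the series implements its debooster.

```c
/* Toy model only: two tasks on one CPU, not the kernel's fair-class logic. */
struct toy_task {
    long vruntime;   /* virtual runtime; the lower value is picked first */
    int work_left;   /* ticks the lock holder still needs before unlocking */
};

/*
 * Simulate a spinner that yields to a lock holder (hypothetical helper).
 * holder_lag: how far the holder's vruntime is ahead of the spinner's.
 * work:       ticks the holder needs to finish its critical section.
 * penalty:    assumed extra vruntime charged to the spinner per yield;
 *             0 models a buddy-only hint.
 * Returns the number of yields issued before the lock is released.
 */
int simulate_yields(long holder_lag, int work, long penalty)
{
    struct toy_task spinner = { 0, 0 };
    struct toy_task holder = { holder_lag, work };
    int yields = 0;
    int buddy = 0;   /* one-shot hint: the next pick goes to the holder */

    while (holder.work_left > 0) {
        int pick_holder = buddy || holder.vruntime < spinner.vruntime;

        buddy = 0;   /* the hint expires after a single decision */

        if (pick_holder) {
            /* Holder runs for one tick of its critical section. */
            holder.work_left--;
            holder.vruntime++;
        } else {
            /* Spinner runs, finds the lock taken, yields to the holder. */
            spinner.vruntime += 1 + penalty;
            buddy = 1;
            yields++;
        }
    }
    return yields;
}
```

In this model, simulate_yields(10, 4, 0) issues one yield per tick of holder
progress (4 in total), because the spinner wins every pick that is not covered
by the hint. With a penalty of 20 the spinner yields once and the holder then
runs to completion.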

I need to look more closely into the other part.