Message-ID: <2a57185c-dce1-46c6-96f6-f51a81cd42a8@linux.ibm.com>
Date: Tue, 18 Nov 2025 09:11:33 +0100
From: Christian Borntraeger <borntraeger@...ux.ibm.com>
To: Wanpeng Li <kernellwp@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Steven Rostedt
 <rostedt@...dmis.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, Ilya Leoshkevich <iii@...ux.ibm.com>,
        Mete Durlu <meted@...ux.ibm.com>, Axel Busch <axel.busch@....com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
 oversubscribed KVM

On 12.11.25 at 06:01, Wanpeng Li wrote:
> Hi Christian,
> 
> On Mon, 10 Nov 2025 at 20:02, Christian Borntraeger
> <borntraeger@...ux.ibm.com> wrote:
>>
>> On 10.11.25 at 04:32, Wanpeng Li wrote:
>>> From: Wanpeng Li <wanpengli@...cent.com>
>>>
>>> This series addresses long-standing yield_to() inefficiencies in
>>> virtualized environments through two complementary mechanisms: a vCPU
>>> debooster in the scheduler and IPI-aware directed yield in KVM.
>>>
>>> Problem Statement
>>> -----------------
>>>
>>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
>>> held by other vCPUs that are not currently running. The kernel's
>>> paravirtual spinlock support detects these situations and calls yield_to()
>>> to boost the lock holder, allowing it to run and release the lock.
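
(For context, the KVM side of this funnels into kvm_vcpu_yield_to() ->
yield_to(); a stripped-down sketch from my reading of virt/kvm/kvm_main.c,
with the candidate selection done in kvm_vcpu_on_spin() and all error
handling omitted:

	int kvm_vcpu_yield_to(struct kvm_vcpu *target)
	{
		struct pid *pid;
		struct task_struct *task = NULL;
		int ret = 0;

		rcu_read_lock();
		pid = rcu_dereference(target->pid);
		if (pid)
			task = get_pid_task(pid, PIDTYPE_PID);
		rcu_read_unlock();
		if (!task)
			return 0;

		/* Hand scheduling preference to the task backing the target vCPU. */
		ret = yield_to(task, 1);
		put_task_struct(task);

		return ret;
	}

Everything past that yield_to() call is in the hands of the scheduler.)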
>>>
>>> However, the current implementation has two critical limitations:
>>>
>>> 1. Scheduler-side limitation:
>>>
>>>      yield_to_task_fair() relies solely on set_next_buddy() to provide
>>>      preference to the target vCPU. This buddy mechanism only offers
>>>      immediate, transient preference. Once the buddy hint expires (typically
>>>      after one scheduling decision), the yielding vCPU may preempt the target
>>>      again, especially in nested cgroup hierarchies where vruntime domains
>>>      differ.
>>>
>>>      This creates a ping-pong effect: the lock holder runs briefly, gets
>>>      preempted before completing critical sections, and the yielding vCPU
>>>      spins again, triggering another futile yield_to() cycle. The overhead
>>>      accumulates rapidly in workloads with high lock contention.
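
(To make the "transient preference" point concrete: as far as I can see,
yield_to_task_fair() in mainline kernel/sched/fair.c is essentially just the
buddy hint plus a yield; roughly, from memory and simplified:

	static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
	{
		struct sched_entity *se = &p->se;

		/* throttled hierarchies are not runnable */
		if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
			return false;

		/* One-shot hint: prefer @p at the next pick, nothing more. */
		set_next_buddy(se);

		yield_task_fair(rq);

		return true;
	}

Once that single buddy hint has been consumed, nothing keeps the yielding
vCPU from being picked over the target again, which is exactly the ping-pong
described above.)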
>>
>> I can certainly confirm that on s390 we do see that yield_to does not always
>> work as expected. Our spinlock code is lock-holder aware, so our KVM always yields
>> correctly, but often enough the hint is ignored or bounced back as you describe.
>> So I am certainly interested in that part.
>>
>> I need to look more closely into the other part.
> 
> Thanks for the confirmation and interest! It's valuable to hear that
> s390 observes similar yield_to() behavior where the hint gets ignored
> or bounced back despite correct lock holder identification.
> 
> Since your spinlock code is already lock-holder-aware and KVM yields
> to the correct target, the scheduler-side improvements (patches 1-5)
> should directly address the ping-pong issue you're seeing. The
> vruntime penalties are designed to sustain the preference beyond the
> transient buddy hint, which should reduce the bouncing effect.
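
My mental model of the deboost, before having read the patches in detail, is
something along these lines (hypothetical illustration only, not code from
this series; the penalty scale is picked arbitrarily):

	/* Hypothetical sketch: on yield_to(), push the yielder's vruntime
	 * forward so that it does not win the next pick as soon as the
	 * set_next_buddy() hint has been consumed. */
	static void yield_deboost(struct sched_entity *yielder)
	{
		u64 penalty = sysctl_sched_base_slice;	/* placeholder scale */

		yielder->vruntime += calc_delta_fair(penalty, yielder);
	}

The point being that the preference for the target then survives more than
one scheduling decision, unlike the buddy hint.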

So we will play a bit with the first patches and check for performance improvements.

I am curious: I did a quick unit test with 2 CPUs ping-ponging on a counter, and with
that testcase I still see more yield hypercalls than the counter value (as before),
something like 40060000 yields instead of the 4000000 expected for a perfect ping-pong.
If I comment out your rate-limit code I hit exactly 4000000.
Can you maybe outline a bit why the rate limiting is important and needed?
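
In case it helps to picture the test: conceptually it is just two threads,
pinned to the guest's two CPUs, taking strict turns on a shared counter. A
stripped-down, illustrative kthread sketch (not necessarily literally what I
ran; error handling omitted):

	#include <linux/module.h>
	#include <linux/kthread.h>
	#include <linux/sched.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(pp_lock);
	static unsigned long pp_counter;
	#define PP_TARGET 4000000UL

	static int pingpong(void *arg)
	{
		unsigned long me = (unsigned long)arg;	/* CPU 0 or 1 */

		while (READ_ONCE(pp_counter) < PP_TARGET) {
			spin_lock(&pp_lock);
			if ((pp_counter & 1) == me)	/* our turn? */
				pp_counter++;
			spin_unlock(&pp_lock);
			cond_resched();
		}
		return 0;
	}

	static int __init pp_init(void)
	{
		struct task_struct *t0 = kthread_create(pingpong, (void *)0UL, "pp0");
		struct task_struct *t1 = kthread_create(pingpong, (void *)1UL, "pp1");

		kthread_bind(t0, 0);
		kthread_bind(t1, 1);
		wake_up_process(t0);
		wake_up_process(t1);
		return 0;
	}
	module_init(pp_init);
	MODULE_LICENSE("GPL");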
