Message-ID: <CANRm+Cza0iiB8XqD+Jn9-eqAyjDFm9u1vqmyj9eGdVd-mpV7vg@mail.gmail.com>
Date: Thu, 13 Nov 2025 13:37:39 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>, 
	Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, 
	Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
 oversubscribed KVM

Hi Prateek,

On Wed, 12 Nov 2025 at 14:07, K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Wanpeng,
>
> On 11/12/2025 10:24 AM, Wanpeng Li wrote:
> >>> Problem Statement
> >>> -----------------
> >>>
> >>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> >>> held by other vCPUs that are not currently running. The kernel's
> >>> paravirtual spinlock support detects these situations and calls yield_to()
> >>> to boost the lock holder, allowing it to run and release the lock.
> >>>
> >>> However, the current implementation has two critical limitations:
> >>>
> >>> 1. Scheduler-side limitation:
> >>>
> >>>    yield_to_task_fair() relies solely on set_next_buddy() to provide
> >>>    preference to the target vCPU. This buddy mechanism only offers
> >>>    immediate, transient preference. Once the buddy hint expires (typically
> >>>    after one scheduling decision), the yielding vCPU may preempt the target
> >>>    again, especially in nested cgroup hierarchies where vruntime domains
> >>>    differ.
> >>
> >> So what you are saying is there are configurations out there where vCPUs
> >> of the same guest are put in different cgroups? Why? Does the use case
> >> warrant enabling the cpu controller for the subtree? Are you running
> >
> > You're right to question this. The problematic scenario occurs with
> > nested cgroup hierarchies, which is common when VMs are deployed with
> > cgroup-based resource management. Even when all vCPUs of a single
> > guest are in the same leaf cgroup, that leaf sits under parent cgroups
> > with their own vruntime domains.
> >
> > The issue manifests when:
> >    - set_next_buddy() provides preference at the leaf level
> >    - But vruntime competition happens at parent levels
>
> If that is the case, then NEXT_BUDDY is ineligible as a result of its
> vruntime being higher than the weighted average of the other entities.
> Won't this break fairness?

Yes, it temporarily breaks strict vruntime fairness, and that is
intentional. The problem is that the buddy hint expires after one pick;
vruntime comparison then wins again and the yielder ping-pongs back in,
so the spinning vCPU burns CPU while the lock holder stays preempted.
The fix applies a bounded vruntime penalty to the yielder at the cgroup
LCA level (a rough sketch of how the bounds combine follows the list):

Bounds:
  * Rate limited: a 6ms minimum interval between deboost operations
  * Queue-adaptive cap: 6.0× gran for a 2-task ping-pong, decaying to
    1.0× gran for large queues (12+ runnable tasks)
  * Debounce: a 600µs window detects A→B→A reverse-yield patterns and
    reduces the penalty
  * Hierarchy-aware: applied at the LCA, so same-cgroup yields have a
    localized impact
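
Roughly, the bounds compose like this (simplified sketch, not the code
in the series; the helper name, the sched_entity fields, the linear
decay shape and the use of sysctl_sched_base_slice as "gran" are only
illustrative):

#define DEBOOST_MIN_INTERVAL_NS	(6 * NSEC_PER_MSEC)	/* rate limit */
#define DEBOOST_DEBOUNCE_NS	(600 * NSEC_PER_USEC)	/* A->B->A window */

static u64 yield_deboost_penalty(struct cfs_rq *cfs_rq,
				 struct sched_entity *yielder, u64 now)
{
	u64 gran = sysctl_sched_base_slice;	/* stand-in for "gran" */
	unsigned int nr = cfs_rq->nr_running;
	u64 cap;

	/* Rate limit: at most one deboost per 6ms (per-yielder here;
	 * last_deboost_ns is an illustrative field, not in the patch). */
	if (now - yielder->last_deboost_ns < DEBOOST_MIN_INTERVAL_NS)
		return 0;

	/* Queue-adaptive cap: 6x gran for a 2-task ping-pong, decaying
	 * (linearly in this sketch) to 1x gran at 12+ runnable tasks. */
	if (nr <= 2)
		cap = 6 * gran;
	else if (nr >= 12)
		cap = gran;
	else
		cap = gran + div_u64(5 * gran * (12 - nr), 10);

	/* Debounce: a recent reverse (B->A) yield reduces the penalty. */
	if (now - yielder->last_reverse_yield_ns < DEBOOST_DEBOUNCE_NS)
		cap >>= 1;

	yielder->last_deboost_ns = now;
	return cap;	/* added to the yielder's vruntime at the LCA */
}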

Why this is acceptable: the current behavior is already unfair, since
CPU time is wasted on spinning instead of productive work. A bounded
vruntime penalty lets the lock holder complete faster, which reduces
the overall waste. The scheduler still converges to fairness; the
penalty only gives the boosted task a sustained advantage until it
finishes the critical section. If degradation is observed, the
mechanism can be disabled at runtime via
/sys/kernel/debug/sched/sched_vcpu_debooster_enabled. The dbench
results show that the net throughput wins (+6-14%) outweigh the
temporary fairness deviation.
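
The toggle itself is just a debugfs bool; a simplified sketch of the
wiring (registered next to the other knobs in
kernel/sched/debug.c:sched_init_debug(); the exact hookup in the
series may differ slightly):

bool sched_vcpu_debooster_enabled = true;

/* In sched_init_debug(), after the "sched" debugfs directory
 * (debugfs_sched) has been created: */
debugfs_create_bool("sched_vcpu_debooster_enabled", 0644, debugfs_sched,
		    &sched_vcpu_debooster_enabled);

Writing 0 or 1 to that file flips the behavior at runtime without a
reboot.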

Regards,
Wanpeng
