Message-ID: <CANRm+CwNS99ORAdQvrCg4rFs3TtKBR6TjEJnScdxy3uP+DRiOw@mail.gmail.com>
Date: Sun, 4 Jan 2026 10:40:30 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>
Cc: K Prateek Nayak <kprateek.nayak@....com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>, Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for
oversubscribed KVM
ping, :)
On Fri, 19 Dec 2025 at 11:53, Wanpeng Li <kernellwp@...il.com> wrote:
>
> From: Wanpeng Li <wanpengli@...cent.com>
>
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
>
> Problem Statement
> -----------------
>
> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> held by other vCPUs that are not currently running. The kernel's
> paravirtual spinlock support detects these situations and calls yield_to()
> to boost the lock holder, allowing it to run and release the lock.
>
> However, the current implementation has two critical limitations:
>
> 1. Scheduler-side limitation:
>
> yield_to_task_fair() relies solely on set_next_buddy() to provide
> preference to the target vCPU. This buddy mechanism only offers
> immediate, transient preference. Once the buddy hint expires (typically
> after one scheduling decision), the yielding vCPU may preempt the target
> again, especially in nested cgroup hierarchies where vruntime domains
> differ.
>
> This creates a ping-pong effect: the lock holder runs briefly, gets
> preempted before completing critical sections, and the yielding vCPU
> spins again, triggering another futile yield_to() cycle. The overhead
> accumulates rapidly in workloads with high lock contention.
>
> 2. KVM-side limitation:
>
> kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
> directed yield candidate selection. However, it lacks awareness of IPI
> communication patterns. When a vCPU sends an IPI and spins waiting for
> a response (common in inter-processor synchronization), the current
> heuristics often fail to identify the IPI receiver as the yield target.
>
> Instead, the code may boost an unrelated vCPU based on coarse-grained
> preemption state, missing opportunities to accelerate actual IPI
> response handling. This is particularly problematic when the IPI
> receiver is runnable but not scheduled, as lock-holder-detection logic
> doesn't capture the IPI dependency relationship.
>
> Combined, these issues cause excessive lock hold times, cache thrashing,
> and degraded throughput in overcommitted environments, particularly
> affecting workloads with fine-grained synchronization patterns.
>
> Solution Overview
> -----------------
>
> The series introduces two orthogonal improvements that work synergistically:
>
> Part 1: Scheduler vCPU Debooster (patches 1-5)
>
> Augment yield_to_task_fair() with bounded vruntime penalties to provide
> sustained preference beyond the buddy mechanism. When a vCPU yields to a
> target, apply a carefully tuned vruntime penalty to the yielding vCPU,
> ensuring the target maintains scheduling advantage for longer periods.
>
> The mechanism is EEVDF-aware and cgroup-hierarchy-aware:
>
> - Locate the lowest common ancestor (LCA) in the cgroup hierarchy where
> both the yielding and target tasks coexist. This ensures vruntime
> adjustments occur at the correct hierarchy level, maintaining fairness
> across cgroup boundaries.
>
> - Update EEVDF scheduler fields (vruntime, deadline) atomically to keep
> the scheduler state consistent. Note that vlag is intentionally not
> modified as it will be recalculated on dequeue/enqueue cycles. The
> penalty shifts the yielding task's virtual deadline forward, allowing
> the target to run.
>
> - Apply queue-size-adaptive penalties that scale from 6.0x scheduling
> granularity for 2-task scenarios (strong preference) down to 1.0x for
> large queues (>12 tasks), balancing preference against starvation risks.
>
> - Implement reverse-pair debouncing: when task A yields to B, then B yields
> to A within a short window (~600us), downscale the penalty to prevent
> ping-pong oscillation.
>
> - Rate-limit penalty application to 6ms intervals to prevent pathological
> overhead when yields occur at very high frequency.
>
> The debooster works *with* the buddy mechanism rather than replacing it:
> set_next_buddy() provides immediate preference for the next scheduling
> decision, while the vruntime penalty sustains that preference over
> subsequent decisions. This dual approach proves especially effective in
> nested cgroup scenarios where buddy hints alone are insufficient.
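>
> As a rough illustration of the queue-size-adaptive scaling described
> above, a minimal C sketch follows. It assumes a linear taper between
> the 6.0x and 1.0x endpoints; the function and variable names are
> illustrative, not necessarily those used in patch 4:
>
>   /*
>    * Hypothetical sketch: queue-size-adaptive deboost penalty.
>    * Assumes a linear taper from 6.0x gran at 2 queued tasks down
>    * to 1.0x at 12 or more; the actual curve may differ.
>    */
>   static u64 deboost_penalty(u64 gran, unsigned int h_nr_queued)
>   {
>           /* scale in tenths: 60 at n == 2 down to 10 at n >= 12 */
>           unsigned int n = clamp(h_nr_queued, 2u, 12u);
>           unsigned int scale_tenths = 60 - (n - 2) * 5;
>
>           return div_u64(gran * scale_tenths, 10);
>   }
>
> In the series proper, the result is additionally gated by the
> reverse-pair debounce and the 6ms rate limit before being applied at
> the LCA level.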
>
> Part 2: KVM IPI-Aware Directed Yield (patches 6-9)
>
> Enhance kvm_vcpu_on_spin() with lightweight IPI tracking to improve
> directed yield candidate selection. Track sender/receiver relationships
> when IPIs are delivered and use this information to prioritize yield
> targets.
>
> The tracking mechanism:
>
> - Hooks into kvm_irq_delivery_to_apic() to detect unicast fixed IPIs (the
> common case for inter-processor synchronization). When exactly one
> destination vCPU receives an IPI, record the sender->receiver relationship
> with a monotonic timestamp.
>
> In high-VM-density scenarios, software-based IPI tracking through
> interrupt delivery interception becomes particularly valuable. It
> captures precise sender/receiver relationships that can be leveraged
> for intelligent scheduling decisions, providing benefits that
> complement, and in overcommitted environments can even exceed, those
> of hardware-accelerated interrupt delivery.
>
> - Uses lockless READ_ONCE/WRITE_ONCE accessors for minimal overhead. The
> per-vCPU ipi_context structure is carefully designed to avoid cache line
> bouncing.
>
> - Implements a short recency window (50ms default) to avoid stale IPI
> information inflating boost priority on throughput-sensitive workloads.
> Old IPI relationships are naturally aged out.
>
> - Clears IPI context on EOI with two-stage precision: unconditionally clear
> the receiver's context (it processed the interrupt), but only clear the
> sender's pending flag if the receiver matches and the IPI is recent. This
> prevents unrelated EOIs from prematurely clearing valid IPI state.
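>
> Roughly, the per-vCPU bookkeeping described above could look like the
> following sketch (structure and helper names here are illustrative;
> the real fields are added to the KVM x86 headers by patches 6-7):
>
>   /* Hypothetical per-vCPU IPI context, accessed locklessly. */
>   struct ipi_context {
>           int pending_to;   /* receiver vcpu_id, or -1 if none */
>           u64 sent_ns;      /* monotonic timestamp of last IPI */
>   };
>
>   static void record_ipi(struct ipi_context *ctx, int dst_id)
>   {
>           WRITE_ONCE(ctx->pending_to, dst_id);
>           WRITE_ONCE(ctx->sent_ns, ktime_get_mono_fast_ns());
>   }
>
>   /* Recency check against the (default 50ms) window. */
>   static bool ipi_recent(struct ipi_context *ctx, int dst_id,
>                          u64 window_ns)
>   {
>           return READ_ONCE(ctx->pending_to) == dst_id &&
>                  ktime_get_mono_fast_ns() -
>                  READ_ONCE(ctx->sent_ns) < window_ns;
>   }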
>
> The candidate selection follows a priority hierarchy:
>
> Priority 1: Confirmed IPI receiver
> If the spinning vCPU recently sent an IPI to another vCPU and that IPI
> is still pending (within the recency window), unconditionally boost the
> receiver. This directly addresses the "spinning on IPI response" case.
>
> Priority 2: Fast pending interrupt
> Leverage arch-specific kvm_arch_dy_has_pending_interrupt() for
> compatibility with existing optimizations.
>
> Priority 3: Preempted in kernel mode
> Fall back to traditional preemption-based logic when yield_to_kernel_mode
> is requested, ensuring compatibility with existing workloads.
>
> A two-round fallback mechanism provides a safety net: if the first round
> with strict IPI-aware selection finds no eligible candidate (e.g., due to
> missed IPI context or transient runnable set changes), a second round
> applies relaxed selection gated only by preemption state. This is
> controlled by the enable_relaxed_boost module parameter (default on).
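>
> Condensed into pseudo-C, the per-candidate check for one scan round
> might read as below. dy_eligible() is a made-up name and the real
> logic in patches 8-9 is spread across kvm_vcpu_on_spin(); the second,
> relaxed round is folded into the last parameter:
>
>   static bool dy_eligible(struct kvm_vcpu *me, struct kvm_vcpu *v,
>                           bool yield_to_kernel_mode, bool relaxed)
>   {
>           if (!relaxed) {
>                   /* Priority 1: confirmed, still-recent IPI receiver */
>                   if (kvm_vcpu_is_ipi_receiver(me, v))
>                           return true;
>
>                   /* Priority 2: arch-reported fast pending interrupt */
>                   if (kvm_arch_dy_has_pending_interrupt(v))
>                           return true;
>
>                   /* Priority 3: traditional preemption-based logic */
>                   return READ_ONCE(v->preempted) &&
>                          (!yield_to_kernel_mode ||
>                           kvm_arch_vcpu_in_kernel(v));
>           }
>
>           /* Round two: relaxed, gated only by preemption state. */
>           return READ_ONCE(v->preempted);
>   }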
>
> Implementation Details
> ----------------------
>
> Both mechanisms are designed for minimal overhead and runtime control:
>
> - All locking occurs under existing rq->lock or per-vCPU locks; no new
> lock contention is introduced.
>
> - Penalty calculations use integer arithmetic with overflow protection.
>
> - IPI tracking uses monotonic timestamps (ktime_get_mono_fast_ns()) for
> efficient, race-free recency checks.
>
> Advantages over paravirtualization approaches:
>
> - No guest OS modification required: This solution operates entirely within
> the host kernel, providing transparent optimization without guest kernel
> changes or recompilation.
>
> - Guest OS agnostic: Works uniformly across Linux, Windows, and other guest
> operating systems, unlike PV TLB shootdown which requires guest-side
> paravirtual driver support.
>
> - Broader applicability: Captures IPI patterns from all synchronization
> primitives (spinlocks, RCU, smp_call_function, etc.), not limited to
> specific paravirtualized operations like TLB shootdown.
>
> - Deployment simplicity: Existing VM images benefit immediately without
> guest kernel updates, critical for production environments with diverse
> guest OS versions and configurations.
>
> - Runtime controls allow disabling features if needed:
> * /sys/kernel/debug/sched/vcpu_debooster_enabled
> * /sys/module/kvm/parameters/ipi_tracking_enabled
> * /sys/module/kvm/parameters/enable_relaxed_boost
>
> - The infrastructure is incrementally introduced: early patches add inert
> scaffolding that can be verified for zero performance impact before
> activation.
>
> Performance Results
> -------------------
>
> Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM
>
> Dbench 16 clients per VM (filesystem metadata operations):
> 2 VMs: +14.4% throughput (lock contention reduction)
> 3 VMs: +9.8% throughput
> 4 VMs: +6.7% throughput
>
> PARSEC Dedup benchmark, simlarge input (memory-intensive):
> 2 VMs: +47.1% throughput (IPI-heavy synchronization)
> 3 VMs: +28.1% throughput
> 4 VMs: +1.7% throughput
>
> PARSEC VIPS benchmark, simlarge input (compute-intensive):
> 2 VMs: +26.2% throughput (balanced sync and compute)
> 3 VMs: +12.7% throughput
> 4 VMs: +6.0% throughput
>
> Analysis:
>
> - Gains are most pronounced at moderate overcommit (2-3 VMs). At this level,
> contention is significant enough to benefit from better yield behavior,
> but context switch overhead remains manageable.
>
> - Dedup shows the strongest improvement (+47.1% at 2 VMs) due to its
> IPI-heavy synchronization patterns. The IPI-aware directed yield
> precisely targets the bottleneck.
>
> - At 4 VMs (heavier overcommit), gains diminish as general CPU contention
> dominates. However, performance never regresses, indicating that the
> mechanisms degrade gracefully.
>
> - In certain high-density, resource-overcommitted deployment scenarios, the
> performance benefits of APICv can be constrained by scheduling and
> contention patterns. In such cases, software-based IPI tracking serves as
> a complementary optimization path, offering targeted scheduling hints
> without requiring APICv to be disabled. The practical choice should be
> weighed against workload characteristics and platform configuration.
>
> - Dbench benefits primarily from the scheduler-side debooster, as its lock
> patterns involve less IPI spinning and more direct lock holder boosting.
>
> The performance gains stem from three factors:
>
> 1. Lock holders receive sustained CPU time to complete critical sections,
> reducing overall lock hold duration and cascading contention.
>
> 2. IPI receivers are promptly scheduled when senders spin, minimizing IPI
> response latency and reducing wasted spin cycles.
>
> 3. Better cache utilization results from reduced context switching between
> lock waiters and holders.
>
> Patch Organization
> ------------------
>
> The series is organized for incremental review and bisectability:
>
> Patches 1-5: Scheduler vCPU debooster
>
> Patch 1: Add infrastructure (per-rq tracking, sysctl, debugfs entry)
> Infrastructure is inert; no functional change.
>
> Patch 2: Add rate-limiting and validation helpers
> Static functions with comprehensive safety checks.
>
> Patch 3: Add cgroup LCA finder for hierarchical yield
> Implements CONFIG_FAIR_GROUP_SCHED-aware LCA location.
>
> Patch 4: Add penalty calculation and application logic
> Core algorithms with queue-size adaptation and debouncing.
>
> Patch 5: Wire up yield deboost in yield_to_task_fair()
> Activation patch. Includes Dbench performance data.
>
> Patches 6-9: KVM IPI-aware directed yield
>
> Patch 6: Add IPI tracking infrastructure
> Per-vCPU context, module parameters, helper functions.
> Infrastructure is inert until activated.
>
> Patch 7: Integrate IPI tracking with LAPIC interrupt delivery
> Hook into kvm_irq_delivery_to_apic() and EOI handling.
>
> Patch 8: Implement IPI-aware directed yield candidate selection
> Replace candidate selection logic with priority-based approach.
> Includes PARSEC performance data.
>
> Patch 9: Add relaxed boost as safety net
> Two-round fallback mechanism for robustness.
>
> Each patch compiles and boots independently. Performance data is presented
> where the relevant mechanism becomes active (patches 5 and 8).
>
> Testing
> -------
>
> Workloads tested:
>
> - Dbench (filesystem metadata stress)
> - PARSEC benchmarks (Dedup, VIPS, Ferret, Blackscholes)
> - Kernel compilation (make -j16 in each VM)
>
> No regressions observed on any configuration. The mechanisms show neutral
> to positive impact across diverse workloads.
>
> Future Work
> -----------
>
> Potential extensions beyond this series:
>
> - Adaptive recency window: dynamically adjust ipi_window_ns based on
> observed workload patterns.
>
> - Extended tracking: consider multi-round IPI patterns (A->B->C->A).
>
> - Cross-NUMA awareness: penalty scaling based on NUMA distances.
>
> These are intentionally deferred to keep this series focused and reviewable.
>
> v1 -> v2:
> - Rebase onto v6.19-rc1 (v1 was based on v6.18-rc4)
> - Drop "KVM: Fix last_boosted_vcpu index assignment bug" patch as v6.19-rc1
> already contains this fix
> - Scheduler debooster changes:
> * Adapt to v6.19's EEVDF forfeit behavior: yield_to_deboost() must be
> called BEFORE yield_task_fair() to preserve the fairness gap
> calculation. In v6.19+, yield_task_fair() performs forfeit
> (se->vruntime = se->deadline), which would inflate the yielding
> entity's vruntime before the penalty is calculated, causing need=0 so
> that only the baseline penalty is applied.
> * Change from rq->curr to rq->donor for correct EEVDF donor tracking
> * Change from nr_queued to h_nr_queued for accurate hierarchical task
> counting in penalty cap calculation
> * Remove vlag assignment as it will be recalculated on dequeue/enqueue
> and modifying it for an on-rq entity is incorrect
> * Remove update_min_vruntime() call: in EEVDF the yielding entity is
> always cfs_rq->curr (dequeued from RB-tree), so modifying its vruntime
> does not affect min_vruntime calculation
> * Remove unnecessary gran_floor safeguard (calc_delta_fair already
> handles edge cases correctly)
> * Rename debugfs entry from sched_vcpu_debooster_enabled to
> vcpu_debooster_enabled for consistency
> - KVM IPI tracking changes:
> * Improve documentation for module parameters
> * Add kvm_vcpu_is_ipi_receiver() declaration to x86.h header
>
> Wanpeng Li (9):
> sched: Add vCPU debooster infrastructure
> sched/fair: Add rate-limiting and validation helpers
> sched/fair: Add cgroup LCA finder for hierarchical yield
> sched/fair: Add penalty calculation and application logic
> sched/fair: Wire up yield deboost in yield_to_task_fair()
> KVM: x86: Add IPI tracking infrastructure
> KVM: x86/lapic: Integrate IPI tracking with interrupt delivery
> KVM: Implement IPI-aware directed yield candidate selection
> KVM: Relaxed boost as safety net
>
> arch/x86/include/asm/kvm_host.h | 12 ++
> arch/x86/kvm/lapic.c | 166 ++++++++++++++++-
> arch/x86/kvm/x86.c | 3 +
> arch/x86/kvm/x86.h | 8 +
> include/linux/kvm_host.h | 3 +
> kernel/sched/core.c | 9 +-
> kernel/sched/debug.c | 2 +
> kernel/sched/fair.c | 305 ++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 12 ++
> virt/kvm/kvm_main.c | 74 +++++++-
> 10 files changed, 579 insertions(+), 15 deletions(-)
>
> --
> 2.43.0
>