Message-ID: <CANRm+CxZfFVk=dX3Koi_RUH6ppr_zc6fs3HHPaYkRGwV7h9L7w@mail.gmail.com>
Date: Wed, 12 Nov 2025 12:54:56 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
oversubscribed KVM
Hi Prateek,
On Tue, 11 Nov 2025 at 14:28, K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Wanpeng,
>
> I haven't looked at the entire series and the penalty calculation math
> but I've a few questions looking at the cover-letter.
Thanks for the review and the thoughtful questions.
>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@...cent.com>
> >
> > This series addresses long-standing yield_to() inefficiencies in
> > virtualized environments through two complementary mechanisms: a vCPU
> > debooster in the scheduler and IPI-aware directed yield in KVM.
> >
> > Problem Statement
> > -----------------
> >
> > In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> > held by other vCPUs that are not currently running. The kernel's
> > paravirtual spinlock support detects these situations and calls yield_to()
> > to boost the lock holder, allowing it to run and release the lock.
> >
> > However, the current implementation has two critical limitations:
> >
> > 1. Scheduler-side limitation:
> >
> > yield_to_task_fair() relies solely on set_next_buddy() to provide
> > preference to the target vCPU. This buddy mechanism only offers
> > immediate, transient preference. Once the buddy hint expires (typically
> > after one scheduling decision), the yielding vCPU may preempt the target
> > again, especially in nested cgroup hierarchies where vruntime domains
> > differ.
>
> So what you are saying is there are configurations out there where vCPUs
> of same guest are put in different cgroups? Why? Does the use case
> warrant enabling the cpu controller for the subtree? Are you running
You're right to question this. The problematic scenario occurs with
nested cgroup hierarchies, which are common when VMs are deployed with
cgroup-based resource management. Even when all vCPUs of a single
guest are in the same leaf cgroup, that leaf sits under parent cgroups
with their own vruntime domains.
The issue manifests when:
- set_next_buddy() provides preference at the leaf level,
- but vruntime competition happens at the parent levels, and
- the buddy hint gets "diluted" as pick_task_fair() traverses the hierarchy.
The cpu controller is typically enabled in these deployments for quota
enforcement and weight-based sharing. That said, the debooster
mechanism is designed to be general-purpose: it handles any scenario
where yield_to() crosses cgroup boundaries, whether due to nested
hierarchies or sibling cgroups.
> with the "NEXT_BUDDY" sched feat enabled?
Yes, NEXT_BUDDY is enabled. The problem is that set_next_buddy()
provides only immediate, transient preference. Once the buddy hint is
consumed (typically after one pick_next_task_fair() call), the
yielding vCPU can preempt the target again if their vruntime values
haven't diverged sufficiently.
>
> If they are in the same cgroup, the recent optimizations/fixes to
> yield_task_fair() in queue:sched/core should help remedy some of the
> problems you might be seeing.
Agreed - the recent yield_task_fair() improvements in queue:sched/core
(the EEVDF-based vruntime = deadline advance, extended here with your
hierarchical walk) are valuable. However, our patchset focuses on
yield_to() rather than yield(), and the two have different semantics:
- yield_task_fair(): "I voluntarily give up CPU, pick someone else"
→ Recent improvements handle this well with hierarchical walk
- yield_to_task_fair(): "I want *this specific task* to run
instead" → Requires finding the LCA of yielder and target, then
applying penalties at that level to influence their relative
competition
The debooster extends yield_to() to handle cross-cgroup scenarios
where the yielder and target may be in different subtrees.
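For concreteness, the level-matching step is conceptually similar to the
existing find_matching_se() walk: bring both entities to the same depth,
then walk up in lockstep until they share a parent, and apply the penalty
at that level. A minimal, self-contained sketch of the idea (simplified
types and a made-up helper name, not the actual patch code):

/*
 * Illustrative sketch only: a simplified entity with parent/depth, not
 * the kernel's struct sched_entity. Mirrors the find_matching_se() style
 * walk to locate the level where yielder and target actually compete.
 */
struct entity {
        struct entity *parent;
        int depth;
};

static void find_common_level(struct entity **yielder, struct entity **target)
{
        /* Bring both entities to the same depth first. */
        while ((*yielder)->depth > (*target)->depth)
                *yielder = (*yielder)->parent;
        while ((*target)->depth > (*yielder)->depth)
                *target = (*target)->parent;

        /* Walk up in lockstep until both sit under the same parent (LCA level). */
        while ((*yielder)->parent != (*target)->parent) {
                *yielder = (*yielder)->parent;
                *target = (*target)->parent;
        }
        /* The vruntime penalty is applied to *yielder at this level. */
}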
>
> For multiple cgroups, perhaps you can extend yield_task_fair() to do:
Thanks for the suggestion. Your hierarchical walk approach shares
similarities with our implementation. A few questions on the details:
>
> ( Only build and boot tested on top of
> git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> at commit f82a0f91493f "sched/deadline: Minor cleanup in
> select_task_rq_dl()" )
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b4617d631549..87560f5a18b3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> * which yields immediately again; without the condition the vruntime
> * ends up quickly running away.
> */
> - if (entity_eligible(cfs_rq, se)) {
> + do {
> + cfs_rq = cfs_rq_of(se);
> +
> + /*
> + * Another entity will be selected at next pick.
> + * Single entity on cfs_rq can never be ineligible.
> + */
> + if (!entity_eligible(cfs_rq, se))
> + break;
> +
> se->vruntime = se->deadline;
Setting vruntime = deadline effectively discards the entity's current
lag. Does this cause fairness drift with repeated yields? In our series
we explicitly recalculate vlag after the adjustment to preserve the
EEVDF invariants (a rough sketch of what we mean follows the quoted
diff below).
> se->deadline += calc_delta_fair(se->slice, se);
> - }
> +
> + /*
> + * If we have more than one runnable task queued below
> + * this cfs_rq, the next pick will likely go for a
> + * different entity now that we have advanced the
> + * vruntime and the deadline of the running entity.
> + */
> + if (cfs_rq->h_nr_runnable > 1)
Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
correctly. Shouldn't the penalty apply at the LCA of yielder and
target? Otherwise the vruntime adjustment might not affect the level
where they actually compete.
> + break;
> + } while ((se = parent_entity(se)));
> }
>
> static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> ---
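On the vlag point above, here is roughly what we mean by recomputing the
lag after moving vruntime, so the EEVDF "lag = V - v" bookkeeping stays
consistent. The types and names are illustrative, and the clamp mirrors
the spirit of update_entity_lag() rather than copying the patch code:

struct toy_se {
        long long vruntime;
        long long vlag;
};

/*
 * Illustrative only: after pushing an entity's vruntime forward, rederive
 * its lag against the queue's weighted average vruntime and clamp it, so
 * a later placement doesn't reuse a stale positive lag.
 */
static void recompute_vlag(struct toy_se *se, long long avg_vruntime,
                           long long limit)
{
        long long vlag = avg_vruntime - se->vruntime;

        if (vlag > limit)
                vlag = limit;
        else if (vlag < -limit)
                vlag = -limit;
        se->vlag = vlag;
}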
Fixed one-slice penalties underperformed in our testing; we found
adaptive scaling (6.0× down to 1.0× based on queue size) necessary to
balance effectiveness against starvation, and it is the adaptive
variant that gives the dbench gains of +14.4%/+9.8%/+6.7% for 2/3/4
VMs.
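To make "adaptive scaling" concrete, the shape is roughly the following:
a one-slice base penalty multiplied by a factor that starts near 6.0x on
lightly loaded queues and tapers toward 1.0x as the queue deepens, so a
deep runqueue doesn't starve the yielder. The breakpoints below are
illustrative, not the exact constants from the series:

/*
 * Illustrative sketch of the adaptive penalty, not the exact formula
 * from the series. Fixed-point (x10) factor to stay integer-only.
 */
static unsigned long long yield_penalty(unsigned long long base_slice,
                                        unsigned int nr_queued)
{
        unsigned int scale_x10;

        if (nr_queued <= 2)
                scale_x10 = 60;         /* ~6.0x on near-empty queues */
        else if (nr_queued <= 4)
                scale_x10 = 30;
        else if (nr_queued <= 8)
                scale_x10 = 20;
        else
                scale_x10 = 10;         /* ~1.0x when the queue is deep */

        return base_slice * scale_x10 / 10;
}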
>
> With that, I'm pretty sure there is a good chance we'll not select the
> hierarchy that did a yield_to() unless there is a large discrepancy in
> their weights and just advancing se->vruntime to se->deadline once isn't
> enough to make it ineligible and you'll have to do it multiple time (at
> which point that cgroup hierarchy needs to be studied).
>
> As for the problem that NEXT_BUDDY hint is used only once, you can
> perhaps reintroduce LAST_BUDDY which sets does a set_next_buddy() for
> the "prev" task during schedule?
That's an interesting idea. However, LAST_BUDDY was removed from the
scheduler due to concerns about fairness and latency regressions in
general workloads. Reintroducing it globally might regress non-vCPU
workloads.
Our approach is more targeted: it applies vruntime penalties
specifically in the yield_to() path (controlled by a debugfs flag),
avoiding any impact on general scheduling. The debooster is inert
unless explicitly enabled, and it is rate-limited to prevent
pathological overhead; the gating is roughly the sketch below.
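To spell out "inert unless enabled": the penalty path is guarded by a
check along these lines before any vruntime is touched. The knob, field
names, and the 1ms interval are made up for the sketch; in the series
the switch lives under debugfs:

/*
 * Illustrative gating only: skip the deboost unless the feature is
 * enabled and enough time has passed since the last deboost on this
 * runqueue.
 */
static int deboost_allowed(unsigned long long now_ns,
                           unsigned long long *last_deboost_ns,
                           int enabled)
{
        const unsigned long long min_interval_ns = 1000000ULL; /* e.g. 1ms */

        if (!enabled)
                return 0;
        if (now_ns - *last_deboost_ns < min_interval_ns)
                return 0;
        *last_deboost_ns = now_ns;
        return 1;
}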
>
> >
> > This creates a ping-pong effect: the lock holder runs briefly, gets
> > preempted before completing critical sections, and the yielding vCPU
> > spins again, triggering another futile yield_to() cycle. The overhead
> > accumulates rapidly in workloads with high lock contention.
> >
> > 2. KVM-side limitation:
> >
> > kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
> > directed yield candidate selection. However, it lacks awareness of IPI
> > communication patterns. When a vCPU sends an IPI and spins waiting for
> > a response (common in inter-processor synchronization), the current
> > heuristics often fail to identify the IPI receiver as the yield target.
>
> Can't that be solved on the KVM end?
Yes, the IPI tracking is entirely KVM-side (patches 6-10). The
scheduler-side debooster (patches 1-5) and KVM-side IPI tracking are
orthogonal mechanisms:
- Debooster: sustains yield_to() preference regardless of *who* is
yielding to whom
- IPI tracking: improves *which* target is selected when a vCPU spins
Both showed independent gains in our testing, and combined effects
were approximately additive.
> Also shouldn't Patch 6 be on top with a "Fixes:" tag.
You're right. Patch 6 (last_boosted_vcpu bug fix) is a standalone
bugfix and should be at the top with a Fixes tag. I'll reorder it in
v2 with:
Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to
use a single for-loop")
>
> >
> > Instead, the code may boost an unrelated vCPU based on coarse-grained
> > preemption state, missing opportunities to accelerate actual IPI
> > response handling. This is particularly problematic when the IPI receiver
> > is runnable but not scheduled, as lock-holder-detection logic doesn't
> > capture the IPI dependency relationship.
>
> Are you saying the yield_to() is called with an incorrect target vCPU?
Yes - more precisely, the issue is in kvm_vcpu_on_spin()'s target
selection logic before yield_to() is ever called. Without IPI tracking
it relies on coarse preemption state, which doesn't capture "this vCPU
is waiting for an IPI response from one specific other vCPU."
The IPI tracking records sender→receiver relationships at interrupt
delivery time (patch 8), enabling kvm_vcpu_on_spin() to directly boost
the IPI receiver when the sender spins (patch 9). This addresses
scenarios where the spinning vCPU is waiting for IPI acknowledgment
rather than lock release.
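Reduced to a sketch, the mechanism looks like the model below. The
struct and field names (last_ipi_target, etc.) are illustrative, not the
actual patch code; in the series the recording happens in the interrupt
delivery path (patch 8) and the lookup in kvm_vcpu_on_spin() (patch 9):

#include <stddef.h>

/* Illustrative model of the IPI-aware directed yield, not KVM code. */
struct toy_vcpu {
        int id;
        int running;
        int runnable;
        int last_ipi_target;    /* -1 if none recorded */
};

/* Modeled IPI delivery: remember sender -> receiver. */
static void record_ipi(struct toy_vcpu *sender, const struct toy_vcpu *receiver)
{
        sender->last_ipi_target = receiver->id;
}

/* Modeled spin: try the remembered receiver before other heuristics. */
static struct toy_vcpu *pick_yield_target(struct toy_vcpu *sender,
                                          struct toy_vcpu *vcpus, int nr)
{
        int id = sender->last_ipi_target;

        if (id >= 0 && id < nr && vcpus[id].runnable && !vcpus[id].running)
                return &vcpus[id];
        return NULL;    /* fall back to existing candidate selection */
}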
Performance (16 pCPU host, 16 vCPUs/VM, PARSEC workloads):
- Dedup: +47.1%/+28.1%/+1.7% for 2/3/4 VMs
- VIPS: +26.2%/+12.7%/+6.0% for 2/3/4 VMs
Gains are most pronounced at moderate overcommit where the IPI
receiver is often runnable but not scheduled.
Thanks again for the review and suggestions.
Best regards,
Wanpeng