Message-ID: <079d48d9-a663-49d5-9bc6-f6518eb54810@amd.com>
Date: Thu, 13 Nov 2025 10:12:13 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Wanpeng Li <kernellwp@...il.com>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
	Sean Christopherson <seanjc@...gle.com>, Steven Rostedt
	<rostedt@...dmis.org>, Vincent Guittot <vincent.guittot@...aro.org>, "Juri
 Lelli" <juri.lelli@...hat.com>, <linux-kernel@...r.kernel.org>,
	<kvm@...r.kernel.org>, Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
 oversubscribed KVM

Hello Wanpeng,

On 11/12/2025 10:24 AM, Wanpeng Li wrote:
>>
>> ( Only build and boot tested on top of
>>     git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
>>   at commit f82a0f91493f "sched/deadline: Minor cleanup in
>>   select_task_rq_dl()" )
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b4617d631549..87560f5a18b3 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>>          * which yields immediately again; without the condition the vruntime
>>          * ends up quickly running away.
>>          */
>> -       if (entity_eligible(cfs_rq, se)) {
>> +       do {
>> +               cfs_rq = cfs_rq_of(se);
>> +
>> +               /*
>> +                * Another entity will be selected at next pick.
>> +                * Single entity on cfs_rq can never be ineligible.
>> +                */
>> +               if (!entity_eligible(cfs_rq, se))
>> +                       break;
>> +
>>                 se->vruntime = se->deadline;
> 
> Setting vruntime = deadline zeros out lag. Does this cause fairness
> drift with repeated yields? We explicitly recalculate vlag after
> adjustment to preserve EEVDF invariants.

We only push the deadline when the entity is eligible; an ineligible
entity will break out above. Also, I don't get how adding a penalty to
an entity in the cgroup hierarchy of the yielding task, when there are
other runnable tasks, can be considered "preserv(ing) EEVDF invariants".

> 
>>                 se->deadline += calc_delta_fair(se->slice, se);
>> -       }
>> +
>> +               /*
>> +                * If we have more than one runnable task queued below
>> +                * this cfs_rq, the next pick will likely go for a
>> +                * different entity now that we have advanced the
>> +                * vruntime and the deadline of the running entity.
>> +                */
>> +               if (cfs_rq->h_nr_runnable > 1)
> 
> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> correctly. Shouldn't the penalty apply at the LCA of yielder and
> target? Otherwise the vruntime adjustment might not affect the level
> where they actually compete.

So here is the case I'm going after - consider the following
hierarchy:

     root
    /    \
  CG0   CG1
   |     |
   A     B

  CG* are cgroups and [A-Z]* are tasks

A decides to yield to B, and advances its deadline on CG0's timeline.
Currently, if CG0 is eligible and CG1 isn't, the pick will still select
CG0, which will in turn select task A, and it'll yield again. This
cycle repeats until the vruntime of CG0 grows large enough to make it
ineligible and route the EEVDF pick to CG1.

Now consider:


       root
      /    \
    CG0   CG1
   /   \   |
  A     C  B

Same scenario: A yields to B. A advances its vruntime and deadline
as part of the yield. Now, why should CG0 sacrifice its fair share of
runtime just so that B can run, when task C under CG0 is also runnable
and waiting? Just because one task decided to yield to another task in
a different cgroup shouldn't mean the other waiting tasks in that
hierarchy have to suffer.
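
(To make the stop condition concrete, here is a toy user-space
rendering of the walk in the diff above - simplified types, eligibility
stubbed out - showing that the walk never climbs past a level that
still has other runnable work queued below it, so siblings like C don't
pay for A's yield.)

#include <stdbool.h>
#include <stdint.h>

struct toy_se {
        int64_t vruntime;
        int64_t deadline;
        int64_t slice;
        unsigned int h_nr_runnable;     /* runnable tasks below this level */
        bool eligible;                  /* stand-in for entity_eligible() */
        struct toy_se *parent;          /* NULL at the root */
};

void toy_yield_walk(struct toy_se *se)
{
        do {
                /* An ineligible entity is already passed over by the pick. */
                if (!se->eligible)
                        break;

                /* Burn the remaining slice at this level. */
                se->vruntime = se->deadline;
                se->deadline += se->slice;

                /*
                 * Other runnable tasks queued below this level (e.g. C
                 * under CG0): the next pick can go to them, so don't
                 * penalise the levels above.
                 */
                if (se->h_nr_runnable > 1)
                        break;
        } while ((se = se->parent));
}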

> 
>> +                       break;
>> +       } while ((se = parent_entity(se)));
>>  }
>>
>>  static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
>> ---
> 
> Fixed one-slice penalties underperformed in our testing (dbench:
> +14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
> down to 1.0× based on queue size) necessary to balance effectiveness
> against starvation.

If all vCPUs of a VM are in the same cgroup, yield_to() should work
just fine. If the "target" task is not selected, then either some
entity in its hierarchy or the task itself is ineligible, and the
EEVDF pick has decided to go with something else.

It is not "starvation"; rather, you've received your fair share of
"proportional runtime" and now you wait. If you really want to follow
EEVDF, maybe compute the vlag of the yielder and, if it is behind the
avg_vruntime, account that to the "target" task - that would be in
the spirit of the EEVDF algorithm.
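
(Purely as an illustration of what I mean - not a tested patch, and the
names are made up: if the yielder still has positive lag, i.e. it is
behind the average and is still owed service, hand that entitlement
over to the target instead of repeatedly penalising the yielder's
whole hierarchy.)

#include <stdint.h>

struct toy_task {
        int64_t vruntime;
};

void toy_yield_to_credit(struct toy_task *yielder, struct toy_task *target,
                         int64_t avg_vruntime)
{
        /* Positive lag: the yielder is behind avg and still owed service. */
        int64_t lag = avg_vruntime - yielder->vruntime;

        if (lag <= 0)
                return;

        /* The yielder gives up the service it is owed ... */
        yielder->vruntime += lag;
        /* ... and the target is pulled that much earlier in virtual time. */
        target->vruntime -= lag;
}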

> 
>>
>> With that, I'm pretty sure there is a good chance we'll not select the
>> hierarchy that did a yield_to(), unless there is a large discrepancy in
>> their weights and just advancing se->vruntime to se->deadline once isn't
>> enough to make it ineligible, in which case you'll have to do it multiple
>> times (at which point that cgroup hierarchy needs to be studied).
>>
>> As for the problem that the NEXT_BUDDY hint is used only once, you can
>> perhaps reintroduce LAST_BUDDY, which does a set_next_buddy() for the
>> "prev" task during schedule()?
> 
> That's an interesting idea. However, LAST_BUDDY was removed from the
> scheduler due to concerns about fairness and latency regressions in
> general workloads. Reintroducing it globally might regress non-vCPU
> workloads.
> 
> Our approach is more targeted: apply vruntime penalties specifically
> in the yield_to() path (controlled by debugfs flag), avoiding impact
> on general scheduling. The debooster is inert unless explicitly
> enabled and rate-limited to prevent pathological overhead.

Yeah, I'm still not on board with the idea but maybe I don't see the
vision. Hope other scheduler folks can chime in.

> 
>>
>>>
>>>    This creates a ping-pong effect: the lock holder runs briefly, gets
>>>    preempted before completing critical sections, and the yielding vCPU
>>>    spins again, triggering another futile yield_to() cycle. The overhead
>>>    accumulates rapidly in workloads with high lock contention.
>>>
>>> 2. KVM-side limitation:
>>>
>>>    kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
>>>    directed yield candidate selection. However, it lacks awareness of IPI
>>>    communication patterns. When a vCPU sends an IPI and spins waiting for
>>>    a response (common in inter-processor synchronization), the current
>>>    heuristics often fail to identify the IPI receiver as the yield target.
>>
>> Can't that be solved on the KVM end?
> 
> Yes, the IPI tracking is entirely KVM-side (patches 6-10). The
> scheduler-side debooster (patches 1-5) and KVM-side IPI tracking are
> orthogonal mechanisms:
>    - Debooster: sustains yield_to() preference regardless of *who* is
> yielding to whom
>    - IPI tracking: improves *which* target is selected when a vCPU spins
> 
> Both showed independent gains in our testing, and combined effects
> were approximately additive.

I'll try to look at the KVM bits, but I'm not familiar enough with
them to review them well :)

> 
>> Also shouldn't Patch 6 be on top with a "Fixes:" tag.
> 
> You're right. Patch 6 (last_boosted_vcpu bug fix) is a standalone
> bugfix and should be at the top with a Fixes tag. I'll reorder it in
> v2 with:
> Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to
> use a single for-loop")

Thank you.

> 
>>
>>>
>>>    Instead, the code may boost an unrelated vCPU based on coarse-grained
>>>    preemption state, missing opportunities to accelerate actual IPI
>>>    response handling. This is particularly problematic when the IPI receiver
>>>    is runnable but not scheduled, as lock-holder-detection logic doesn't
>>>    capture the IPI dependency relationship.
>>
>> Are you saying the yield_to() is called with an incorrect target vCPU?
> 
> Yes - more precisely, the issue is in kvm_vcpu_on_spin()'s target
> selection logic before yield_to() is called. Without IPI tracking, it
> relies on preemption state, which doesn't capture "vCPU waiting for
> IPI response from specific other vCPU."
> 
> The IPI tracking records sender→receiver relationships at interrupt
> delivery time (patch 8), enabling kvm_vcpu_on_spin() to directly boost
> the IPI receiver when the sender spins (patch 9). This addresses
> scenarios where the spinning vCPU is waiting for IPI acknowledgment
> rather than lock release.
> 
> Performance (16 pCPU host, 16 vCPUs/VM, PARSEC workloads):
>    - Dedup: +47.1%/+28.1%/+1.7% for 2/3/4 VMs
>    - VIPS: +26.2%/+12.7%/+6.0% for 2/3/4 VMs
> 
> Gains are most pronounced at moderate overcommit where the IPI
> receiver is often runnable but not scheduled.
> 
> Thanks again for the review and suggestions.
> 
> Best regards,
> Wanpeng
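
(While I get up to speed on the KVM side, here is how I'm picturing the
sender->receiver bookkeeping described above - a toy user-space sketch
with made-up names, not the actual patches: record the receiver at IPI
delivery time, and prefer it as the directed-yield candidate when the
sender later spins.)

#include <stdbool.h>
#include <stddef.h>

#define TOY_NO_VCPU     (-1)

struct toy_vcpu {
        int id;
        int last_ipi_receiver;  /* recorded at IPI delivery time */
        bool runnable;
};

/* Called when 'sender' delivers an IPI to 'receiver'. */
void toy_record_ipi(struct toy_vcpu *sender, const struct toy_vcpu *receiver)
{
        sender->last_ipi_receiver = receiver->id;
}

/* Called when 'sender' spins: prefer the recorded IPI receiver if runnable. */
struct toy_vcpu *toy_pick_yield_target(const struct toy_vcpu *sender,
                                       struct toy_vcpu *vcpus, int nr)
{
        for (int i = 0; i < nr; i++) {
                if (vcpus[i].id == sender->last_ipi_receiver &&
                    vcpus[i].runnable)
                        return &vcpus[i];
        }
        return NULL;    /* fall back to the existing heuristics */
}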

-- 
Thanks and Regards,
Prateek

