Message-ID: <f76772c1-7ece-4bc2-a67f-1ba07256604a@amd.com>
Date: Mon, 5 Jan 2026 11:56:51 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Wanpeng Li <kernellwp@...il.com>, Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, "Paolo
 Bonzini" <pbonzini@...hat.com>, Sean Christopherson <seanjc@...gle.com>
CC: Christian Borntraeger <borntraeger@...ux.ibm.com>, Steven Rostedt
	<rostedt@...dmis.org>, Vincent Guittot <vincent.guittot@...aro.org>, "Juri
 Lelli" <juri.lelli@...hat.com>, <linux-kernel@...r.kernel.org>,
	<kvm@...r.kernel.org>, Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for
 oversubscribed KVM

Hello Wanpeng,

On 12/19/2025 9:23 AM, Wanpeng Li wrote:
> Part 1: Scheduler vCPU Debooster (patches 1-5)
> 
> Augment yield_to_task_fair() with bounded vruntime penalties to provide
> sustained preference beyond the buddy mechanism. When a vCPU yields to a
> target, apply a carefully tuned vruntime penalty to the yielding vCPU,
> ensuring the target maintains scheduling advantage for longer periods.

Do you still see the problem after the fixes in commits:

127b90315ca0 ("sched/proxy: Yield the donor task")
79104becf42b ("sched/fair: Forfeit vruntime on yield")

Starting with 79104becf42b, we push the vruntime forward on yield too,
which should prevent the yield loop between vCPUs of the same cgroup
running on the same CPU.

If you have the following cgroup hierarchy:

           root
          /    \
         /      \
        /        \
       A          B
      / \         |
     /   \        |
  vCPU0  vCPU1  vCPU0

and vCPU0(A) yields to vCPU1(A) within the same cgroup, vCPU1 should
start running once vCPU0 has pushed its own vruntime far enough to
become ineligible.
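
To make that concrete, here is a tiny userspace model of A's runqueue.
This is a toy, not kernel code: the real check lives in
entity_eligible()/vruntime_eligible() and works on weighted,
min_vruntime-relative quantities; the weights, numbers and the flat
500-unit push below are made up for illustration.

#include <stdio.h>

struct entity {
	long long weight;
	long long vruntime;
};

/* EEVDF: an entity is eligible iff its vruntime <= weighted average V. */
static int eligible(const struct entity *e, const struct entity *q, int n)
{
	long long wv = 0, w = 0;

	for (int i = 0; i < n; i++) {
		wv += q[i].weight * q[i].vruntime;
		w += q[i].weight;
	}
	return e->vruntime * w <= wv;
}

int main(void)
{
	struct entity q[2] = {
		{ .weight = 1024, .vruntime = 1000 },	/* vCPU0 */
		{ .weight = 1024, .vruntime = 1200 },	/* vCPU1 */
	};

	printf("before: vCPU0 %seligible, vCPU1 %seligible\n",
	       eligible(&q[0], q, 2) ? "" : "in",
	       eligible(&q[1], q, 2) ? "" : "in");

	/* vCPU0 yields: forfeit vruntime by pushing it past the average. */
	q[0].vruntime += 500;

	printf("after:  vCPU0 %seligible, vCPU1 %seligible\n",
	       eligible(&q[0], q, 2) ? "" : "in",
	       eligible(&q[1], q, 2) ? "" : "in");
	return 0;
}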

If you have vCPUs across different cgroups with CPU controllers enabled,
I hope you have a very good reason for such a setup because otherwise,
this is just too much complexity for some theoretical, insane
deployment.

> 
> The mechanism is EEVDF-aware and cgroup-hierarchy-aware:
> 
> - Locate the lowest common ancestor (LCA) in the cgroup hierarchy where
>   both the yielding and target tasks coexist. This ensures vruntime
>   adjustments occur at the correct hierarchy level, maintaining fairness
>   across cgroup boundaries.
> 
> - Update EEVDF scheduler fields (vruntime, deadline) atomically to keep
>   the scheduler state consistent. Note that vlag is intentionally not
>   modified as it will be recalculated on dequeue/enqueue cycles. The
>   penalty shifts the yielding task's virtual deadline forward, allowing
>   the target to run.
> 
> - Apply queue-size-adaptive penalties that scale from 6.0x scheduling
>   granularity for 2-task scenarios (strong preference) down to 1.0x for
>   large queues (>12 tasks), balancing preference against starvation risks.
> 
> - Implement reverse-pair debouncing: when task A yields to B, then B yields
>   to A within a short window (~600us), downscale the penalty to prevent
>   ping-pong oscillation.
> 
> - Rate-limit penalty application to 6ms intervals to prevent pathological
>   overhead when yields occur at very high frequency.
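
For reference, the penalty selection described above can be sketched as
a userspace model. The 6.0x..1.0x endpoints, the ~600us debounce window
and the 6ms rate limit are quoted from the cover letter; the linear
interpolation shape, the /4 downscale factor and every name below are
assumptions, not the actual patch.

#include <stdio.h>

#define NSEC_PER_USEC		1000ULL
#define NSEC_PER_MSEC		1000000ULL
#define SCHED_GRAN_NS		(750ULL * NSEC_PER_USEC)	/* stand-in for base slice */
#define DEBOUNCE_WINDOW_NS	(600ULL * NSEC_PER_USEC)
#define RATELIMIT_NS		(6ULL * NSEC_PER_MSEC)

static unsigned long long yield_penalty(unsigned int nr_queued,
					unsigned long long now_ns,
					unsigned long long last_penalty_ns,
					unsigned long long last_reverse_ns)
{
	unsigned long long penalty;
	unsigned int scale_x10;

	/* Rate-limit: at most one penalty per 6ms. */
	if (now_ns - last_penalty_ns < RATELIMIT_NS)
		return 0;

	/* 6.0x gran at 2 tasks, scaling down to 1.0x at >= 12 (linear assumed). */
	if (nr_queued <= 2)
		scale_x10 = 60;
	else if (nr_queued >= 12)
		scale_x10 = 10;
	else
		scale_x10 = 60 - (nr_queued - 2) * 5;

	penalty = SCHED_GRAN_NS * scale_x10 / 10;

	/* Reverse-pair debounce: B yielded back to A within the window. */
	if (now_ns - last_reverse_ns < DEBOUNCE_WINDOW_NS)
		penalty /= 4;	/* downscale factor assumed */

	return penalty;
}

int main(void)
{
	for (unsigned int n = 2; n <= 13; n++)
		printf("nr_queued=%2u penalty=%llu ns\n",
		       n, yield_penalty(n, RATELIMIT_NS, 0, DEBOUNCE_WINDOW_NS));
	return 0;
}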

I still don't like all this complexity. How much better is it than doing
something like this:

  (Only build tested)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7377f9117501..fbb263ea7d5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9079,6 +9079,7 @@ static void yield_task_fair(struct rq *rq)
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	unsigned long weight;
 
 	/* !se->on_rq also covers throttled task */
 	if (!se->on_rq)
@@ -9089,6 +9090,32 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 
 	yield_task_fair(rq);
 
+	se = &rq->donor->se;
+	weight = se->load.weight;
+
+	/* Proportionally yield the hierarchy. */
+	while ((se = parent_entity(se))) {
+		unsigned long gcfs_rq_weight = group_cfs_rq(se)->load.weight;
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		WARN_ON_ONCE(se != cfs_rq->curr);
+		update_curr(cfs_rq);
+
+		/* Don't yield beyond the point of ineligibility. */
+		if (!entity_eligible(cfs_rq, se))
+			break;
+		/*
+		 * Proportionally increase the vruntime based on the slice
+		 * and the weight of the yielding subtree.
+		 */
+		se->vruntime += div_u64(calc_delta_fair(se->slice, se) * weight, gcfs_rq_weight);
+		update_deadline(cfs_rq, se);
+
+	/* Update the proportional weight of the task on the parent hierarchy. */
+		weight = (se->load.weight * weight) / gcfs_rq_weight;
+		if (!weight)
+			break;
+	}
 	return true;
 }
 
base-commit: 6ab7973f254071faf20fe5fcc502a3fe9ca14a47
---

Prepared on top of tip:sched/core. I don't like the above either, and
I'm 90% sure commit 79104becf42b ("sched/fair: Forfeit vruntime on
yield") will solve the problem you are seeing.
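
As a worked example of the proportional propagation in the sketch above
(the 1024/2048 figures are made up): if the yielding vCPU has
load.weight 1024 and group A's cfs_rq carries a total load.weight of
2048, A's entity gets half of calc_delta_fair(se->slice, se) added to
its vruntime, and the weight carried to the next level shrinks to
se->load.weight * 1024 / 2048, so a single yielding task in a busy
group perturbs the upper levels of the hierarchy proportionally less.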

> Performance Results
> -------------------
> 
> Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM
> 
> Dbench 16 clients per VM (filesystem metadata operations):
>   2 VMs: +14.4% throughput (lock contention reduction)
>   3 VMs:  +9.8% throughput
>   4 VMs:  +6.7% throughput
> 

And what does the cgroup hierarchy look like for these tests?

-- 
Thanks and Regards,
Prateek

