Message-ID: <CANRm+Cy2O9j_itDmJcAwUebV2h=2hvfZxuxtHqKD-vF1XohGAw@mail.gmail.com>
Date: Thu, 13 Nov 2025 16:33:54 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
oversubscribed KVM
Hi Prateek,
On Thu, 13 Nov 2025 at 12:42, K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Wanpeng,
>
> On 11/12/2025 10:24 AM, Wanpeng Li wrote:
> >>
> >> ( Only build and boot tested on top of
> >> git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> >> at commit f82a0f91493f "sched/deadline: Minor cleanup in
> >> select_task_rq_dl()" )
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b4617d631549..87560f5a18b3 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> >> * which yields immediately again; without the condition the vruntime
> >> * ends up quickly running away.
> >> */
> >> - if (entity_eligible(cfs_rq, se)) {
> >> + do {
> >> + cfs_rq = cfs_rq_of(se);
> >> +
> >> + /*
> >> + * Another entity will be selected at next pick.
> >> + * Single entity on cfs_rq can never be ineligible.
> >> + */
> >> + if (!entity_eligible(cfs_rq, se))
> >> + break;
> >> +
> >> se->vruntime = se->deadline;
> >
> > Setting vruntime = deadline zeros out lag. Does this cause fairness
> > drift with repeated yields? We explicitly recalculate vlag after
> > adjustment to preserve EEVDF invariants.
>
> We only push deadline when the entity is eligible. Ineligible entity
> will break out above. Also I don't get how adding a penalty to an
> entity in the cgroup hierarchy of the yielding task when there are
> other runnable tasks considered as "preserve(ing) EEVDF invariants".
Our penalty preserves the EEVDF invariants by recalculating all of the
affected scheduler state:
se->vruntime = new_vruntime;
se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
This is the same update pattern used in update_curr(). The EEVDF
relationship lag = (V - v) * w remains valid: vlag simply becomes more
negative as vruntime is pushed forward. The presence of other runnable
tasks doesn't affect the correctness of this update; each entity's lag
is computed independently relative to avg_vruntime.
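To make that concrete, here is a minimal sketch of the penalty
application written as a helper; apply_yield_penalty() and its
new_vruntime argument are made-up names for illustration, while
calc_delta_fair(), avg_vruntime() and update_min_vruntime() are the
existing fair.c helpers:

static void apply_yield_penalty(struct cfs_rq *cfs_rq,
                                struct sched_entity *se, u64 new_vruntime)
{
        /* Advance the yielder's vruntime by the computed penalty. */
        se->vruntime = new_vruntime;

        /* Re-derive the deadline from the new vruntime, as update_curr() does. */
        se->deadline = se->vruntime + calc_delta_fair(se->slice, se);

        /* Recompute lag against the zero-lag point V so lag = V - v still holds. */
        se->vlag = avg_vruntime(cfs_rq) - se->vruntime;

        /* Keep cfs_rq->min_vruntime consistent after the adjustment. */
        update_min_vruntime(cfs_rq);
}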
>
> >
> >> se->deadline += calc_delta_fair(se->slice, se);
> >> - }
> >> +
> >> + /*
> >> + * If we have more than one runnable task queued below
> >> + * this cfs_rq, the next pick will likely go for a
> >> + * different entity now that we have advanced the
> >> + * vruntime and the deadline of the running entity.
> >> + */
> >> + if (cfs_rq->h_nr_runnable > 1)
> >
> > Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> > correctly. Shouldn't the penalty apply at the LCA of yielder and
> > target? Otherwise the vruntime adjustment might not affect the level
> > where they actually compete.
>
> So here is the case I'm going after - consider the following
> hierarchy:
>
> root
> / \
> CG0 CG1
> | |
> A B
>
> CG* are cgroups and [A-Z]* are tasks
>
> A decides to yield to B, and advances its deadline on CG0's timeline.
> Currently, if CG0 is eligible and CG1 isn't, pick will still select
> CG0 which will in turn select task A and it'll yield again. This
> cycle repeats until the vruntime of CG0 grows large enough to make itself
> ineligible and route the EEVDF pick to CG1.
Yes, natural convergence works, but it requires multiple yield cycles.
Your h_nr_runnable > 1 check stops propagation once another entity
might be picked, but whether it actually is picked depends on vruntime
ordering that takes time to develop. Our penalty forces immediate
ineligibility at the LCA: one penalty application instead of N natural
yield cycles.
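As an illustration only (not the posted series), forcing that
ineligibility at the level where yielder and target actually compete
could look roughly like the sketch below, reusing the
apply_yield_penalty() helper sketched earlier; find_matching_se() is
the existing fair.c helper that walks two entities up to a common
cfs_rq, and pushing one slice past V is just one possible choice of
penalty:

static void yield_penalty_at_lca(struct sched_entity *se, struct sched_entity *pse)
{
        struct cfs_rq *cfs_rq;

        /* Walk yielder (se) and target (pse) up to the cfs_rq they share. */
        find_matching_se(&se, &pse);
        cfs_rq = cfs_rq_of(se);

        /*
         * Push the yielder's ancestor past the zero-lag point V so it is
         * ineligible at the very next pick on that timeline.
         */
        apply_yield_penalty(cfs_rq, se,
                            avg_vruntime(cfs_rq) + calc_delta_fair(se->slice, se));
}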
>
> Now consider:
>
>
> root
> / \
> CG0 CG1
> / \ |
> A C B
>
> Same scenario: A yields to B. A advances its vruntime and deadline
> as a part of the yield. Now, why should CG0 sacrifice its fair share of
> runtime for A when task B is runnable? Just because one task decided
> to yield to another task in a different cgroup doesn't mean other
> waiting tasks in that hierarchy should suffer.
You're right that C suffers unfairly if it's independent work. This is
a known tradeoff. The rationale: when A spins on B's lock, we apply
the penalty at the LCA (root in your example) because that is where A
and B compete, which ensures B gets scheduled. The side effect is that
C loses CPU time even though it's not involved in the dependency. In
practice, VMs typically put all vCPUs in one cgroup, so no independent
C exists. If C exists and is affected by the same lock, the penalty
helps overall progress. If C is truly independent, it loses roughly
one scheduling slice worth of CPU time.
>
> >
> >> + break;
> >> + } while ((se = parent_entity(se)));
> >> }
> >>
> >> static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> >> ---
> >
> > Fixed one-slice penalties underperformed in our testing (dbench:
> > +14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
> > down to 1.0× based on queue size) necessary to balance effectiveness
> > against starvation.
>
> If all vCPUs of a VM are in the same cgroup - yield_to() should work
> just fine. If this "target" task is not selected, then either some
> entity in its hierarchy or the task itself is ineligible, and the
> EEVDF pick has decided to go with something else.
>
> It is not "starvation" but rather you've received your fair share
> of "proportional runtime" and now you wait. If you really want to
> follow EEVDF maybe you compute the vlag and if it is behind the
> avg_vruntime, you account it to the "target" task - that would be
> in the spirit of the EEVDF algorithm.
You're right about the terminology: it's priority inversion, not
starvation. On crediting the target: it's philosophically interesting
but has practical issues. 1) It only helps if the target's vlag < 0
(already lagging); if the lock holder is ahead (vlag > 0), it has no
effect. 2) It doesn't prevent the yielder from being re-picked at the
LCA if it is still the most eligible entity. Accounting-wise, the
spinner consumes real CPU cycles and our penalty charges that
consumption; crediting the target grants it service it never received,
which is arguably less consistent with proportional fairness.
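For reference, a rough sketch of the adaptive scaling mentioned earlier
in the thread, i.e. scaling the penalty down as the queue grows so the
yielder isn't pushed too far back; the helper name, thresholds and
factors below are illustrative only, not the actual series:

static u64 yield_penalty_delta(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        u64 slice = calc_delta_fair(se->slice, se);

        /* Short queues need a strong push to dislodge the spinning yielder. */
        if (cfs_rq->h_nr_runnable <= 2)
                return 6 * slice;

        /* Mid-sized queues get a smaller boost. */
        if (cfs_rq->h_nr_runnable <= 4)
                return 3 * slice;

        /* Long queues already rotate; one slice avoids starving the yielder. */
        return slice;
}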
Regards,
Wanpeng