linux-kernel - Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM

Message-ID: <CANRm+Cxrk2XEn+uVTv2=-1T101npyg4eOmedG_fehqFBVjJRag@mail.gmail.com>
Date: Thu, 13 Nov 2025 21:56:34 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>, 
	Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, 
	Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
 oversubscribed KVM

Hi Prateek,

On Thu, 13 Nov 2025 at 17:48, K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Wanpeng,
>
> On 11/13/2025 2:03 PM, Wanpeng Li wrote:
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index b4617d631549..87560f5a18b3 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> >>>>          * which yields immediately again; without the condition the vruntime
> >>>>          * ends up quickly running away.
> >>>>          */
> >>>> -       if (entity_eligible(cfs_rq, se)) {
> >>>> +       do {
> >>>> +               cfs_rq = cfs_rq_of(se);
> >>>> +
> >>>> +               /*
> >>>> +                * Another entity will be selected at next pick.
> >>>> +                * Single entity on cfs_rq can never be ineligible.
> >>>> +                */
> >>>> +               if (!entity_eligible(cfs_rq, se))
> >>>> +                       break;
> >>>> +
> >>>>                 se->vruntime = se->deadline;
> >>>
> >>> Setting vruntime = deadline zeros out lag. Does this cause fairness
> >>> drift with repeated yields? We explicitly recalculate vlag after
> >>> adjustment to preserve EEVDF invariants.
> >>
> >> We only push deadline when the entity is eligible. Ineligible entity
> >> will break out above. Also I don't get how adding a penalty to an
> >> entity in the cgroup hierarchy of the yielding task when there are
> >> other runnable tasks considered as "preserve(ing) EEVDF invariants".
> >
> > Our penalty preserves EEVDF invariants by recalculating all scheduler state:
> >    se->vruntime = new_vruntime;
> >    se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> >    se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> >    update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
>
> So your exact implementation in yield_deboost_apply_penalty() is:
>
> > +     new_vruntime = se_y_lca->vruntime + penalty;
> > +
> > +     /* Validity check */
> > +     if (new_vruntime <= se_y_lca->vruntime)
> > +             return;
> > +
> > +     se_y_lca->vruntime = new_vruntime;
>
> You've updated this vruntime to something that you've seen fit based on
> your performance data - better performance is not necessarily fair.
>
> update_curr() uses:
>
>     /* Time elapsed. */
>     delta_exec = now - se->exec_start;
>     se->exec_start = now;
>
>     curr->vruntime += calc_delta_fair(delta_exec, curr);
>
>
> "delta_exec" is based on the amount of time entity has run as opposed
> to the penalty calculation which simply advances the vruntime by half a
> slice because someone in the hierarchy decided to yield.

CFS already separates time accounting from policy enforcement.
place_entity() modifies vruntime based on lag without any time
passing; it is placement policy, not time accounting. Similarly,
yield_task_fair() advances the deadline without consuming time, as a
policy decision to trigger a reschedule. Our penalty follows this
established pattern: a bounded vruntime adjustment that implements
yield_to() semantics in hierarchical scheduling. Time accounting
(update_curr()) and scheduling policy (placement, yielding,
penalties) are distinct mechanisms in CFS.
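
To make the distinction concrete, here is a minimal userspace sketch,
not kernel code. The toy_* names, the simplified weight scaling, and
the half-slice penalty size (taken from the description quoted above)
are assumptions for illustration only. It contrasts a vruntime advance
derived from elapsed runtime with one applied as a policy adjustment:

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's nice-0 weight. */
#define TOY_NICE_0_LOAD 1024ULL

/* Toy entity: only the fields needed for this illustration. */
struct toy_se {
	uint64_t vruntime;   /* virtual runtime, ns */
	uint64_t exec_start; /* timestamp of last accounting, ns */
	uint64_t slice;      /* requested slice, ns */
	uint64_t weight;     /* load weight */
};

/* Simplified weight scaling: delta * NICE_0_LOAD / weight. */
static uint64_t toy_calc_delta_fair(uint64_t delta, const struct toy_se *se)
{
	return delta * TOY_NICE_0_LOAD / se->weight;
}

/* Time accounting: vruntime advances by the time actually run. */
static void toy_update_curr(struct toy_se *se, uint64_t now)
{
	uint64_t delta_exec = now - se->exec_start;

	se->exec_start = now;
	se->vruntime += toy_calc_delta_fair(delta_exec, se);
}

/*
 * Policy adjustment: vruntime advances by a bounded amount (half a
 * slice here, an assumption) and exec_start is left untouched because
 * no runtime was consumed.
 */
static void toy_apply_penalty(struct toy_se *se)
{
	se->vruntime += toy_calc_delta_fair(se->slice / 2, se);
}
```

In this model both paths move vruntime forward, but only
toy_update_curr() ties the amount to elapsed time.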

>
> Also assume the vCPU yielding and the target is on the same cgroup -
> you'll advance the vruntime of task in yield_deboost_apply_penalty() and
> then again in yield_task_fair()?

This is deliberate. When tasks share the same cgroup, they need both
hierarchy-level and leaf-level adjustments.
yield_deboost_apply_penalty() positions the task on the cgroup's
timeline (which affects picking at that level), while
yield_task_fair() advances the deadline (which triggers an immediate
reschedule). Without both, a same-cgroup yield loses effectiveness:
the task would be picked again despite yielding. The double adjustment
ensures that yield works both at the task level and across hierarchy
levels. This matches CFS's multi-level scheduling philosophy.

>
>
> > +     se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> > +     se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
>
> There is no point in setting vlag for a running entity

Maintaining invariants when modifying scheduler state is standard
practice throughout fair.c. reweight_entity() updates vlag for curr
when changing weights to preserve the lag relationship. We follow the
same principle: when artificially advancing vruntime, recalculate vlag
to maintain vlag = V - v. This prevents inconsistency when the entity
is later dequeued. It is defensive correctness at negligible cost. The
alternative, leaving vlag stale, risks subtle bugs when assumptions
about scheduler state are violated.
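
As a rough illustration of that invariant, the following userspace
sketch (toy_* names are hypothetical, floating point is used only to
keep the arithmetic readable, and none of this is the kernel
implementation) recomputes vlag = V - v after a vruntime bump and
checks that the weighted lags still sum to zero around the new
average:

```c
#include <stddef.h>

/* Toy entity for the lag illustration; not the kernel's sched_entity. */
struct toy_lag_entity {
	double vruntime;
	double weight;
	double vlag;	/* V - v, refreshed by toy_penalize() */
};

/* Weighted average vruntime: V = sum(w_i * v_i) / sum(w_i). */
static double toy_avg_vruntime(const struct toy_lag_entity *es, size_t n)
{
	double sum_wv = 0.0, sum_w = 0.0;

	for (size_t i = 0; i < n; i++) {
		sum_wv += es[i].weight * es[i].vruntime;
		sum_w += es[i].weight;
	}
	return sum_wv / sum_w;
}

/* Advance one entity's vruntime, then recompute its vlag against the new V. */
static void toy_penalize(struct toy_lag_entity *es, size_t n, size_t idx,
			 double penalty)
{
	es[idx].vruntime += penalty;
	es[idx].vlag = toy_avg_vruntime(es, n) - es[idx].vruntime;
}

/* Sum of w_i * (V - v_i); zero by construction of V. */
static double toy_weighted_lag_sum(const struct toy_lag_entity *es, size_t n)
{
	double V = toy_avg_vruntime(es, n);
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += es[i].weight * (V - es[i].vruntime);
	return sum;
}
```

With three entities at vruntime 100, 200 and 300 (weights 1, 1, 2), V
is 225. Penalizing the first by 40 moves V to 235 and its vlag to 95,
and the weighted sum stays at zero.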

>
> > +     update_min_vruntime(cfs_rq_common);
>
> > This is the same update pattern used in update_curr(). The EEVDF
> > relationship lag = (V - v) * w remains valid—vlag becomes more
> > negative as vruntime increases.
>
> Sure "V" just moves to the new avg_vruntime() to give the 0-lag
> point but modifying the vruntime arbitrarily doesn't seem fair to
> me.

The yield_to() API explicitly requests directed unfairness. CFS
already implements unfairness mechanisms: nice values, cgroup weights,
and the immediate preference given by set_next_buddy(). Without our
mechanism, yield_to() silently fails across cgroups: buddy hints
vanish at hierarchy boundaries, where EEVDF makes independent
decisions at each level. We make the documented API functional. The
real question is whether yield_to() should work in production
environments (nested cgroups). If yes, vruntime adjustment is
necessary. If not, the API should be deprecated.

>
> > The presence of other runnable tasks
> > doesn't affect the mathematical correctness; each entity's lag is
> > computed independently relative to avg_vruntime.
> >
> >>
> >>>
> >>>>                 se->deadline += calc_delta_fair(se->slice, se);
> >>>> -       }
> >>>> +
> >>>> +               /*
> >>>> +                * If we have more than one runnable task queued below
> >>>> +                * this cfs_rq, the next pick will likely go for a
> >>>> +                * different entity now that we have advanced the
> >>>> +                * vruntime and the deadline of the running entity.
> >>>> +                */
> >>>> +               if (cfs_rq->h_nr_runnable > 1)
> >>>
> >>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> >>> correctly. Shouldn't the penalty apply at the LCA of yielder and
> >>> target? Otherwise the vruntime adjustment might not affect the level
> >>> where they actually compete.
> >>
> >> So here is the case I'm going after - consider the following
> >> hierarchy:
> >>
> >>      root
> >>     /    \
> >>   CG0   CG1
> >>    |     |
> >>    A     B
> >>
> >>   CG* are cgroups and, [A-Z]* are tasks
> >>
> >> A decides to yield to B, and advances its deadline on CG0's timeline.
> >> Currently, if CG0 is eligible and CG1 isn't, pick will still select
> >> CG0 which will in-turn select task A and it'll yield again. This
> >> cycle repeats until vruntime of CG0 turns large enough to make itself
> >> ineligible and route the EEVDF pick to CG1.
> >
> > Yes, natural convergence works, but requires multiple cycles. Your
> > h_nr_runnable > 1 stops propagation when another entity might be
> > picked, but "might" depends on vruntime ordering which needs time to
> > develop. Our penalty forces immediate ineligibility at the LCA. One
> > penalty application vs N natural yield cycles.
> >
> >>
> >> Now consider:
> >>
> >>
> >>        root
> >>       /    \
> >>     CG0   CG1
> >>    /   \   |
> >>   A     C  B
> >>
> >> Same scenario: A yields to B. A advances its vruntime and deadline
> >> as a part of yield. Now, why should CG0 sacrifice its fair share of
> >> runtime for A when task B is runnable? Just because one task decided
> >> to yield to another task in a different cgroup doesn't mean other
> >> waiting tasks on that hierarchy suffer.
> >
> > You're right that C suffers unfairly if it's independent work. This is
> > a known tradeoff.
>
> So KVM is only one of the users of yield_to(). This whole debouncer
> infrastructure seems to be overcomplicating all this. If anything
> is yielding across cgroup boundary - that seems like bad
> configuration and if necessary, the previous suggestion does stuff
> fairly. I don't mind accounting the lost time in
> yield_to_task_fair() and account it to target task but apart from
> that, I don't think any of it is "fair".

Time transfer fails fundamentally: lock holders often have a higher
vruntime (they ran more), so crediting them backwards does not change
the EEVDF pick order. Our penalty pushes the yielder back, which is
effective regardless. The infrastructure addresses real, measured
problems: rate limiting bounds the overhead, debounce stops ping-pong
accumulation, and LCA targeting fixes picking across the hierarchy.
Nested cgroups are the production standard (systemd, containers,
cloud), not a misconfiguration. The performance gains indicate that
yield_to() was not taking effect in these setups. We are open to
simplifications, but they must actually solve the hierarchical
scheduling problem.
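
A toy two-entity model of the point about pick order, under stated
assumptions: equal weights, eligibility defined as v <= V, and
hypothetical toy_* names. It is a sketch of the arithmetic, not of the
patch, and it does not model deadlines:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy entity competing at one level of the hierarchy. */
struct toy_entity {
	int64_t vruntime;
	int64_t weight;
};

/*
 * Eligible when v_idx <= V, with V = sum(w_i * v_i) / sum(w_i).
 * Multiplying through by sum(w_i) avoids the division:
 *   v_idx * sum(w_i) <= sum(w_i * v_i)
 */
static bool toy_eligible(const struct toy_entity *es, size_t n, size_t idx)
{
	int64_t sum_wv = 0, sum_w = 0;

	for (size_t i = 0; i < n; i++) {
		sum_wv += es[i].weight * es[i].vruntime;
		sum_w += es[i].weight;
	}
	return es[idx].vruntime * sum_w <= sum_wv;
}
```

With the yielder at vruntime 100 and the target at 120, only the
yielder is eligible. Crediting the target by 5 (to 115) leaves it
ineligible, while advancing the yielder by 30 (to 130) makes the
target the only eligible entity.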

Regards,
Wanpeng