[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CANRm+Cxrk2XEn+uVTv2=-1T101npyg4eOmedG_fehqFBVjJRag@mail.gmail.com>
Date: Thu, 13 Nov 2025 21:56:34 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
oversubscribed KVM
Hi Prateek,
On Thu, 13 Nov 2025 at 17:48, K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Wanpeng,
>
> On 11/13/2025 2:03 PM, Wanpeng Li wrote:
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index b4617d631549..87560f5a18b3 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> >>>> * which yields immediately again; without the condition the vruntime
> >>>> * ends up quickly running away.
> >>>> */
> >>>> - if (entity_eligible(cfs_rq, se)) {
> >>>> + do {
> >>>> + cfs_rq = cfs_rq_of(se);
> >>>> +
> >>>> + /*
> >>>> + * Another entity will be selected at next pick.
> >>>> + * Single entity on cfs_rq can never be ineligible.
> >>>> + */
> >>>> + if (!entity_eligible(cfs_rq, se))
> >>>> + break;
> >>>> +
> >>>> se->vruntime = se->deadline;
> >>>
> >>> Setting vruntime = deadline zeros out lag. Does this cause fairness
> >>> drift with repeated yields? We explicitly recalculate vlag after
> >>> adjustment to preserve EEVDF invariants.
> >>
> >> We only push deadline when the entity is eligible. Ineligible entity
> >> will break out above. Also I don't get how adding a penalty to an
> >> entity in the cgroup hierarchy of the yielding task when there are
> >> other runnable tasks considered as "preserve(ing) EEVDF invariants".
> >
> > Our penalty preserves EEVDF invariants by recalculating all scheduler state:
> > se->vruntime = new_vruntime;
> > se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> > se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> > update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
>
> So your exact implementation in yield_deboost_apply_penalty() is:
>
> > + new_vruntime = se_y_lca->vruntime + penalty;
> > +
> > + /* Validity check */
> > + if (new_vruntime <= se_y_lca->vruntime)
> > + return;
> > +
> > + se_y_lca->vruntime = new_vruntime;
>
> You've updated this vruntime to something that you've seen fit based on
> your performance data - better performance is not necessarily fair.
>
> update_curr() uses:
>
> /* Time elapsed. */
> delta_exec = now - se->exec_start;
> se->exec_start = now;
>
> curr->vruntime += calc_delta_fair(delta_exec, curr);
>
>
> "delta_exec" is based on the amount of time entity has run as opposed
> to the penalty calculation which simply advances the vruntime by half a
> slice because someone in the hierarchy decided to yield.
CFS already separates time accounting from policy enforcement.
place_entity() modifies vruntime based on lag without time
passage—it's placement policy, not time accounting. Similarly,
yield_task_fair() advances the deadline without consuming time—policy
to trigger reschedule. Our penalty follows this established pattern:
bounded vruntime adjustment to implement yield_to() semantics in
hierarchical scheduling. Time accounting ( update_curr ) and
scheduling policy (placement, yielding, penalties) are distinct
mechanisms in CFS.
>
> Also assume the vCPU yielding and the target is on the same cgroup -
> you'll advance the vruntime of task in yield_deboost_apply_penalty() and
> then again in yield_task_fair()?
This is deliberate. When tasks share the same cgroup, they need both
hierarchy-level and leaf-level adjustments.
yield_deboost_apply_penalty() positions the task in cgroup timeline
(affects picking at that level), while yield_task_fair() advances the
deadline (triggers immediate reschedule). Without both, same-cgroup
yield loses effectiveness—the task would be repicked despite yielding.
The double adjustment ensures yield works at both the task level and
across hierarchy levels. This matches CFS's multi-level scheduling
philosophy.
>
>
> > + se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> > + se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
>
> There is no point in setting vlag for a running entity
Maintaining invariants when modifying scheduler state is standard
practice throughout fair.c. reweight_entity() updates vlag for curr
when changing weights to preserve the lag relationship. We follow the
same principle—when artificially advancing vruntime, recalculate vlag
to maintain vlag = V - v . This prevents inconsistency when the entity
later dequeues. It's defensive correctness at negligible cost. The
alternative—leaving vlag stale—risks subtle bugs when scheduler state
assumptions are violated.
>
> > + update_min_vruntime(cfs_rq_common);
>
> > This is the same update pattern used in update_curr(). The EEVDF
> > relationship lag = (V - v) * w remains valid—vlag becomes more
> > negative as vruntime increases.
>
> Sure "V" just moves to the new avg_vruntime() to give the 0-lag
> point but modifying the vruntime arbitrarily doesn't seem fair to
> me.
yield_to() API explicitly requests directed unfairness. CFS already
implements unfairness mechanisms: nice values, cgroup weights,
set_next_buddy() immediate preference. Without our mechanism,
yield_to() silently fails across cgroups—buddy hints vanish at
hierarchy boundaries where EEVDF makes independent decisions. We make
the documented API functional. The real question: should yield_to()
work in production environments (nested cgroups)? If yes, vruntime
adjustment is necessary. If not, deprecate the API.
>
> > The presence of other runnable tasks
> > doesn't affect the mathematical correctness; each entity's lag is
> > computed independently relative to avg_vruntime.
> >
> >>
> >>>
> >>>> se->deadline += calc_delta_fair(se->slice, se);
> >>>> - }
> >>>> +
> >>>> + /*
> >>>> + * If we have more than one runnable task queued below
> >>>> + * this cfs_rq, the next pick will likely go for a
> >>>> + * different entity now that we have advanced the
> >>>> + * vruntime and the deadline of the running entity.
> >>>> + */
> >>>> + if (cfs_rq->h_nr_runnable > 1)
> >>>
> >>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> >>> correctly. Shouldn't the penalty apply at the LCA of yielder and
> >>> target? Otherwise the vruntime adjustment might not affect the level
> >>> where they actually compete.
> >>
> >> So here is the case I'm going after - consider the following
> >> hierarchy:
> >>
> >> root
> >> / \
> >> CG0 CG1
> >> | |
> >> A B
> >>
> >> CG* are cgroups and, [A-Z]* are tasks
> >>
> >> A decides to yield to B, and advances its deadline on CG0's timeline.
> >> Currently, if CG0 is eligible and CG1 isn't, pick will still select
> >> CG0 which will in-turn select task A and it'll yield again. This
> >> cycle repeates until vruntime of CG0 turns large enough to make itself
> >> ineligible and route the EEVDF pick to CG1.
> >
> > Yes, natural convergence works, but requires multiple cycles. Your
> > h_nr_runnable > 1 stops propagation when another entity might be
> > picked, but "might" depends on vruntime ordering which needs time to
> > develop. Our penalty forces immediate ineligibility at the LCA. One
> > penalty application vs N natural yield cycles.
> >
> >>
> >> Now consider:
> >>
> >>
> >> root
> >> / \
> >> CG0 CG1
> >> / \ |
> >> A C B
> >>
> >> Same scenario: A yields to B. A advances its vruntime and deadline
> >> as a prt of yield. Now, why should CG0 sacrifice its fair share of
> >> runtime for A when task B is runnable? Just because one task decided
> >> to yield to another task in a different cgroup doesn't mean other
> >> waiting tasks on that hierarchy suffer.
> >
> > You're right that C suffers unfairly if it's independent work. This is
> > a known tradeoff.
>
> So KVM is only one of the user of yield_to(). This whole debouncer
> infrastructure seems to be over complicating all this. If anything
> is yielding across cgroup boundary - that seems like bad
> configuration and if necessary, the previous suggestion does stuff
> fairly. I don't mind accounting the lost time in
> yield_to_task_fair() and account it to target task but apart from
> that, I don't think any of it is "fair".
Time-transfer fails fundamentally: lock holders often have higher
vruntime (ran more), so crediting them backwards doesn't change EEVDF
pick order. Our penalty pushes yielder back—effective regardless. The
infrastructure addresses real measured problems: rate limiting
prevents overhead, debounce stops ping-pong accumulation, LCA
targeting fixes hierarchy picking. Nested cgroups are production
standard (systemd, containers, cloud)—not misconfiguration.
Performance gains prove yield_to was broken. Open to simplifications,
but they must actually solve the hierarchical scheduling problem.
Regards,
Wanpeng
Powered by blists - more mailing lists