Message-ID: <34b2d375-1535-41c1-9ec4-bb054641abd5@amd.com>
Date: Thu, 13 Nov 2025 15:18:30 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Wanpeng Li <kernellwp@...il.com>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>, Steven Rostedt
<rostedt@...dmis.org>, Vincent Guittot <vincent.guittot@...aro.org>, "Juri
Lelli" <juri.lelli@...hat.com>, <linux-kernel@...r.kernel.org>,
<kvm@...r.kernel.org>, Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
oversubscribed KVM
Hello Wanpeng,
On 11/13/2025 2:03 PM, Wanpeng Li wrote:
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b4617d631549..87560f5a18b3 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>>>> * which yields immediately again; without the condition the vruntime
>>>> * ends up quickly running away.
>>>> */
>>>> - if (entity_eligible(cfs_rq, se)) {
>>>> + do {
>>>> + cfs_rq = cfs_rq_of(se);
>>>> +
>>>> + /*
>>>> + * Another entity will be selected at next pick.
>>>> + * Single entity on cfs_rq can never be ineligible.
>>>> + */
>>>> + if (!entity_eligible(cfs_rq, se))
>>>> + break;
>>>> +
>>>> se->vruntime = se->deadline;
>>>
>>> Setting vruntime = deadline zeros out lag. Does this cause fairness
>>> drift with repeated yields? We explicitly recalculate vlag after
>>> adjustment to preserve EEVDF invariants.
>>
>> We only push the deadline when the entity is eligible. An ineligible
>> entity will break out above. Also, I don't get how adding a penalty to
>> an entity in the cgroup hierarchy of the yielding task, when there are
>> other runnable tasks, is considered "preserve(ing) EEVDF invariants".
>
> Our penalty preserves EEVDF invariants by recalculating all scheduler state:
> se->vruntime = new_vruntime;
> se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
So your exact implementation in yield_deboost_apply_penalty() is:
> + new_vruntime = se_y_lca->vruntime + penalty;
> +
> + /* Validity check */
> + if (new_vruntime <= se_y_lca->vruntime)
> + return;
> +
> + se_y_lca->vruntime = new_vruntime;
You've updated this vruntime to a value that you saw fit based on
your performance data, but better performance is not necessarily fair.
update_curr() uses:
/* Time elapsed. */
delta_exec = now - se->exec_start;
se->exec_start = now;
curr->vruntime += calc_delta_fair(delta_exec, curr);
"delta_exec" is based on the amount of time the entity has actually run,
as opposed to the penalty calculation, which simply advances the vruntime
by half a slice because someone in the hierarchy decided to yield.
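
To make the contrast concrete, here is a minimal userspace sketch. It is
a toy model, not kernel code: toy_calc_delta_fair() is a simplified
stand-in for calc_delta_fair() (the real one uses fixed-point inverse
weights), and the toy_charge_*() helpers are hypothetical names made up
for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL	/* weight of a nice-0 task in this toy model */

/* Simplified stand-in for calc_delta_fair(): scale nanoseconds by weight. */
static uint64_t toy_calc_delta_fair(uint64_t delta_ns, uint64_t weight)
{
	assert(weight > 0);
	return delta_ns * NICE_0_LOAD / weight;
}

/* Runtime-based charge, as in update_curr(): proportional to time run. */
static uint64_t toy_charge_runtime(uint64_t vruntime, uint64_t delta_exec_ns,
				   uint64_t weight)
{
	return vruntime + toy_calc_delta_fair(delta_exec_ns, weight);
}

/* Hypothetical yield penalty: half a slice, regardless of time run. */
static uint64_t toy_charge_penalty(uint64_t vruntime, uint64_t slice_ns,
				   uint64_t weight)
{
	return vruntime + toy_calc_delta_fair(slice_ns / 2, weight);
}
```

With an illustrative 3 ms slice, a nice-0 entity that ran for 10 us
before yielding is charged 10 us by toy_charge_runtime() but 1.5 ms by
toy_charge_penalty(). The second number has no relation to what the
entity consumed.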
Also, assume the yielding vCPU and the target are in the same cgroup:
you'll advance the vruntime of the task in yield_deboost_apply_penalty()
and then again in yield_task_fair()?
> + se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> + se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
There is no point in setting vlag for a running entity.
> + update_min_vruntime(cfs_rq_common);
> This is the same update pattern used in update_curr(). The EEVDF
> relationship lag = (V - v) * w remains valid—vlag becomes more
> negative as vruntime increases.
Sure, "V" just moves to the new avg_vruntime() to give the 0-lag
point, but modifying the vruntime arbitrarily doesn't seem fair to
me.
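
A small userspace sketch of what "V just moves" means. This is a toy
model under stated assumptions: V is computed as a plain weighted
average with truncating integer division, whereas the kernel tracks it
relative to min_vruntime. struct toy_entity and the toy_*() helpers are
hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_entity {
	int64_t vruntime;
	int64_t weight;
};

/* Weighted average vruntime: V = sum(w_i * v_i) / sum(w_i). */
static int64_t toy_avg_vruntime(const struct toy_entity *e, size_t n)
{
	int64_t sum = 0, load = 0;

	for (size_t i = 0; i < n; i++) {
		sum += e[i].weight * e[i].vruntime;
		load += e[i].weight;
	}
	assert(load > 0);
	return sum / load;
}

/* lag_i = (V - v_i) * w_i, computed against the current V. */
static int64_t toy_lag(const struct toy_entity *e, size_t n, size_t i)
{
	return (toy_avg_vruntime(e, n) - e[i].vruntime) * e[i].weight;
}
```

Take three equal-weight entities at vruntime 0, 3000 and 6000, so V is
3000. Adding 3000 to the first one moves V to 4000, and the second
entity's lag goes from 0 to positive although nothing about its own
runtime changed. The weighted lags still sum to zero afterwards, which
is all the "relationship remains valid" argument establishes: it holds
for any value written into vruntime.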
> The presence of other runnable tasks
> doesn't affect the mathematical correctness; each entity's lag is
> computed independently relative to avg_vruntime.
>
>>
>>>
>>>> se->deadline += calc_delta_fair(se->slice, se);
>>>> - }
>>>> +
>>>> + /*
>>>> + * If we have more than one runnable task queued below
>>>> + * this cfs_rq, the next pick will likely go for a
>>>> + * different entity now that we have advanced the
>>>> + * vruntime and the deadline of the running entity.
>>>> + */
>>>> + if (cfs_rq->h_nr_runnable > 1)
>>>
>>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
>>> correctly. Shouldn't the penalty apply at the LCA of yielder and
>>> target? Otherwise the vruntime adjustment might not affect the level
>>> where they actually compete.
>>
>> So here is the case I'm going after - consider the following
>> hierarchy:
>>
>>          root
>>         /    \
>>       CG0    CG1
>>        |      |
>>        A      B
>>
>> CG* are cgroups and [A-Z]* are tasks
>>
>> A decides to yield to B, and advances its deadline on CG0's timeline.
>> Currently, if CG0 is eligible and CG1 isn't, pick will still select
>> CG0, which will in turn select task A, and it'll yield again. This
>> cycle repeats until the vruntime of CG0 grows large enough to make it
>> ineligible and route the EEVDF pick to CG1.
>
> Yes, natural convergence works, but requires multiple cycles. Your
> h_nr_runnable > 1 stops propagation when another entity might be
> picked, but "might" depends on vruntime ordering which needs time to
> develop. Our penalty forces immediate ineligibility at the LCA. One
> penalty application vs N natural yield cycles.
>
>>
>> Now consider:
>>
>>
>>          root
>>         /    \
>>       CG0    CG1
>>      /   \    |
>>     A     C   B
>>
>> Same scenario: A yields to B. A advances its vruntime and deadline
>> as part of the yield. Now, why should CG0 sacrifice its fair share of
>> runtime for A when task B is runnable? Just because one task decided
>> to yield to another task in a different cgroup doesn't mean other
>> waiting tasks in that hierarchy should suffer.
>
> You're right that C suffers unfairly if it's independent work. This is
> a known tradeoff.
KVM is only one of the users of yield_to(). This whole debouncer
infrastructure seems to be overcomplicating all this. If anything
is yielding across a cgroup boundary, that seems like a bad
configuration, and if it is necessary, the previous suggestion handles
it fairly. I don't mind accounting for the lost time in
yield_to_task_fair() and charging it to the target task, but apart from
that, I don't think any of it is "fair".

Again, maybe it is only me, and everyone else sees the vision, having
dealt with virtualization.
--
Thanks and Regards,
Prateek