linux-kernel - Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM

Message-ID: <34b2d375-1535-41c1-9ec4-bb054641abd5@amd.com>
Date: Thu, 13 Nov 2025 15:18:30 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Wanpeng Li <kernellwp@...il.com>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
	Sean Christopherson <seanjc@...gle.com>, Steven Rostedt
	<rostedt@...dmis.org>, Vincent Guittot <vincent.guittot@...aro.org>, "Juri
 Lelli" <juri.lelli@...hat.com>, <linux-kernel@...r.kernel.org>,
	<kvm@...r.kernel.org>, Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for
 oversubscribed KVM

Hello Wanpeng,

On 11/13/2025 2:03 PM, Wanpeng Li wrote:
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b4617d631549..87560f5a18b3 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>>>>          * which yields immediately again; without the condition the vruntime
>>>>          * ends up quickly running away.
>>>>          */
>>>> -       if (entity_eligible(cfs_rq, se)) {
>>>> +       do {
>>>> +               cfs_rq = cfs_rq_of(se);
>>>> +
>>>> +               /*
>>>> +                * Another entity will be selected at next pick.
>>>> +                * Single entity on cfs_rq can never be ineligible.
>>>> +                */
>>>> +               if (!entity_eligible(cfs_rq, se))
>>>> +                       break;
>>>> +
>>>>                 se->vruntime = se->deadline;
>>>
>>> Setting vruntime = deadline zeros out lag. Does this cause fairness
>>> drift with repeated yields? We explicitly recalculate vlag after
>>> adjustment to preserve EEVDF invariants.
>>
>> We only push deadline when the entity is eligible. Ineligible entity
>> will break out above. Also I don't get how adding a penalty to an
>> entity in the cgroup hierarchy of the yielding task when there are
>> other runnable tasks considered as "preserve(ing) EEVDF invariants".
> 
> Our penalty preserves EEVDF invariants by recalculating all scheduler state:
>    se->vruntime = new_vruntime;
>    se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
>    se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
>    update_min_vruntime(cfs_rq); // maintains cfs_rq consistency

So your exact implementation in yield_deboost_apply_penalty() is:

> +	new_vruntime = se_y_lca->vruntime + penalty;
> +
> +	/* Validity check */
> +	if (new_vruntime <= se_y_lca->vruntime)
> +		return;
> +
> +	se_y_lca->vruntime = new_vruntime;

You've set this vruntime to a value you saw fit based on your
performance data, but better performance is not necessarily fair.

update_curr() uses:

    /* Time elapsed. */
    delta_exec = now - se->exec_start;
    se->exec_start = now;

    curr->vruntime += calc_delta_fair(delta_exec, curr);


"delta_exec" is based on the amount of time the entity has actually run,
as opposed to the penalty calculation, which simply advances the vruntime
by half a slice because someone in the hierarchy decided to yield.
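
To make the contrast concrete, here is a minimal user-space sketch (not
kernel code). model_delta_fair() is a hypothetical stand-in for
calc_delta_fair() that assumes the simplified scaling
delta * 1024 / weight, and model_half_slice_penalty() assumes the penalty
is exactly half a slice in the same scaled units; the series may compute
it differently.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed weight of a nice-0 entity in this simplified model. */
#define MODEL_NICE_0_WEIGHT 1024ULL

/*
 * Hypothetical stand-in for calc_delta_fair(): scale a wall-clock
 * delta by 1024 / weight to obtain a virtual-time delta.
 */
static uint64_t model_delta_fair(uint64_t delta_ns, uint64_t weight)
{
	assert(weight > 0);
	return delta_ns * MODEL_NICE_0_WEIGHT / weight;
}

/* vruntime advance earned by actually running for delta_exec_ns. */
static uint64_t model_runtime_advance(uint64_t delta_exec_ns, uint64_t weight)
{
	return model_delta_fair(delta_exec_ns, weight);
}

/*
 * vruntime advance from a fixed half-slice penalty (assumption).
 * Note that it does not depend on how long the entity has run.
 */
static uint64_t model_half_slice_penalty(uint64_t slice_ns, uint64_t weight)
{
	return model_delta_fair(slice_ns / 2, weight);
}
```

For a nice-0 entity with a 3 ms slice that ran 100 us before yielding,
model_runtime_advance(100000, 1024) gives 100000, while
model_half_slice_penalty(3000000, 1024) gives 1500000: the penalty is
unrelated to the time actually consumed.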

Also, assume the yielding vCPU and the target are in the same cgroup:
you'll advance the vruntime of the task in yield_deboost_apply_penalty()
and then again in yield_task_fair()?


> +	se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> +	se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;

There is no point in setting vlag for a running entity.

> +	update_min_vruntime(cfs_rq_common);

> This is the same update pattern used in update_curr(). The EEVDF
> relationship lag = (V - v) * w remains valid—vlag becomes more
> negative as vruntime increases.

Sure, "V" just moves to the new avg_vruntime() to give the 0-lag
point, but modifying the vruntime arbitrarily doesn't seem fair to
me.
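
A small user-space model (not kernel code) of that effect, assuming the
textbook definitions V = sum(w_i * v_i) / sum(w_i) and vlag = V - v. The
struct and the model_* helpers are hypothetical names for illustration,
and the model ignores the min_vruntime-relative arithmetic the kernel
uses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a scheduling entity. */
struct model_entity {
	int64_t vruntime;
	int64_t weight;
};

/* Weighted average vruntime: V = sum(w_i * v_i) / sum(w_i). */
static int64_t model_avg_vruntime(const struct model_entity *e, size_t n)
{
	int64_t sum = 0, load = 0;

	for (size_t i = 0; i < n; i++) {
		sum += e[i].weight * e[i].vruntime;
		load += e[i].weight;
	}
	assert(load > 0);
	return sum / load;
}

/* Virtual lag of entity idx relative to the 0-lag point: V - v. */
static int64_t model_vlag(const struct model_entity *e, size_t n, size_t idx)
{
	return model_avg_vruntime(e, n) - e[idx].vruntime;
}

/* An entity is eligible when its lag is non-negative (v <= V). */
static int model_eligible(const struct model_entity *e, size_t n, size_t idx)
{
	return model_vlag(e, n, idx) >= 0;
}
```

With two equal-weight entities at vruntime 100, adding a penalty of 50 to
the first moves V from 100 to 125; the first entity's vlag becomes -25
and the second's becomes +25. The arithmetic stays consistent, but the
shift comes from the penalty and not from any time the first entity ran.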

> The presence of other runnable tasks
> doesn't affect the mathematical correctness; each entity's lag is
> computed independently relative to avg_vruntime.
> 
>>
>>>
>>>>                 se->deadline += calc_delta_fair(se->slice, se);
>>>> -       }
>>>> +
>>>> +               /*
>>>> +                * If we have more than one runnable task queued below
>>>> +                * this cfs_rq, the next pick will likely go for a
>>>> +                * different entity now that we have advanced the
>>>> +                * vruntime and the deadline of the running entity.
>>>> +                */
>>>> +               if (cfs_rq->h_nr_runnable > 1)
>>>
>>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
>>> correctly. Shouldn't the penalty apply at the LCA of yielder and
>>> target? Otherwise the vruntime adjustment might not affect the level
>>> where they actually compete.
>>
>> So here is the case I'm going after - consider the following
>> hierarchy:
>>
>>      root
>>     /    \
>>   CG0   CG1
>>    |     |
>>    A     B
>>
>>   CG* are cgroups and, [A-Z]* are tasks
>>
>> A decides to yield to B, and advances its deadline on CG0's timeline.
>> Currently, if CG0 is eligible and CG1 isn't, pick will still select
>> CG0 which will in turn select task A and it'll yield again. This
>> cycle repeats until vruntime of CG0 turns large enough to make itself
>> ineligible and route the EEVDF pick to CG1.
> 
> Yes, natural convergence works, but requires multiple cycles. Your
> h_nr_runnable > 1 stops propagation when another entity might be
> picked, but "might" depends on vruntime ordering which needs time to
> develop. Our penalty forces immediate ineligibility at the LCA. One
> penalty application vs N natural yield cycles.
> 
>>
>> Now consider:
>>
>>
>>        root
>>       /    \
>>     CG0   CG1
>>    /   \   |
>>   A     C  B
>>
>> Same scenario: A yields to B. A advances its vruntime and deadline
>> as a part of yield. Now, why should CG0 sacrifice its fair share of
>> runtime for A when task B is runnable? Just because one task decided
>> to yield to another task in a different cgroup doesn't mean other
>> waiting tasks on that hierarchy suffer.
> 
> You're right that C suffers unfairly if it's independent work. This is
> a known tradeoff.

So KVM is only one of the users of yield_to(). This whole debouncer
infrastructure seems to overcomplicate all this. If anything is
yielding across a cgroup boundary, that seems like a bad
configuration, and if it is necessary, the previous suggestion handles
it fairly. I don't mind accounting the lost time in
yield_to_task_fair() and charging it to the target task, but apart
from that, I don't think any of it is "fair".

Again, maybe it is only me, and everyone else who has dealt with
virtualization sees the vision.

-- 
Thanks and Regards,
Prateek