Message-ID: <ef7d974e-7fcd-4e30-8a0d-8b97e00478bc@amd.com>
Date: Wed, 12 Nov 2025 12:55:23 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Wanpeng Li <kernellwp@...il.com>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, "Paolo
Bonzini" <pbonzini@...hat.com>, Sean Christopherson <seanjc@...gle.com>
CC: Steven Rostedt <rostedt@...dmis.org>, Vincent Guittot
<vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>,
<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, Wanpeng Li
<wanpengli@...cent.com>
Subject: Re: [PATCH 04/10] sched/fair: Add penalty calculation and application
logic

Hello Wanpeng,

On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> +/*
> + * Calculate penalty with debounce logic for EEVDF yield deboost.
> + * Computes vruntime penalty based on fairness gap (need) plus granularity,
> + * applies queue-size-based caps to prevent excessive penalties in small queues,
> + * and implements reverse-pair debounce (~300us) to reduce ping-pong effects.
> + * Returns 0 if no penalty needed, otherwise returns clamped penalty value.
> + */
> +static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> + struct sched_entity *se_t_lca, struct sched_entity *se_t,
> + int nr_queued)
> +{
> + u64 gran, need, penalty, maxp;
> + u64 gran_floor;
> + u64 weighted_need, base;
> +
> + gran = calc_delta_fair(sysctl_sched_base_slice, se_y_lca);
> + /* Low-bound safeguard for gran when slice is abnormally small */
> + gran_floor = calc_delta_fair(sysctl_sched_base_slice >> 1, se_y_lca);
> + if (gran < gran_floor)
Is this even possible? Both values are computed against the same se,
and calc_delta_fair() is monotonic in its delta argument.
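
For reference, this is roughly what calc_delta_fair() looks like in
kernel/sched/fair.c:

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
        /* Scale delta by NICE_0_LOAD / se->load.weight */
        if (unlikely(se->load.weight != NICE_0_LOAD))
                delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

        return delta;
}

A smaller input delta cannot produce a larger output, so AFAICT the
floor is dead code.
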
> + gran = gran_floor;
> +
> + need = 0;
> + if (se_t_lca->vruntime > se_y_lca->vruntime)
> + need = se_t_lca->vruntime - se_y_lca->vruntime;
So I'm assuming you want the yielding task's vruntime to
cross the target's vruntime simply because one task somewhere
down the hierarchy said so.
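
I'm assuming se_y_lca/se_t_lca come out of a find_matching_se()-style
walk up to the first common cfs_rq, i.e. something like the existing
helper in kernel/sched/fair.c (trimmed):

static void
find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
        int se_depth = (*se)->depth;
        int pse_depth = (*pse)->depth;

        /* First walk up until both entities are at the same depth ... */
        while (se_depth > pse_depth) {
                se_depth--;
                *se = parent_entity(*se);
        }
        while (pse_depth > se_depth) {
                pse_depth--;
                *pse = parent_entity(*pse);
        }

        /* ... then in lock-step until they share a cfs_rq. */
        while (!is_same_group(*se, *pse)) {
                *se = parent_entity(*se);
                *pse = parent_entity(*pse);
        }
}

If so, the entity that eats the penalty can be a whole group entity,
not just the yielding task.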
> +
> + /* Apply 10% boost to need when positive (weighted_need = need * 1.10) */
> + penalty = gran;
So at the very least I see it getting a weighted (base_slice / 2) penalty
...
> + if (need) {
> + /* weighted_need = need + 10% */
> + weighted_need = need + need / 10;
> + /* clamp to avoid overflow when adding to gran (still capped later) */
> + if (weighted_need > U64_MAX - penalty)
> + weighted_need = U64_MAX - penalty;
> + penalty += weighted_need;
... if not more ...
> + }
> +
> + /* Apply debounce via helper to avoid ping-pong */
> + penalty = yield_deboost_apply_debounce(rq, se_t, penalty, need, gran);
... since without debounce, the penalty remains the same.
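
To put some made-up numbers on it (nice-0 task, assume gran == 750us,
i.e. roughly the default base slice):

        need          = 1000us  /* target leads the yielder by 1ms */
        weighted_need = 1100us  /* need + need / 10 */
        penalty       = 1850us  /* gran + weighted_need, before caps */

and with nr_queued == 2 the cap below is 6 * gran == 4500us, so the
full 1850us lands on the yielding entity's vruntime.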
> +
> + /* Upper bound (cap): slightly more aggressive for mid-size queues */
> + if (nr_queued == 2)
> + maxp = gran * 6; /* Strongest push for 2-task ping-pong */
> + else if (nr_queued == 3)
> + maxp = gran * 4; /* 4.0 * gran */
> + else if (nr_queued <= 6)
> + maxp = (gran * 5) / 2; /* 2.5 * gran */
> + else if (nr_queued <= 8)
> + maxp = gran * 2; /* 2.0 * gran */
> + else if (nr_queued <= 12)
> + maxp = (gran * 3) / 2; /* 1.5 * gran */
> + else
> + maxp = gran; /* 1.0 * gran */
And all the nr_queued calculations are based on the entities queued
on this cfs_rq and not on "h_nr_queued", so we can have a boatload of
tasks queued below those entities, but since one task decided to call
yield_to(), let us make them all starve a little?
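
As a made-up example of how far apart the two can be:

        root cfs_rq: nr_queued   == 2    /* two group entities */
                     h_nr_queued == 200  /* tasks in their subtrees */

This would pick the most aggressive 6 * gran "2-task ping-pong" cap
for what is really a 200-task ping-pong.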
> +
> + if (penalty < gran)
> + penalty = gran;
> + if (penalty > maxp)
> + penalty = maxp;
> +
> + /* If no need, apply refined baseline push (low risk + mid risk combined). */
> + if (need == 0) {
> + /*
> + * Baseline multiplier for need==0:
> + * 2 -> 1.00 * gran
> + * 3 -> 0.9375 * gran
> + * 4–6 -> 0.625 * gran
> + * 7–8 -> 0.50 * gran
> + * 9–12 -> 0.375 * gran
> + * >12 -> 0.25 * gran
> + */
> + base = gran;
> + if (nr_queued == 3)
> + base = (gran * 15) / 16; /* 0.9375 */
> + else if (nr_queued >= 4 && nr_queued <= 6)
> + base = (gran * 5) / 8; /* 0.625 */
> + else if (nr_queued >= 7 && nr_queued <= 8)
> + base = gran / 2; /* 0.5 */
> + else if (nr_queued >= 9 && nr_queued <= 12)
> + base = (gran * 3) / 8; /* 0.375 */
> + else if (nr_queued > 12)
> + base = gran / 4; /* 0.25 */
> +
> + if (penalty < base)
> + penalty = base;
> + }
> +
> + return penalty;
> +}
> +
> +/*
> + * Apply penalty and update EEVDF fields for scheduler consistency.
> + * Safely applies vruntime penalty with overflow protection, then updates
> + * EEVDF-specific fields (deadline, vlag) and cfs_rq min_vruntime to maintain
> + * scheduler state consistency. Returns true on successful application,
> + * false if penalty cannot be safely applied.
> + */
> +static void __maybe_unused yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> + struct cfs_rq *cfs_rq_common, u64 penalty)
> +{
> + u64 new_vruntime;
> +
> + /* Overflow protection */
> + if (se_y_lca->vruntime > (U64_MAX - penalty))
> + return;
> +
> + new_vruntime = se_y_lca->vruntime + penalty;
> +
> + /* Validity check */
> + if (new_vruntime <= se_y_lca->vruntime)
> + return;
> +
> + se_y_lca->vruntime = new_vruntime;
> + se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
And with that we update vruntime to an arbitrary value simply
because one task in the hierarchy decided to call yield_to().

Since we are on the topic, you are also missing an update_curr(),
which is only done in yield_task_fair(), so you are actually
looking at a stale vruntime for the yielding entity.
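
For reference, this is the relevant bit of yield_task_fair() in
kernel/sched/fair.c (trimmed), which runs before it touches the
entity:

static void yield_task_fair(struct rq *rq)
{
        ...
        update_rq_clock(rq);
        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);
        /*
         * Tell update_rq_clock() that we've just updated,
         * so we don't do microscopic update in schedule()
         * and double the fastpath cost.
         */
        rq_clock_skip_update(rq);
        ...
}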
> + se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
> + update_min_vruntime(cfs_rq_common);
> +}
> +
> /*
> * sched_yield() is very simple
> */
--
Thanks and Regards,
Prateek