Message-ID: <djbwr5uwkzmamulzo6juejvag6kv3ug5nxmux75vo5jmma32pw@eguk2xwaos4f>
Date: Fri, 31 Oct 2025 10:27:51 +0000
From: Mel Gorman <mgorman@...hsingularity.net>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Valentin Schneider <vschneid@...hat.com>, Chris Mason <clm@...a.com>
Subject: Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with
 EEVDF goals

On Thu, Oct 30, 2025 at 10:10:58AM +0100, Peter Zijlstra wrote:
> On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
> > +static inline enum preempt_wakeup_action
> > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> > +		 struct sched_entity *pse, struct sched_entity *se)
> > +{
> > +	bool pse_before;
> > +
> > +	/*
> > +	 * Ignore wakee preemption on WF_FORK as it is less likely that
> > +	 * there is shared data as exec often follows fork. Do not
> > +	 * preempt for tasks that are sched_delayed as it would violate
> > +	 * EEVDF to forcibly queue an ineligible task.
> > +	 */
> > +	if (!sched_feat(NEXT_BUDDY) ||
> 
> This seems wrong, that would mean wakeup preemption gets killed the
> moment you disable NEXT_BUDDY, that can't be right.
> 

Correct, the check is bogus.

> > +	    (wake_flags & WF_FORK) ||
> > +	    (pse->sched_delayed)) {
> > +		return PREEMPT_WAKEUP_NONE;
> > +	}
> > +
> > +	/* Reschedule if waker is no longer eligible. */
> > +	if (!entity_eligible(cfs_rq, se))
> > +		return PREEMPT_WAKEUP_RESCHED;
> 
> That comment isn't accurate, unless you add: && in_task(). That is, if
> this is an interrupt doing the wakeup, it has nothing to do with
> current.
> 

That was a complete oversight.

> > +	/*
> > +	 * Keep existing buddy if the deadline is sooner than pse.
> > +	 * The downside is that the older buddy may be cache cold
> > +	 * but that is unpredictable whereas an earlier deadline
> > +	 * is absolute.
> > +	 */
> > +	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> > +		return PREEMPT_WAKEUP_NONE;
> 
> But if previously we set next and didn't preempt, we should try again,
> maybe it has more success now. That is, should this not be _NEXT?
> 

The context of why the original buddy was set is now lost but you're
right, it is more straightforward to reconsider the old buddy. It's
more in line with EEVDF objectives, and predicting cache residency and
future hotness requires crystal ball instructions.
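
For anyone following the deadline comparison above, here is a minimal
standalone sketch of the "earlier deadline wins" test. It is a userspace
model, not the kernel code: deadline_before() and check_deadline_examples()
are hypothetical names, and the sketch assumes deadlines are free-running
u64 values compared through a signed difference so that wraparound is
tolerated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of an "is a earlier than b" deadline test.
 * The unsigned subtraction wraps, and the result is reinterpreted as signed
 * (two's complement, as on GCC and Clang), so the comparison stays correct
 * as long as the two deadlines are less than 2^63 apart.
 */
static bool deadline_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Small self-check covering the ordinary and the wrapped case. */
static void check_deadline_examples(void)
{
	assert(deadline_before(100, 250));		/* earlier deadline */
	assert(!deadline_before(250, 100));		/* later deadline */
	assert(!deadline_before(100, 100));		/* equal is not "before" */
	assert(deadline_before(UINT64_MAX - 5, 10));	/* 10 is past the wrap */
}
```

In this model, an existing buddy with deadline 100 is "before" a wakee
with deadline 250, which is the case where the old buddy would be
reconsidered rather than replaced.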

> > +
> > +	set_next_buddy(pse);
> > +
> > +	/*
> > +	 * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> > +	 * strictly enforced because the hint is either misunderstood or
> > +	 * multiple tasks must be woken up.
> > +	 */
> > +	pse_before = entity_before(pse, se);
> > +	if (wake_flags & WF_SYNC) {
> > +		u64 delta = rq_clock_task(rq) - se->exec_start;
> > +		u64 threshold = sysctl_sched_migration_cost;
> > +
> > +		/*
> > +		 * WF_SYNC without WF_TTWU is not expected so warn if it
> > +		 * happens even though it is likely harmless.
> > +		 */
> > +		WARN_ON_ONCE(!(wake_flags | WF_TTWU));
> 
> s/|/&/ ?
> 

Bah, thanks.
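
To spell out the s/|/&/ point, here is a small standalone sketch (not
kernel code) comparing the two expressions. The flag values are
assumptions for illustration only, and warn_as_posted(),
warn_as_suggested() and check_warn_examples() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values, assumed for this sketch only. */
#define WF_TTWU	0x08
#define WF_SYNC	0x10

/*
 * Condition as posted: OR-ing with a non-zero constant always yields a
 * non-zero value, so the negation is always false and the warning can
 * never trigger.
 */
static bool warn_as_posted(int wake_flags)
{
	return !(wake_flags | WF_TTWU);
}

/* Condition as suggested: true exactly when WF_TTWU is not set. */
static bool warn_as_suggested(int wake_flags)
{
	return !(wake_flags & WF_TTWU);
}

/* Self-check: only the AND form detects WF_SYNC without WF_TTWU. */
static void check_warn_examples(void)
{
	assert(!warn_as_posted(WF_SYNC));
	assert(!warn_as_posted(0));
	assert(warn_as_suggested(WF_SYNC));
	assert(!warn_as_suggested(WF_SYNC | WF_TTWU));
}
```

With the OR form the check is dead code for every possible wake_flags
value; with the AND form it fires only for WF_SYNC wakeups that lack
WF_TTWU, which is what the comment in the patch describes.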

> > +		if ((s64)delta < 0)
> > +			delta = 0;
> > +
> > +		/*
> > +		 * WF_RQ_SELECTED implies the tasks are stacking on a
> > +		 * CPU when they could run on other CPUs. Reduce the
> > +		 * threshold before preemption is allowed to an
> > +		 * arbitrary lower value as it is more likely (but not
> > +		 * guaranteed) the waker requires the wakee to finish.
> > +		 */
> > +		if (wake_flags & WF_RQ_SELECTED)
> > +			threshold >>= 2;
> > +
> > +		/*
> > +		 * As WF_SYNC is not strictly obeyed, allow some runtime for
> > +		 * batch wakeups to be issued.
> > +		 */
> > +		if (pse_before && delta >= threshold)
> > +			return PREEMPT_WAKEUP_RESCHED;
> > +
> > +		return PREEMPT_WAKEUP_NONE;
> > +	}
> > +
> > +	return PREEMPT_WAKEUP_NEXT;
> > +}
> 
> Add to this that AFAICT your patch ends up doing:
> 
> 	__pick_eevdf(.protect = false) == pse
> 
> which unconditionally disables the slice protection feature.
> 

Yes, trying to converge PREEMPT_SHORT with NEXT_BUDDY during prototyping
was a poor decision because it led to mistakes like this.

-- 
Mel Gorman
SUSE Labs