linux-kernel - Re: [PATCH] sched/eevdf: Fix se->slice being set to U64

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250425103249.GO18306@noisy.programming.kicks-ass.net>
Date: Fri, 25 Apr 2025 12:32:49 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Omar Sandoval <osandov@...ndov.com>
Cc: Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, linux-kernel@...r.kernel.org,
	Rik van Riel <riel@...riel.com>, Chris Mason <clm@...a.com>,
	kernel-team@...com, Pat Cody <pat@...cody.io>,
	Breno Leitao <leitao@...ian.org>
Subject: Re: [PATCH] sched/eevdf: Fix se->slice being set to U64_MAX and
 resulting crash

On Fri, Apr 25, 2025 at 01:51:24AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@...com>
> 
> There is a code path in dequeue_entities() that can set the slice of a
> sched_entity to U64_MAX, which sometimes results in a crash.
> 
> The offending case is when dequeue_entities() is called to dequeue a
> delayed group entity, and then the entity's parent's dequeue is delayed.
> In that case:
> 
> 1. In the if (entity_is_task(se)) else block at the beginning of
>    dequeue_entities(), slice is set to
>    cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
>    it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.

Whoopsy..

> 2. The first for_each_sched_entity() loop dequeues the entity.
> 3. If the entity was its parent's only child, then the next iteration
>    tries to dequeue the parent.
> 4. If the parent's dequeue needs to be delayed, then it breaks from the
>    first for_each_sched_entity() loop _without updating slice_.
> 5. The second for_each_sched_entity() loop sets the parent's ->slice to
>    the saved slice, which is still U64_MAX.
> 
> This throws off subsequent calculations with potentially catastrophic
> results. A manifestation we saw in production was:
> 
> 6. In update_entity_lag(), se->slice is used to calculate limit, which
>    ends up as a huge negative number.
> 7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
>    is negative, vlag > limit, so se->vlag is set to the same huge
>    negative number.
> 8. In place_entity(), se->vlag is scaled, which overflows and results in
>    another huge (positive or negative) number.
> 9. The adjusted lag is subtracted from se->vruntime, which increases or
>    decreases se->vruntime by a huge number.
> 10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
>     incorrectly returns false because the vruntime is so far from the
>     other vruntimes on the queue, causing the
>     (vruntime - cfs_rq->min_vruntime) * load calulation to overflow.
> 11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
> 12. pick_next_entity() tries to dereference the return value of
>     pick_eevdf() and crashes.

Impressive fail chain that.

> Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
> huge vruntime ranges and bogus vlag values, and I also traced se->slice
> being set to U64_MAX on live systems (which was usually "benign" since
> the rest of the runqueue needed to be in a particular state to crash).
> 
> Fix it in dequeue_entities() by always setting slice from the first
> non-empty cfs_rq.
> 
> Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
> Signed-off-by: Omar Sandoval <osandov@...com>

Thanks!