Message-ID: <9d38c61098b426777c1a748cf1baf8e57c41c334.camel@surriel.com>
Date: Wed, 02 Apr 2025 10:59:09 -0400
From: Rik van Riel <riel@...riel.com>
To: Peter Zijlstra <peterz@...radead.org>, Pat Cody <pat@...cody.io>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
patcody@...a.com, kernel-team@...a.com, stable@...r.kernel.org, Breno
Leitao <leitao@...ian.org>
Subject: Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()

On Mon, 2025-03-24 at 12:56 +0100, Peter Zijlstra wrote:
> On Thu, Mar 20, 2025 at 01:53:10PM -0700, Pat Cody wrote:
> > pick_eevdf() can return null, resulting in a null pointer
> > dereference crash in pick_next_entity()
>
> If it returns NULL while nr_queued, something is really badly wrong.
>
> Your check will hide this badness.

Looking at the numbers, I suspect vruntime_eligible()
is simply not allowing us to run the left-most entity
in the rb tree.
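
For reference, the eligibility check in question looks roughly
like this (a paraphrase of kernel/sched/fair.c; the exact code
on the affected tree may differ):

static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
        struct sched_entity *curr = cfs_rq->curr;
        s64 avg = cfs_rq->avg_vruntime;
        long load = cfs_rq->avg_load;

        if (curr && curr->on_rq) {
                unsigned long weight = scale_load_down(curr->load.weight);

                /* fold the currently running entity into the averages */
                avg += entity_key(cfs_rq, curr) * weight;
                load += weight;
        }

        /* eligible iff vruntime is at or below the weighted average,
         * i.e. avg / load, with both sides multiplied by load */
        return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
}
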
At the root level we are seeing these numbers:
*(struct cfs_rq *)0xffff8882b3b80000 = {
        .load = (struct load_weight){
                .weight = (unsigned long)4750106,
                .inv_weight = (u32)0,
        },
        .nr_running = (unsigned int)3,
        .h_nr_running = (unsigned int)3,
        .idle_nr_running = (unsigned int)0,
        .idle_h_nr_running = (unsigned int)0,
        .h_nr_delayed = (unsigned int)0,
        .avg_vruntime = (s64)-2206158374744070955,
        .avg_load = (u64)4637,
        .min_vruntime = (u64)12547674988423219,

Meanwhile, the cfs_rq->curr entity has a weight of
4699124, a vruntime of 12071905127234526, and a
vlag of -2826239998.

The left-most entity in the cfs_rq's rb tree has a weight
of 107666, a vruntime of 16048555717648580,
and a vlag of -1338888.

I cannot for the life of me figure out how the
avg_vruntime number is so out of whack from what
the vruntime numbers of the sched entities on the
runqueue look like.
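
As a quick back-of-the-envelope sketch (assuming entity_key()
is vruntime - min_vruntime and scale_load_down() shifts the
weight right by SCHED_FIXEDPOINT_SHIFT (10) as on 64-bit
kernels), the left-most entity alone should be contributing a
large positive amount to avg_vruntime:

#include <stdio.h>

int main(void)
{
        /* numbers taken from the dump above */
        long long min_vruntime  = 12547674988423219LL;
        long long left_vruntime = 16048555717648580LL;
        long long left_weight   = 107666 >> 10;        /* ~105 after scaling */

        long long key = left_vruntime - min_vruntime;  /* ~3.5e15, positive */

        /* prints 367592476568662905, i.e. roughly +3.7e17, while the
         * dumped avg_vruntime is roughly -2.2e18 */
        printf("weighted key: %lld\n", key * left_weight);
        return 0;
}
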
The avg_vruntime code is confusing me. On the
one hand the vruntime number is multiplied by
the sched entity's weight when adding to or
subtracting from avg_vruntime, but on the other
hand vruntime_eligible() scales the comparison
by the cfs_rq->avg_load number.
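
That accounting side looks roughly like this (again a
paraphrase of kernel/sched/fair.c; details may differ per
tree):

static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        return (s64)(se->vruntime - cfs_rq->min_vruntime);
}

static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        unsigned long weight = scale_load_down(se->load.weight);
        s64 key = entity_key(cfs_rq, se);

        /* weight-scaled sum of keys relative to min_vruntime */
        cfs_rq->avg_vruntime += key * weight;
        /* sum of the scaled-down weights of the queued entities */
        cfs_rq->avg_load += weight;
}

static void avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        unsigned long weight = scale_load_down(se->load.weight);
        s64 key = entity_key(cfs_rq, se);

        cfs_rq->avg_vruntime -= key * weight;
        cfs_rq->avg_load -= weight;
}
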
What even protects the load number in vruntime_eligible()
from going negative in certain cases, when the current
entity's entity_key is a negative value?

The latter is probably not the bug we're seeing now, but
I don't understand how that is supposed to behave.

--
All Rights Reversed.