linux-kernel - Re: [PATCH] sched/fair: Add null pointer check to pick_next

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250422141351.GG14170@noisy.programming.kicks-ass.net>
Date: Tue, 22 Apr 2025 16:13:51 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Omar Sandoval <osandov@...ndov.com>
Cc: Rik van Riel <riel@...riel.com>, Chris Mason <clm@...a.com>,
	Pat Cody <pat@...cody.io>, mingo@...hat.com, juri.lelli@...hat.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, linux-kernel@...r.kernel.org, patcody@...a.com,
	kernel-team@...a.com, Breno Leitao <leitao@...ian.org>
Subject: Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()

On Mon, Apr 21, 2025 at 05:06:45PM -0700, Omar Sandoval wrote:

> Hey, Peter,
> 
> We haven't been able to test your latest patch, but I dug through some
> core dumps from crashes with your initial zero_vruntime patch. It looks
> like on just about all of them, the entity vruntimes are way too spread
> out, so we would get overflows regardless of what we picked as
> zero_vruntime.
> 
> As a representative example, we have a cfs_rq with 3 entities with the
> follow vruntimes and (scaled down) weights:
> 
> vruntime           weight
> 39052385155836636  2      (curr)
> 43658311782076206  2
> 42824722322062111  4886
> 
> The difference between the minimum and maximum is 4605926626239570,

Right, that is quite beyond usable. The key question at this point
is how did we get here...

> which is 53 bits. The total load is 4890. Even if you picked
> zero_vruntime to be equidistant from the minimum and maximum, the
> (vruntime - zero_vruntime) * load calculation in entity_eligible() is
> doomed to overflow.
> 
> That range in vruntime seems too absurd to be due to only to running too
> long without preemption. We're only seeing these crashes on internal
> node cgroups (i.e., cgroups whose children are cgroups, not tasks). This
> all leads me to suspect reweight_entity().
> 
> Specifically, this line in reweight_entity():
> 
> 	se->vlag = div_s64(se->vlag * se->load.weight, weight);
> 
> seems like it could create a very large vlag, which could cause
> place_entity() to adjust vruntime by a large value.

Right, I fixed that not too long ago. At the time I convinced myself
clipping there wasn't needed (in fact, it would lead to some other
artifacts iirc). Let me go review that decision :-)

> (place_entity() has
> a similarly suspect adjustment on se->vlag, only update_entity_lag()
> clamps it).

place_entity() is a bit tricky -- but should be well behaved. The
problem is that by adding an entity, you affect the average, because the
average shifts, the lag after placement is different than the lag you
started with.

The scaling there is to ensure the lag after placement reflects the
initial lag. Much like how update_entity_lag() is the lag before
removal, place_entity() ensures the lag is measured after insertion.

> I'll try to reproduce something like this, but do you have any thoughts
> on this theory in the meantime?

It seems reasonable, but as said, let me go look too :-)