Message-ID: <aAbdlTISuaJnc5AG@telecaster>
Date: Mon, 21 Apr 2025 17:06:45 -0700
From: Omar Sandoval <osandov@...ndov.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Rik van Riel <riel@...riel.com>, Chris Mason <clm@...a.com>,
	Pat Cody <pat@...cody.io>, mingo@...hat.com, juri.lelli@...hat.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, linux-kernel@...r.kernel.org, patcody@...a.com,
	kernel-team@...a.com, Breno Leitao <leitao@...ian.org>
Subject: Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()

On Fri, Apr 18, 2025 at 04:49:08PM -0700, Omar Sandoval wrote:
> On Fri, Apr 18, 2025 at 05:44:38PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 16, 2025 at 10:19:42AM -0400, Rik van Riel wrote:
> > 
> > > The most common warning by far, hitting
> > > about 90% of the time we hit anything
> > > in avg_vruntime_validate is the
> > > WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime)
> > > 
> > > The most common code path to getting there,
> > > covering about 85% of the cases:
> > > 
> > > avg_vruntime_validate
> > > avg_vruntime_sub
> > > __dequeue_entity
> > > set_next_entity
> > > pick_task_fair
> > > pick_next_task_fair
> > > __pick_next_task
> > > pick_next_task
> > > __schedule
> > > schedule
> > 
> > (I'm assuming zero_vruntime patch here, the stock kernel has more
> > problems vs min_vruntime getting stuck in the future etc..)
> > 
> > So I have a theory about this. Key is that you're running a PREEMPT-NONE
> > kernel.
> > 
> > There is one important site the overflow patch does not cover:
> > avg_vruntime_update().
> > 
> > However, due to PREEMPT_NONE, it is possible (Chris mentioned direct
> > reclaim and OOM) to have very long (in-kernel) runtimes without
> > scheduling.
> > 
> > (I suppose this should be visible by tracing sched_switch)
> > 
> > Were this to happen, then avg_vruntime_add() gets to deal with 'key *
> > weight' for a relatively large key. But per the overflow checks there,
> > this isn't hitting (much).
> > 
> > But then we try and update zero_vruntime, and that is doing 'delta *
> > cfs_rq->avg_load', and avg_load is a larger number and might just cause
> > overflow. We don't have a check there.
> > 
> > If that overflows then avg_vruntime is buggered, but none of the
> > individual tree entities hit overflow, exactly as you observe.
> > 
> > I've modified the zero_vruntime patch a little (new one below) to update
> > zero_vruntime in update_curr(). Additionally, I have every tick re-align
> > it with the tree avg (the update and the tree can drift due to numerical
> > funnies).
> > 
> > This should ensure these very long in-kernel runtimes are broken up into
> > at most tick-sized chunks, and by keeping zero_vruntime more or less
> > around the actual 0-lag point, the key values are minimized.
> > 
> > I should probably go play with a kernel module that spins for a few
> > seconds with preemption disabled, see if I can make it go BOOM :-) But
> > I've not yet done so.

[snip]

> >  static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
> > @@ -850,6 +811,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
> >  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >  	avg_vruntime_add(cfs_rq, se);
> > +	update_zero_vruntime(cfs_rq);
> 
> Won't this double-count cfs_rq->curr in the avg_vruntime() calculation
> in update_zero_vruntime()? When cfs_rq->curr->on_rq, put_prev_entity()
> calls this with se == cfs_rq->curr _before_ setting cfs_rq->curr to
> NULL. So the avg_vruntime_add() call will add entity_key(cfs_rq->curr)
> to cfs_rq->avg_vruntime and se_weight(cfs_rq->curr) to cfs_rq->avg_load.
> Then update_zero_vruntime() calls avg_vruntime(), which still sees
> curr->on_rq and will add curr's key and weight again. This throws
> zero_vruntime off, maybe by enough to bust zero_vruntime and/or
> avg_vruntime?
> 
> Should the call to update_zero_vruntime() go before avg_vruntime_add()?
> 
> Thanks,
> Omar

Hey, Peter,

We haven't been able to test your latest patch, but I dug through some
core dumps from crashes with your initial zero_vruntime patch. It looks
like on just about all of them, the entity vruntimes are way too spread
out, so we would get overflows regardless of what we picked as
zero_vruntime.

As a representative example, we have a cfs_rq with 3 entities with the
following vruntimes and (scaled down) weights:

vruntime           weight
39052385155836636  2      (curr)
43658311782076206  2
42824722322062111  4886

The difference between the minimum and maximum is 4605926626239570,
which is 53 bits. The total load is 4890. Even if you picked
zero_vruntime to be equidistant from the minimum and maximum, the
(vruntime - zero_vruntime) * load calculation in entity_eligible() is
doomed to overflow.
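
To put numbers on that, here is a quick user-space sketch (not kernel
code; the constants are the ones from the dump above, and the multiply
just mirrors the (vruntime - zero_vruntime) * load step described here):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		int64_t vmin = 39052385155836636LL;
		int64_t vmax = 43658311782076206LL;
		int64_t load = 4890;	/* total (scaled down) weight */
		/* Best case: zero_vruntime halfway between min and max, so
		 * the largest key is half the spread, which needs 52 bits. */
		int64_t key = (vmax - vmin) / 2;
		int64_t prod;

		/* 52 bits of key plus 13 bits of load is ~65 bits, which
		 * cannot fit in a signed 64-bit product. */
		if (__builtin_mul_overflow(key, load, &prod))
			printf("key * load overflows s64\n");
		return 0;
	}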

That range in vruntime seems too absurd to be due only to running too
long without preemption. We're only seeing these crashes on internal
node cgroups (i.e., cgroups whose children are cgroups, not tasks). This
all leads me to suspect reweight_entity().

Specifically, this line in reweight_entity():

	se->vlag = div_s64(se->vlag * se->load.weight, weight);

seems like it could create a very large vlag, which could cause
place_entity() to adjust vruntime by a large value. (place_entity() has
a similarly suspect adjustment on se->vlag, but there update_entity_lag()
clamps it).
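
For a sense of scale, here is a standalone, hypothetical example (the
weights are the scaled-down ones from the dump, the 20ms starting vlag
is made up, and div_s64() is just a user-space stand-in for the kernel
helper):

	#include <stdint.h>
	#include <stdio.h>

	static int64_t div_s64(int64_t dividend, int32_t divisor)
	{
		return dividend / divisor;
	}

	int main(void)
	{
		int64_t vlag = 20000000LL;	/* 20ms of lag, in ns (made up) */
		uint32_t old_weight = 4886;	/* weights as in the dump above */
		uint32_t new_weight = 2;

		/* The reweight_entity() line scales vlag by the ratio
		 * old_weight / new_weight, here a factor of ~2443, with
		 * nothing clamping the result. */
		vlag = div_s64(vlag * old_weight, new_weight);
		printf("rescaled vlag = %lld ns (~%lld s)\n",
		       (long long)vlag, (long long)(vlag / 1000000000LL));
		return 0;
	}

That prints a rescaled vlag of roughly 48 seconds, the kind of value that
could then end up in the place_entity() vruntime adjustment described above.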

I'll try to reproduce something like this, but do you have any thoughts
on this theory in the meantime?

Thanks,
Omar
