Message-ID: <20250625081951.GY1613200@noisy.programming.kicks-ass.net>
Date: Wed, 25 Jun 2025 10:19:51 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Aruna Ramakrishna <aruna.ramakrishna@...cle.com>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com
Subject: Re: [RFC PATCH] sched: Change nr_uninterruptible from unsigned to
signed int
(Please, be careful not to wrap quoted text, unwrapped it for you)
On Wed, Jun 25, 2025 at 04:48:36AM +0000, Aruna Ramakrishna wrote:
> We have encountered a bug where the load average displayed in top is
> abnormally high and obviously incorrect. The real values look like this
> (this is a production env, not a simulated one):
Whoopie..
> The nr_uninterruptible values for each of the CPU runqueues is large,
> and when they are summed up, the sum exceeds UINT_MAX, and the result
> is stored in a long, which preserves this overflow.
Right, that's the problem spot.
> long calc_load_fold_active(struct rq *this_rq, long adjust)
> {
> 	long nr_active, delta = 0;
>
> 	nr_active = this_rq->nr_running - adjust;
> 	nr_active += (int)this_rq->nr_uninterruptible;
> 	...
> From kernel/sched/loadavg.c:
>
> * - cpu_rq()->nr_uninterruptible isn't accurately tracked per-CPU because
> * this would add another cross-CPU cacheline miss and atomic operation
> * to the wakeup path. Instead we increment on whatever CPU the task ran
> * when it went into uninterruptible state and decrement on whatever CPU
> * did the wakeup. This means that only the sum of nr_uninterruptible over
> * all CPUs yields the correct result.
> *
>
> It seems that rq->nr_uninterruptible can go to large (positive) values
> for one CPU if a lot of tasks were migrated off of that CPU after going
> into an uninterruptible state. If they’re woken up on another CPU -
> those target CPUs will have negative nr_uninterruptible values. I think
> the casting of an unsigned int to signed int and adding to a long is
> not preserving the sign, and results in a large positive value rather
> than the correct sum of zero.
So very close, yet so far...
> I suspect the bug surfaced as a side effect of this commit:
>
> commit e6fe3f422be128b7d65de607f6ae67bedc55f0ca
> Author: Alexey Dobriyan <adobriyan@...il.com>
> Date: Thu Apr 22 23:02:28 2021 +0300
>
> sched: Make multiple runqueue task counters 32-bit
>
> Make:
>
> struct dl_rq::dl_nr_migratory
> struct dl_rq::dl_nr_running
>
> struct rt_rq::rt_nr_boosted
> struct rt_rq::rt_nr_migratory
> struct rt_rq::rt_nr_total
>
> struct rq::nr_uninterruptible
>
> 32-bit.
>
> If total number of tasks can't exceed 2**32 (and less due to futex pid
> limits), then per-runqueue counters can't as well.
>
> This patchset has been sponsored by REX Prefix Eradication Society.
> ...
>
> which changed the counter nr_uninterruptible from unsigned long to unsigned
> int.
>
> Since nr_uninterruptible can be a positive or negative number, change
> the type from unsigned int to signed int.
(Strictly speaking it's making things worse, since signed overflow is UB
in regular C -- luckily we kernel folks have our own dialect and signed
and unsigned are both expected to wrap 2s-complement).
Also, we're already casting to (int) in the only place where we consume
the value. So changing the type should make no difference whatsoever,
right?
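To see why, here is a minimal user-space sketch, not kernel code. The function names, the three-CPU setup and the per-CPU numbers are made up for illustration. It assumes the usual wrapping unsigned-to-signed conversion that GCC and Clang implement, and it uses int64_t to stand in for a 64-bit long.

```c
#include <stdint.h>

enum { NR_CPUS = 3 };

/*
 * Hypothetical net (sleeps - wakeups) accounted on each CPU.
 * The true total over all CPUs is 0.
 */
static const int64_t true_delta[NR_CPUS] = {
	-3000000000LL, 1500000000LL, 1500000000LL
};

/* Field declared 'unsigned int', consumed through an (int) cast. */
static int64_t fold_unsigned_field(void)
{
	int64_t sum = 0;

	for (int i = 0; i < NR_CPUS; i++) {
		uint32_t nr = (uint32_t)true_delta[i];	/* wraps mod 2^32 */

		sum += (int32_t)nr;	/* the (int) cast from the fold */
	}
	return sum;
}

/* Field declared 'int': the same 32 bits, so the same folded value. */
static int64_t fold_signed_field(void)
{
	int64_t sum = 0;

	for (int i = 0; i < NR_CPUS; i++) {
		int32_t nr = (int32_t)(uint32_t)true_delta[i];

		sum += nr;
	}
	return sum;
}
```

With these numbers both functions return 4294967296 (2^32) instead of 0. The CPU that saw -3000000000 no longer fits in 32 bits, so its value comes back as +1294967296, and the wide sum ends up off by a multiple of 2^32 regardless of the signedness of the field.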
> Another possible solution would be to partially rollback e6fe3f422be1,
> and change nr_uninterruptible back to unsigned long.
I think I prefer this.
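A rough user-space sketch of why the wider type helps, again not kernel code. The function name and the per-CPU numbers are made up for illustration, and uint64_t stands in for unsigned long on a 64-bit machine.

```c
#include <stdint.h>

enum { NR_WIDE_CPUS = 3 };

/*
 * Hypothetical net (sleeps - wakeups) accounted on each CPU.
 * The true total over all CPUs is 0.
 */
static const int64_t wide_delta[NR_WIDE_CPUS] = {
	-3000000000LL, 1500000000LL, 1500000000LL
};

/* Per-CPU counters kept as 64-bit unsigned values and summed as such. */
static int64_t fold_wide_field(void)
{
	uint64_t sum = 0;

	for (int i = 0; i < NR_WIDE_CPUS; i++) {
		uint64_t nr = (uint64_t)wide_delta[i];	/* wraps mod 2^64 */

		sum += nr;	/* unsigned addition also wraps mod 2^64 */
	}
	return (int64_t)sum;
}
```

Here the fold returns 0, the true total. Individual counters still wrap, but they wrap modulo 2^64, the same width as the accumulator, so the per-CPU errors cancel in the sum.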