Message-ID: <20250625044836.3939605-1-aruna.ramakrishna@oracle.com>
Date: Wed, 25 Jun 2025 04:48:36 +0000
From: Aruna Ramakrishna <aruna.ramakrishna@...cle.com>
To: linux-kernel@...r.kernel.org
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        vschneid@...hat.com
Subject: [RFC PATCH] sched: Change nr_uninterruptible from unsigned to signed int

We have encountered a bug where the load average displayed in top is
abnormally high and obviously incorrect. The real values look like this
(this is a production env, not a simulated one):

top - 13:54:24 up 68 days, 14:33,  7 users,  load average: 4294967298.80, 4294967298.55, 4294967298.58
Threads: 5764 total,   5 running, 5759 sleeping,   0 stopped,   0 zombie

Digging a bit into the vmcore:

crash> p calc_load_tasks
calc_load_tasks = $1 = {
  counter = 4294967297
}

which is:

crash> eval 4294967297
hexadecimal: 100000001

It seems like an overflow, since the value exceeds UINT_MAX (4294967295).

Checking further:

The nr_uninterruptible values for each of the CPU runqueues are large,
and when they are summed up, the sum exceeds UINT_MAX. The result is
stored in a long, which preserves this overflow.

long calc_load_fold_active(struct rq *this_rq, long adjust)
{
        long nr_active, delta = 0;

        nr_active = this_rq->nr_running - adjust;
        nr_active += (int)this_rq->nr_uninterruptible;
...
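
To illustrate what that fold does, here is a quick userspace sketch (not
kernel code; it assumes the usual two's-complement conversion and a
64-bit long): the unsigned 32-bit counter is reinterpreted as a signed
int before being widened, so a counter that has been decremented past
zero contributes a negative number rather than a huge positive one:

#include <stdio.h>

int main(void)
{
        unsigned int stored = 0xffffff9cu;      /* a per-CPU counter decremented 100 times past zero */
        long with_cast      = (int)stored;      /* sign-extends: -100 */
        long without_cast   = stored;           /* zero-extends: 4294967196 */

        printf("with (int) cast: %ld\n", with_cast);
        printf("without cast:    %ld\n", without_cast);
        return 0;
}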

From the vmcore, summing the per-CPU values with drgn:

>>> sum=0
>>> for cpu in for_each_online_cpu(prog):
...     rq = per_cpu(prog["runqueues"], cpu)
...     nr_unint = rq.nr_uninterruptible.value_()
...     sum += nr_unint
...     print(f"CPU {cpu}: nr_uninterruptible = {hex(nr_unint)}")
...     print(f"sum {hex(sum)}")
...
CPU 0: nr_uninterruptible = 0x638dd3
sum 0x638dd3
CPU 1: nr_uninterruptible = 0x129fb26
sum 0x18d88f9
CPU 2: nr_uninterruptible = 0xd8281f
sum 0x265b118
...
CPU 94: nr_uninterruptible = 0xe0a86
sum 0xfff1e855
CPU 95: nr_uninterruptible = 0xe17ab
sum 0x100000000

This is what we see stored in calc_load_tasks. The correct sum here
would be 0, since every increment on the CPU where a task went
uninterruptible should be matched by a decrement on the CPU that later
woke it up.

From kernel/sched/loadavg.c:

 *  - cpu_rq()->nr_uninterruptible isn't accurately tracked per-CPU because
 *    this would add another cross-CPU cacheline miss and atomic operation
 *    to the wakeup path. Instead we increment on whatever CPU the task ran
 *    when it went into uninterruptible state and decrement on whatever CPU
 *    did the wakeup. This means that only the sum of nr_uninterruptible over
 *    all CPUs yields the correct result.
 *
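
A toy model of that bookkeeping (plain userspace C with a made-up
two-CPU scenario): the CPU a task was on when it went uninterruptible
increments its own counter, the CPU that later does the wakeup
decrements its own, and only the sum across all CPUs comes out right:

#include <stdio.h>

#define NR_CPUS 2

static unsigned int nr_uninterruptible[NR_CPUS];    /* per-CPU, unsigned int as in struct rq today */

int main(void)
{
        long sum = 0;
        int cpu;

        nr_uninterruptible[0]++;        /* task goes uninterruptible while on CPU 0 */
        nr_uninterruptible[1]--;        /* the wakeup runs on CPU 1: wraps to 0xffffffff */

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                printf("CPU %d: nr_uninterruptible = %#x\n", cpu, nr_uninterruptible[cpu]);
                sum += (int)nr_uninterruptible[cpu];
        }
        printf("sum = %ld\n", sum);     /* 0: only the global sum is meaningful */
        return 0;
}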

It seems that rq->nr_uninterruptible can reach large (positive) values
on one CPU if a lot of tasks were migrated off of that CPU after going
into an uninterruptible state. If they are woken up on other CPUs, those
target CPUs end up with negative nr_uninterruptible values. I think the
casting of an unsigned int to a signed int and adding it to a long does
not preserve the sign here, and results in a large positive value rather
than the correct sum of zero.
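
To illustrate with a toy (non-kernel) program and made-up per-CPU
running totals: once a CPU's true total drifts beyond what 32 bits can
represent, the value actually stored no longer folds back to the true
sum, and the result can land on exactly 0x100000000 even though the
real sum is 0 (this assumes a 64-bit long, as on the affected system):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* Made-up "true" per-CPU running totals; they cancel exactly. */
        int64_t truth[3] = {
                -((1LL << 31) + 1),     /* wakeup-heavy CPU */
                -((1LL << 31) + 2),     /* another wakeup-heavy CPU */
                 (1LL << 32) + 3,       /* CPU tasks slept on before being woken elsewhere */
        };
        long folded = 0;                /* accumulated the way calc_load_fold_active() does */
        int64_t real = 0;
        int i;

        for (i = 0; i < 3; i++) {
                uint32_t stored = (uint32_t)truth[i];   /* what a 32-bit rq counter keeps */
                folded += (int)stored;                  /* sign-extend and accumulate */
                real   += truth[i];
        }

        printf("true sum   = %lld\n", (long long)real); /* 0 */
        printf("folded sum = %#lx\n", folded);          /* 0x100000000 */
        return 0;
}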

I suspect the bug surfaced as a side effect of this commit:

commit e6fe3f422be128b7d65de607f6ae67bedc55f0ca
Author: Alexey Dobriyan <adobriyan@...il.com>
Date:   Thu Apr 22 23:02:28 2021 +0300

    sched: Make multiple runqueue task counters 32-bit

    Make:

            struct dl_rq::dl_nr_migratory
            struct dl_rq::dl_nr_running

            struct rt_rq::rt_nr_boosted
            struct rt_rq::rt_nr_migratory
            struct rt_rq::rt_nr_total

            struct rq::nr_uninterruptible

    32-bit.

    If total number of tasks can't exceed 2**32 (and less due to futex pid
    limits), then per-runqueue counters can't as well.

    This patchset has been sponsored by REX Prefix Eradication Society.
...

which changed the counter nr_uninterruptible from unsigned long to unsigned
int.

Since nr_uninterruptible can be a positive or negative number, change
the type from unsigned int to signed int.

Another possible solution would be to partially roll back e6fe3f422be1
and change nr_uninterruptible back to unsigned long.

Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@...cle.com>
---
 kernel/sched/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..f6d21278e64e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1149,7 +1149,7 @@ struct rq {
 	 * one CPU and if it got migrated afterwards it may decrease
 	 * it on another CPU. Always updated under the runqueue lock:
 	 */
-	unsigned int		nr_uninterruptible;
+	int 			nr_uninterruptible;
 
 	union {
 		struct task_struct __rcu *donor; /* Scheduler context */

base-commit: 86731a2a651e58953fc949573895f2fa6d456841
prerequisite-patch-id: dd6db7012c5094dec89e689ba56fd3551d2b4a40
-- 
2.43.5

