Message-ID: <CAKPOu+-DdwTCFDjW+ykKM5Da5wmLW3gSx5=x+fsSdaMEwUuvJw@mail.gmail.com>
Date: Mon, 5 Aug 2024 14:34:52 +0200
From: Max Kellermann <max.kellermann@...os.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org
Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> I think you can check if this theory pans out by adding a WARN_ON() at
> the end of psi_task_switch():
>
> void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>                      bool sleep)
> {
> ...
>         if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
>                 clear &= ~TSK_ONCPU;
>                 for (; group; group = group->parent)
>                         psi_group_change(group, cpu, clear, set, now,
>                                          wake_clock);
>         }
> +       WARN_ON(prev->__state & TASK_DEAD && prev->psi_flags & TSK_MEMSTALL);
> }
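(For context, and purely as my own illustration rather than anything quoted
from the kernel source: the invariant this WARN_ON checks is that a memstall
section is strictly bracketed, so a task should never reach TASK_DEAD with
TSK_MEMSTALL still set in psi_flags; if one does, that would presumably
explain a tasks[NR_MEMSTALL] counter that stays elevated forever, as in the
subject line. A minimal sketch of the expected pairing, assuming the usual
psi_memstall_enter()/psi_memstall_leave() usage:)

#include <linux/psi.h>

/*
 * Illustrative only, not a real call site: the flag is set on entry and
 * must be cleared again on the same task before it can safely exit.
 */
static void example_memstall_section(void)
{
        unsigned long pflags;

        psi_memstall_enter(&pflags);    /* marks current as in memstall */

        /* ... work that stalls on memory (reclaim, compaction, ...) ... */

        psi_memstall_leave(&pflags);    /* clears the flag again */
}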
Our servers have been running with the experimental WARN_ON you
suggested, and today I found that one of them has produced more than
300 of these warnings since it was rebooted yesterday:
------------[ cut here ]------------
WARNING: CPU: 38 PID: 448145 at kernel/sched/psi.c:992
psi_task_switch+0x114/0x218
Modules linked in:
CPU: 38 PID: 448145 Comm: php-cgi8.1 Not tainted 6.9.12-cm4all1-ampere+ #178
Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023
pstate: 404000c9 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : psi_task_switch+0x114/0x218
lr : psi_task_switch+0x98/0x218
sp : ffff8000c5493c80
x29: ffff8000c5493c80 x28: ffff0837ccd18640 x27: ffff07ff81ee3300
x26: ffff0837ccd18000 x25: 0000000000000000 x24: 0000000000000001
x23: 000000000000001c x22: 0000000000000026 x21: 00003010d610f448
x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000004 x13: ffff08072ca62000 x12: ffffc22f32e1a000
x11: 0000000000000001 x10: 0000000000000026 x9 : ffffc22f3129b150
x8 : ffffc22f32e1aa88 x7 : 000000000000000c x6 : 0000d7ed3b360390
x5 : ffff08faff6fb88c x4 : 0000000000000000 x3 : 0000000000e9de78
x2 : 000000008ff70300 x1 : 000000008ff8d518 x0 : 0000000000000002
Call trace:
psi_task_switch+0x114/0x218
__schedule+0x390/0xbc8
do_task_dead+0x64/0xa0
do_exit+0x5ac/0x9c0
__arm64_sys_exit+0x1c/0x28
invoke_syscall.constprop.0+0x54/0xf0
do_el0_svc+0xa4/0xc8
el0_svc+0x18/0x58
el0t_64_sync_handler+0xf8/0x128
el0t_64_sync+0x14c/0x150
---[ end trace 0000000000000000 ]---
And indeed, that machine reports a constant (and bogus) memory pressure value:
# cat /proc/pressure/memory
some avg10=99.99 avg60=98.65 avg300=98.70 total=176280880996
full avg10=98.16 avg60=96.70 avg300=96.82 total=173950123267
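(As an aside, and just a rough sketch of my own for anyone who wants to scan
other machines for this: the symptom is simply a "full" avg10 that stays
pinned near 100 in /proc/pressure/memory on an otherwise idle box, so
something like the following userspace check would flag it; the 90.0
threshold is an arbitrary choice of mine.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/pressure/memory", "r");

        if (!f) {
                perror("fopen /proc/pressure/memory");
                return EXIT_FAILURE;
        }

        /* Each line looks like: "full avg10=98.16 avg60=... total=..." */
        while (fgets(line, sizeof(line), f)) {
                double avg10;

                if (strncmp(line, "full", 4) == 0 &&
                    sscanf(line, "full avg10=%lf", &avg10) == 1 &&
                    avg10 > 90.0)
                        printf("suspicious: full avg10=%.2f\n", avg10);
        }

        fclose(f);
        return EXIT_SUCCESS;
}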
It took nearly two months to hit this, and none of the other servers
has produced the warning; this seems to be a really rare bug.
Max