Message-ID: <CAJuCfpGDw7LLs2dTa+9F4J8ZaSV2YMq=-LPgOmNgrgL4P84V_Q@mail.gmail.com>
Date: Tue, 6 Aug 2024 08:56:09 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Max Kellermann <max.kellermann@...os.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org
Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
On Mon, Aug 5, 2024 at 5:35 AM Max Kellermann <max.kellermann@...os.com> wrote:
>
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > I think you can check if this theory pans out by adding a WARN_ON() at
> > the end of psi_task_switch():
> >
> > void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> >                      bool sleep)
> > {
> > ...
> >         if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
> >                 clear &= ~TSK_ONCPU;
> >                 for (; group; group = group->parent)
> >                         psi_group_change(group, cpu, clear, set, now,
> >                                          wake_clock);
> >         }
> > +       WARN_ON(prev->__state & TASK_DEAD && prev->psi_flags & TSK_MEMSTALL);
> > }
>
> Our servers have been running with the experimental WARN_ON() line you
> suggested, and today I found that one of them has produced more than
> 300 of these warnings since it was rebooted yesterday:
>
> ------------[ cut here ]------------
> WARNING: CPU: 38 PID: 448145 at kernel/sched/psi.c:992
> psi_task_switch+0x114/0x218
> Modules linked in:
> CPU: 38 PID: 448145 Comm: php-cgi8.1 Not tainted 6.9.12-cm4all1-ampere+ #178
> Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023
> pstate: 404000c9 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : psi_task_switch+0x114/0x218
> lr : psi_task_switch+0x98/0x218
> sp : ffff8000c5493c80
> x29: ffff8000c5493c80 x28: ffff0837ccd18640 x27: ffff07ff81ee3300
> x26: ffff0837ccd18000 x25: 0000000000000000 x24: 0000000000000001
> x23: 000000000000001c x22: 0000000000000026 x21: 00003010d610f448
> x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
> x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> x14: 0000000000000004 x13: ffff08072ca62000 x12: ffffc22f32e1a000
> x11: 0000000000000001 x10: 0000000000000026 x9 : ffffc22f3129b150
> x8 : ffffc22f32e1aa88 x7 : 000000000000000c x6 : 0000d7ed3b360390
> x5 : ffff08faff6fb88c x4 : 0000000000000000 x3 : 0000000000e9de78
> x2 : 000000008ff70300 x1 : 000000008ff8d518 x0 : 0000000000000002
> Call trace:
> psi_task_switch+0x114/0x218
> __schedule+0x390/0xbc8
> do_task_dead+0x64/0xa0
> do_exit+0x5ac/0x9c0
> __arm64_sys_exit+0x1c/0x28
> invoke_syscall.constprop.0+0x54/0xf0
> do_el0_svc+0xa4/0xc8
> el0_svc+0x18/0x58
> el0t_64_sync_handler+0xf8/0x128
> el0t_64_sync+0x14c/0x150
> ---[ end trace 0000000000000000 ]---
>
> And indeed, it has a constant (and bogus) memory pressure value:
>
> # cat /proc/pressure/memory
> some avg10=99.99 avg60=98.65 avg300=98.70 total=176280880996
> full avg10=98.16 avg60=96.70 avg300=96.82 total=173950123267
Hmm. The original scenario I was thinking about when I proposed this
WARN_ON() was deemed impossible, so I think the only other possibility
is that the task being killed somehow skipped psi_memstall_leave()
before its death... Did you have the instrumentation I suggested to
track imbalance between psi_memstall_enter()/psi_memstall_leave() and
to record the _RET_IP_? If so, did it trigger at all?
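
For reference, a rough sketch of the kind of debug instrumentation I
mean (the memstall_debug_* fields below are made up for illustration
and would have to be added to struct task_struct; this is not existing
kernel code):

	/* hypothetical debug-only fields in struct task_struct */
	unsigned int	memstall_debug_depth;	/* enter/leave nesting */
	unsigned long	memstall_debug_ip;	/* _RET_IP_ of last enter */

	void psi_memstall_enter(unsigned long *flags)
	{
	...
		current->memstall_debug_depth++;
		current->memstall_debug_ip = _RET_IP_;
	...
	}

	void psi_memstall_leave(unsigned long *flags)
	{
	...
		WARN_ON(!current->memstall_debug_depth);
		current->memstall_debug_depth--;
	...
	}

and then in do_exit(), before the final schedule, something like:

	WARN(current->memstall_debug_depth,
	     "exiting with memstall depth %u, last psi_memstall_enter() from %pS\n",
	     current->memstall_debug_depth,
	     (void *)current->memstall_debug_ip);

That would catch a task calling psi_memstall_enter() without a matching
psi_memstall_leave() before it dies, and %pS would show which caller
leaked it.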
>
> It's taken nearly two months, and none of the other servers have
> produced this; it seems to be a really rare bug.
>
> Max