linux-kernel - Re: Bad psi_group_cpu.tasks[NR

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAJuCfpG+5e2mELmc=kCdbKjDPBSb4rvo03bOSYeASZnsZPOF4w@mail.gmail.com>
Date: Wed, 12 Jun 2024 08:17:20 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Max Kellermann <max.kellermann@...os.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Peter Zijlstra <peterz@...radead.org>, 
	linux-kernel@...r.kernel.org
Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

On Tue, Jun 11, 2024 at 11:49 PM Max Kellermann
<max.kellermann@...os.com> wrote:
>
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > Instead I think what might be happening is that the task is terminated
> > while it's in memstall.
>
> How is it possible to terminate a task that's in memstall?
> This must be between psi_memstall_enter() and psi_memstall_leave(),
> but I had already checked all the callers and found nothing
> suspicious; no obvious way to escape the section without
> psi_memstall_leave(). In my understanding, it's impossible to
> terminate a task that's currently stuck in the kernel. First, it needs
> to leave the kernel and go back to userspace, doesn't it?

Doh! I made an assumption that this can happen while it should not,
unless psi_memstall_enter()/psi_memstall_leave() are not balanced. My
bad.

Since the issue is hard to reproduce, maybe you could add debugging
code to store _RET_IP_ inside the task_struct at the end of
psi_memstall_enter() and clear it inside psi_memstall_leave(). Then in
do_exit() you check if it's still set and generate a warning reporting
recorded _RET_IP_. This should hint us to which psi_memstall_enter()
was missing its psi_memstall_leave().

>
> > I think if your theory was
> > correct and psi_task_change() was called while task's cgroup is
> > destroyed then task_psi_group() would have returned an invalid pointer
> > and we would crash once that value is dereferenced.
>
> I was thinking of something slightly different; something about the
> cgroup being deleted or a task being terminated and the bookkeeping of
> the PSI flags getting wrong, maybe some data race. I found the whole
> PSI code with per-task flags, per-cpu per-cgroup counters and flags
> somewhat obscure (but somebody else's code is always obscure, of
> course); I thought there was a lot of potential for mistakes with the
> bookkeeping, but I found nothing specific.
>
> Anyway, thanks for looking into this - I hope we can get a grip on
> this issue, as it's preventing me from using PSI values for actual
> process management; the servers that go into this state will always
> appear overloaded and that would lead to killing all the workload
> processes forever.
>
> Max