linux-kernel - Re: Bad psi_group_cpu.tasks[NR

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20240612094856.GV40213@noisy.programming.kicks-ass.net>
Date: Wed, 12 Jun 2024 11:48:56 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Max Kellermann <max.kellermann@...os.com>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
	Johannes Weiner <hannes@...xchg.org>, linux-kernel@...r.kernel.org
Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

On Wed, Jun 12, 2024 at 08:49:02AM +0200, Max Kellermann wrote:
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > Instead I think what might be happening is that the task is terminated
> > while it's in memstall.
> 
> How is it possible to terminate a task that's in memstall?
> This must be between psi_memstall_enter() and psi_memstall_leave(),
> but I had already checked all the callers and found nothing
> suspicious; no obvious way to escape the section without
> psi_memstall_leave(). In my understanding, it's impossible to
> terminate a task that's currently stuck in the kernel. First, it needs
> to leave the kernel and go back to userspace, doesn't it?

Yeah, this. I can't find anything obvious either. The trickiest one is
read-ahead though, I couldn't immediately track all the
readahead_expand() callers, any such caller must then end up calling
read_pages() in order to land on the psi_memstall_leave(). This is
typically through page_cache_ra_*().

The erofs one is also not entirely obvious, but irrelevant if you're not using
it... the below should make it a little more obvious, but what do I know.

(whitespace mangled)

--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1698,9 +1698,9 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
                        submit_bio(bio);
                else
                        erofs_fscache_submit_bio(bio);
-               if (memstall)
-                       psi_memstall_leave(&pflags);
        }
+       if (memstall)
+               psi_memstall_leave(&pflags);
 
        /*
         * although background is preferred, no one is pending for submission.


> > I think if your theory was
> > correct and psi_task_change() was called while task's cgroup is
> > destroyed then task_psi_group() would have returned an invalid pointer
> > and we would crash once that value is dereferenced.
> 
> I was thinking of something slightly different; something about the
> cgroup being deleted or a task being terminated and the bookkeeping of
> the PSI flags getting wrong, maybe some data race. I found the whole
> PSI code with per-task flags, per-cpu per-cgroup counters and flags
> somewhat obscure (but somebody else's code is always obscure, of
> course);

It really is a bit tricky, performance and all that.

> I thought there was a lot of potential for mistakes with the
> bookkeeping, but I found nothing specific.
> 
> Anyway, thanks for looking into this - I hope we can get a grip on
> this issue, as it's preventing me from using PSI values for actual
> process management; the servers that go into this state will always
> appear overloaded and that would lead to killing all the workload
> processes forever.

Best case would be if you could somehow find a reproducer, but
I realize this might be tricky.