Date: Tue, 11 Jun 2024 22:01:29 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Max Kellermann <max.kellermann@...os.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Peter Zijlstra <peterz@...radead.org>, 
	linux-kernel@...r.kernel.org
Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

On Tue, Jun 4, 2024 at 12:16 AM Max Kellermann <max.kellermann@...os.com> wrote:
>
> Hi kernel people,
> I have a problem that I have been trying to debug for a few days, but
> I got lost in the depths of the scheduler code; I'm stuck and I need
> your help.
> We have several servers which show a constant memory.pressure value of
> 30 to 100 (or more), even when the server is idle. I tracked this down
> to psi_group_cpu.tasks[NR_MEMSTALL]==1 even though no such process
> exists, but I can't figure out why the kernel thinks there is still
> one task stuck in memstall. I tried to track down all the code paths
> that lead to psi_group_change(), but found nothing conclusive, and
> failed to reproduce it on a test machine with kernel patches injecting
> delays (trying to cause data race bugs that may have caused this
> problem).
>
> This happened on servers that were very busy and indeed were in
> memstall often due to going over memory.high frequently. We have one
> "main" cgroup with memory.high configured, and all the workload
> processes live in sub-cgroups, of which we always have a few thousand.
> When memory.events gets triggered, our process manager stops a bunch
> of idle processes to free up memory, and the sub-cgroups they belong
> to are then deleted. In other words: sub-cgroups get created and deleted
> very often, and they get deleted when there is indeed memory stall
> happening. My theory was that there could be a data race bug that
> forgets to decrement tasks[NR_MEMSTALL], maybe when a stalled child
> cgroup gets deleted.

Hi Max,
I'm not an expert in the scheduler (I mostly maintain the PSI triggers),
so my feedback might be utterly wrong.
I looked a bit into the relevant code, and I think that if your theory
were correct and psi_task_change() were called while the task's cgroup
is being destroyed, then task_psi_group() would return an invalid
pointer and we would crash as soon as that value is dereferenced.
Instead, I think what might be happening is that the task is terminated
while it's in memstall. do_exit() calls do_task_dead() at the very end,
which sets current->__state to TASK_DEAD and calls the last
__schedule() for this task. __schedule() will call deactivate_task(rq,
prev, DEQUEUE_SLEEP), which sets prev->on_rq = 0 and calls
dequeue_task(..., DEQUEUE_SLEEP), leading to psi_dequeue(..., true).
Note that because that last parameter of psi_dequeue() is "true",
psi_task_change() will not be called at this point. Later on,
__schedule() calls psi_sched_switch(), which leads to
psi_task_switch(), and note that its last parameter will be "true"
because prev->on_rq == 0. So we end up calling psi_task_switch(..., true).
Now, note this line:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L955.
It clears TSK_MEMSTALL_RUNNING but not TSK_MEMSTALL. So, if the task
was in the TSK_MEMSTALL state, that state won't be cleared, which
might be the problem you are facing.
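
To make the bookkeeping concrete, here is a minimal user-space model
(not kernel code; the flag values and counter names merely mirror the
kernel's) of how skipping that clear would leave the per-CPU counter
stuck:

#include <stdio.h>

/* Illustrative flag values only; the real ones live in psi_types.h. */
#define TSK_MEMSTALL          (1 << 1)  /* task is in memstall */
#define TSK_RUNNING           (1 << 2)  /* task is runnable */
#define TSK_MEMSTALL_RUNNING  (1 << 4)  /* runnable while in memstall */

enum { NR_MEMSTALL, NR_RUNNING, NR_MEMSTALL_RUNNING, NR_COUNTS };

static int tasks[NR_COUNTS];            /* models psi_group_cpu.tasks[] */

/* models the counter updates done by psi_group_change() */
static void group_change(unsigned int clear, unsigned int set)
{
        if (clear & TSK_MEMSTALL)         tasks[NR_MEMSTALL]--;
        if (clear & TSK_RUNNING)          tasks[NR_RUNNING]--;
        if (clear & TSK_MEMSTALL_RUNNING) tasks[NR_MEMSTALL_RUNNING]--;
        if (set & TSK_MEMSTALL)           tasks[NR_MEMSTALL]++;
        if (set & TSK_RUNNING)            tasks[NR_RUNNING]++;
        if (set & TSK_MEMSTALL_RUNNING)   tasks[NR_MEMSTALL_RUNNING]++;
}

int main(void)
{
        /* a runnable task enters memstall */
        group_change(0, TSK_MEMSTALL | TSK_MEMSTALL_RUNNING | TSK_RUNNING);

        /*
         * The task dies while still in memstall: the final switch-out
         * (sleep == true) clears the running bits, but per the theory
         * above TSK_MEMSTALL is never cleared for the dead task.
         */
        group_change(TSK_RUNNING | TSK_MEMSTALL_RUNNING, 0);

        /* prints 1: the counter stays stuck with no task behind it */
        printf("tasks[NR_MEMSTALL] = %d\n", tasks[NR_MEMSTALL]);
        return 0;
}

That leftover count of 1 is exactly the kind of value you see stuck in
tasks[NR_MEMSTALL] on a single CPU.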
I think you can check whether this theory pans out by adding a WARN_ON()
at the end of psi_task_switch():

void psi_task_switch(struct task_struct *prev, struct task_struct *next,
                     bool sleep)
{
...
        if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
                clear &= ~TSK_ONCPU;
                for (; group; group = group->parent)
                        psi_group_change(group, cpu, clear, set, now,
                                         wake_clock);
        }
+        WARN_ON(prev->__state & TASK_DEAD && prev->psi_flags & TSK_MEMSTALL);
}
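
If the warning fires, a variant that also prints the pid and the
leftover flags might be easier to correlate with a specific task (an
untested sketch for the same spot):

+        WARN((prev->__state & TASK_DEAD) && (prev->psi_flags & TSK_MEMSTALL),
+             "dead task %d exits with psi_flags=0x%x\n",
+             prev->pid, prev->psi_flags);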

Again, I am by no means an expert in this area. Johannes or Peter
would be much better people to consult with.
Thanks,
Suren.

> On our Grafana, I can easily track the beginning of this bug to a
> point two weeks ago; in the system log, I can see that hundreds of
> processes needed to be terminated due to memory pressure at that time.
>
> The affected servers run kernel 6.8.7 with a few custom patches, but
> none of these patches affect the scheduler or cgroups; they're about
> unrelated things like denying access to Ceph snapshots and adjusting
> debugfs permissions. (I submitted most of those patches to LKML long
> ago but nobody cared.)
> Newer kernels don't seem to have fixes for my problem; the relevant
> parts of the code are unchanged.
>
> One of the servers is still running with this problem, and I can
> access it with gdb on /proc/kcore. I'll keep it that way for some more
> time, so if you have any idea what to look for, let me know.
>
> This is the psi_group of the "main" cgroup:
>
>  $1 = {parent = 0xffff9de707287800, enabled = true,
>    avgs_lock = {owner = {counter = 0},
>      wait_lock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000', pending = 0 '\000'}, {locked_pending = 0, tail = 0}}}},
>      osq = {tail = {counter = 0}},
>      wait_list = {next = 0xffff9de70f772820, prev = 0xffff9de70f772820}},
>    pcpu = 0x3fb640033900,
>    avg_total = {6133960836647, 5923217690044, 615319825665255, 595479374843164, 19259777147170, 12847590051880},
>    avg_last_update = 1606208471280060, avg_next_update = 1606210394507082,
>    avgs_work = {work = {data = {counter = 321},
>        entry = {next = 0xffff9de70f772880, prev = 0xffff9de70f772880},
>        func = 0xffffffff880dcc00},
>      timer = {entry = {next = 0x0 <fixed_percpu_data>, pprev = 0xffff9e05bef5bc48},
>        expires = 4455558105, function = 0xffffffff880a1ca0, flags = 522190853},
>      wq = 0xffff9de700051400, cpu = 64},
>    avg_triggers = {next = 0xffff9de70f7728d0, prev = 0xffff9de70f7728d0},
>    avg_nr_triggers = {0, 0, 0, 0, 0, 0},
>    total = {{6133960836647, 5923217690044, 615328415599847, 595487964777756, 19281251983650, 12869064888360},
>      {6092994926, 5559819737, 105947464151, 100672353730, 8196529519, 7678536634}},
>    avg = {{0, 0, 0}, {0, 0, 0}, {203596, 203716, 198499}, {203596, 203716, 198288}, {0, 0, 60}, {0, 0, 0}},
>    rtpoll_task = 0x0 <fixed_percpu_data>,
>    rtpoll_timer = {entry = {next = 0xdead000000000122, pprev = 0x0 <fixed_percpu_data>},
>      expires = 4405010639, function = 0xffffffff880dac50, flags = 67108868},
>    rtpoll_wait = {lock = {{rlock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000', pending = 0 '\000'}, {locked_pending = 0, tail = 0}}}}}},
>      head = {next = 0xffff9de70f772a20, prev = 0xffff9de70f772a20}},
>    rtpoll_wakeup = {counter = 0}, rtpoll_scheduled = {counter = 0},
>    rtpoll_trigger_lock = {owner = {counter = 0},
>      wait_lock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000', pending = 0 '\000'}, {locked_pending = 0, tail = 0}}}},
>      osq = {tail = {counter = 0}},
>      wait_list = {next = 0xffff9de70f772a48, prev = 0xffff9de70f772a48}},
>    rtpoll_triggers = {next = 0xffff9de70f772a58, prev = 0xffff9de70f772a58},
>    rtpoll_nr_triggers = {0, 0, 0, 0, 0, 0},
>    rtpoll_states = 0, rtpoll_min_period = 18446744073709551615,
>    rtpoll_total = {6092994926, 5559819737, 105947464151, 100672353730, 8196529519, 7678536634},
>    rtpoll_next_update = 1100738436720135, rtpoll_until = 0}
>
> This is a summary of all psi_group_pcpu entries for the 32 CPU cores
> (along the way, I wrote a small gdb script to dump interesting details
> like these, but that went nowhere):
>
>   state_mask 0 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 1 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 2 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 3 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 4 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 5 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 6 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 7 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 8 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 9 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 10 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 11 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 12 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 13 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 14 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 15 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 16 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 17 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 18 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 19 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 20 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 21 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 22 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 23 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 24 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 25 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 26 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 27 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 28 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 29 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 30 = 0x4c tasks {0, 1, 0, 0}
>   state_mask 31 = 0x0 tasks {0, 0, 0, 0}
>
> CPU core 30 is stuck with this bogus value. state_mask 0x4c =
> PSI_MEM_SOME|PSI_MEM_FULL|PSI_NONIDLE.
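
For reference, that decode follows directly from the state_mask bit
layout; here is a minimal stand-alone sketch of it (bit positions are
assumed to match the decode above, i.e. a kernel where PSI_NONIDLE is
bit 6; the authoritative values are the kernel's enum psi_states):

#include <stdio.h>

/* Assumed bit order, mirroring enum psi_states without IRQ pressure. */
static const char *state_names[] = {
        "PSI_IO_SOME", "PSI_IO_FULL", "PSI_MEM_SOME", "PSI_MEM_FULL",
        "PSI_CPU_SOME", "PSI_CPU_FULL", "PSI_NONIDLE",
};

int main(void)
{
        unsigned int state_mask = 0x4c;   /* the stuck value on CPU 30 */

        /* prints: PSI_MEM_SOME PSI_MEM_FULL PSI_NONIDLE */
        for (unsigned int bit = 0; bit < 7; bit++)
                if (state_mask & (1u << bit))
                        printf("%s ", state_names[bit]);
        printf("\n");
        return 0;
}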
>
> The memory pressure at the time of this writing:
>
>  # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/memory.pressure
>  some avg10=99.22 avg60=99.39 avg300=97.62 total=615423620626
>  full avg10=99.22 avg60=99.39 avg300=97.54 total=595583169804
>  # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/_/memory.pressure
>  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
>  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
>  # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/cgroup.stat
>  nr_descendants 1
>  nr_dying_descendants 2224
>
> There is currently no worker process; there is only one idle dummy
> process in a single sub-cgroup called "_", which exists only to keep
> the systemd scope populated. It should therefore be impossible for
> memory.pressure to be non-zero when the only leaf cgroup has pressure=0.
>
> (nr_dying_descendants is decremented extremely slowly; I deactivated
> the server shortly before collecting these numbers, to make sure it's
> really idle and there are really no processes left to cause this
> pressure. I don't think nr_dying_descendants is relevant for this
> problem; even after two days of full idle, the counter and the
> pressure didn't go back to zero.)
>
> Please help :-)
>
> Max
