linux-kernel - [RFC] Potential deadlock with PM and vmstat

Message-ID: <CAJx26kWEEuWL6G9+EQ7Y89HKR5kL1Qqa0iQguXW23-4aKPT8cA@mail.gmail.com>
Date:   Thu, 3 Nov 2016 15:35:26 -0700
From:   Justin Chen <justinpopo6@...il.com>
To:     linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org
Cc:     Florian Fainelli <f.fainelli@...il.com>
Subject: [RFC] Potential deadlock with PM and vmstat

Hello,

I am seeing a deadlock on my system when repeatedly cycling through the
PM (suspend/resume) sequence. The system locks up while taking the
non-boot CPUs down (CPU hotplug) with vmstat enabled. The problem
involves cpu_hotplug.lock.

In kernel/cpu.c:_cpu_down(), we begin bringing the CPU down. The
deadlock occurs while parking kthreads: parking hangs if the kthread
being parked is itself waiting on cpu_hotplug.lock, because that lock
is already held by the boot CPU, which took it in cpu_hotplug_begin().
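To make the circular wait described above explicit, here is a minimal wait-for graph sketch in C. The enum names, has_cycle() and scenario_deadlocks() are hypothetical illustrations, not kernel code; the only idea borrowed is the standard one that a cycle in a wait-for graph means deadlock.

```c
#include <assert.h>

/* Hypothetical model: the two tasks from the report, as graph nodes. */
enum { BOOT_CPU, VMSTAT_KTHREAD, NTASKS };

/* waits_for[t] is the task that t is blocked on, or -1 if t can run.
 * Each task waits on at most one other task, so following the edges
 * from any start either reaches a runnable task or revisits a task,
 * which means there is a cycle (a deadlock). */
static int has_cycle(const int waits_for[NTASKS], int start)
{
    int visited[NTASKS] = {0};

    assert(start >= 0 && start < NTASKS);
    for (int t = start; t != -1; t = waits_for[t]) {
        if (visited[t])
            return 1;
        visited[t] = 1;
    }
    return 0;
}

/* Builds the graph for the scenario in the report. The boot CPU holds
 * cpu_hotplug.lock and waits for the kthread to park. If the kthread is
 * blocked on cpu_hotplug.lock, it is waiting on that lock's owner, which
 * is the boot CPU. */
static int scenario_deadlocks(int kthread_blocked_on_hotplug_lock)
{
    int waits_for[NTASKS];

    waits_for[BOOT_CPU] = VMSTAT_KTHREAD;               /* smpboot_park_threads() */
    waits_for[VMSTAT_KTHREAD] =
        kthread_blocked_on_hotplug_lock ? BOOT_CPU : -1; /* get_online_cpus() */
    return has_cycle(waits_for, BOOT_CPU);
}
```

scenario_deadlocks(1) returns 1 (boot CPU -> kthread -> boot CPU), while scenario_deadlocks(0) returns 0, since a kthread that is not waiting on cpu_hotplug.lock can reach its park point.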

Here is the sequence I am seeing (4.1 kernel):
CPU0 goes into the suspend sequence and drops into kernel/cpu.c:_cpu_down().
CPU0 calls cpu_hotplug_begin() and grabs the cpu_hotplug.lock.
CPU0 blocks at smpboot_park_threads(...) waiting for kthreads to be stopped.

CPU1 is running a kthread started by vmstat in
mm/vmstat.c:vmstat_shepherd(). That kthread calls get_online_cpus(),
tries to take cpu_hotplug.lock, and blocks, so it never reaches the
point where it can be parked.
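A rough user-space model of this sequence, assuming POSIX threads: a pthread mutex stands in for cpu_hotplug.lock, and vmstat_worker() and model_cpu_down() are hypothetical stand-ins for the vmstat kthread and _cpu_down(). It does not reproduce the kernel's real locking or CPU affinity; it only shows the ordering problem. A one-second timed wait replaces the real, unbounded wait so the model reports the hang instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical user-space stand-ins; none of these are kernel APIs. */
static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER; /* "cpu_hotplug.lock" */
static pthread_mutex_t park_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t park_cv = PTHREAD_COND_INITIALIZER;
static int park_requested, parked;

/* Models the vmstat kthread: it takes the hotplug lock (as
 * get_online_cpus() would) before it checks for a park request. */
static void *vmstat_worker(void *arg)
{
    struct timespec tick = {0, 1000000}; /* 1 ms between iterations */

    (void)arg;
    for (;;) {
        pthread_mutex_lock(&hotplug_lock);   /* "get_online_cpus()" */
        pthread_mutex_unlock(&hotplug_lock); /* "put_online_cpus()" */

        pthread_mutex_lock(&park_mtx);
        if (park_requested) {                /* park point */
            parked = 1;
            pthread_cond_signal(&park_cv);
            pthread_mutex_unlock(&park_mtx);
            return NULL;
        }
        pthread_mutex_unlock(&park_mtx);
        nanosleep(&tick, NULL);
    }
}

/* Models the boot CPU in _cpu_down(). Returns 1 if the worker parked
 * within one second, 0 if the park request timed out (the lock-up). */
static int model_cpu_down(int hold_lock_while_parking)
{
    pthread_t tid;
    struct timespec deadline;
    int ok = 1, rc;

    park_requested = 0;
    parked = 0;
    if (hold_lock_while_parking)
        pthread_mutex_lock(&hotplug_lock);   /* "cpu_hotplug_begin()" */
    rc = pthread_create(&tid, NULL, vmstat_worker, NULL);
    assert(rc == 0);
    (void)rc;

    /* "smpboot_park_threads()": request the park, then wait for it. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;
    pthread_mutex_lock(&park_mtx);
    park_requested = 1;
    while (!parked && ok)
        if (pthread_cond_timedwait(&park_cv, &park_mtx, &deadline) != 0)
            ok = 0; /* timed out: worker never reached its park point */
    pthread_mutex_unlock(&park_mtx);

    /* Unlike the hung system, release the lock so the model can exit. */
    if (hold_lock_while_parking)
        pthread_mutex_unlock(&hotplug_lock);
    pthread_join(tid, NULL);
    return ok;
}
```

Built with -pthread, model_cpu_down(1) returns 0 after about a second, because the worker is stuck on the lock the "boot CPU" holds; model_cpu_down(0) returns 1, because without the lock held the worker reaches its park point.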

If I understand this correctly, this deadlock can occur whenever
kthreads are parked while cpu_hotplug.lock is held. I haven't tested
this on the most recent kernel (4.9-rc3), but the conditions for the
deadlock seem to still exist there, although the calls happen in a
different sequence.

If this seems like a valid issue, I will try to put together a patch
to address it. Suggestions welcome!

Thanks,
Justin