linux-kernel - Query regarding deadlock involving cgroup_threadgroup_rwsem and cpu_hotplug

Message-ID: <26d0e4cc-be0e-2c12-6174-dfbb1edb1ed6@oracle.com>
Date:   Wed, 20 Jul 2022 12:38:50 +1000
From:   Imran Khan <imran.f.khan@...cle.com>
To:     Tejun Heo <tj@...nel.org>,
        lizefan.x@...edance.com,
        Johannes Weiner <hannes@...xchg.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        steven.price@....com,
        peterz <peterz@...radead.org>
Cc:     cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Query regarding deadlock involving cgroup_threadgroup_rwsem and
 cpu_hotplug_lock

Hello everyone,

I am seeing a deadlock between cgroup_threadgroup_rwsem and cpu_hotplug_lock on
a 5.4 kernel.

Because of some missing drivers I cannot run this test setup on the latest
upstream kernel, but from reading the code the issue seems to be present there
as well. If needed, I can provide stack traces and other relevant information
from the vmcore that I collected on the 5.4 setup.

The description of the problem is as follows (I am using 5.19-rc7 as reference
below):

__cgroup_procs_write acquires cgroup_threadgroup_rwsem via
cgroup_procs_write_start and then invokes cgroup_attach_task.
cgroup_attach_task can then go through the following call chain:

cgroup_attach_task --> cgroup_migrate --> cgroup_migrate_execute --> cpuset_attach

Here cpuset_attach tries to take cpu_hotplug_lock.

But if, by this time, some other context

1. is already in the middle of CPU hotplug and has acquired cpu_hotplug_lock in
   _cpu_up,
2. has not yet reached the CPUHP_ONLINE state, and
3. is running an intermediate hotplug state (in my case CPUHP_AP_ONLINE_DYN)
   whose callback creates a thread (or invokes copy_process via some other
   path),

then the invoked copy_process blocks on cgroup_threadgroup_rwsem in the
following call chain:

   copy_process --> cgroup_can_fork --> cgroup_css_set_fork -->
   cgroup_threadgroup_change_begin

At this point neither side can make progress: the cgroup writer holds
cgroup_threadgroup_rwsem and waits for cpu_hotplug_lock, while the hotplug
path holds cpu_hotplug_lock and waits for a fork that is blocked on
cgroup_threadgroup_rwsem.
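
To make the ordering problem concrete, here is a minimal userspace sketch of
the two paths as a lock-order graph, loosely in the spirit of what lockdep
tracks. It is not kernel code. The names order_graph, record_order, reachable,
has_inversion and model_reported_scenario are hypothetical and exist only for
this illustration. The model also simplifies things: it treats every
acquisition as exclusive (ignoring the read/write sides of both locks) and it
folds the hotplug lock holder and the thread it waits on into one context.

```c
#include <assert.h>
#include <stdbool.h>

/* Only the two locks named in this report are modelled. */
enum lock_id {
	CGROUP_THREADGROUP_RWSEM,
	CPU_HOTPLUG_LOCK,
	NR_LOCKS
};

/* edge[a][b] is true if some context waited for b while holding a. */
struct order_graph {
	bool edge[NR_LOCKS][NR_LOCKS];
};

/* Record that a context holding "held" went on to wait for "wanted". */
static void record_order(struct order_graph *g, enum lock_id held,
			 enum lock_id wanted)
{
	assert(held < NR_LOCKS && wanted < NR_LOCKS);
	g->edge[held][wanted] = true;
}

/* Depth-first search: is "to" reachable from "from" along recorded edges? */
static bool reachable(const struct order_graph *g, int from, int to,
		      bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCKS; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    reachable(g, next, to, seen))
			return true;
	}
	return false;
}

/* An inversion exists if an edge a -> b is matched by a path b -> ... -> a. */
static bool has_inversion(const struct order_graph *g)
{
	for (int a = 0; a < NR_LOCKS; a++) {
		for (int b = 0; b < NR_LOCKS; b++) {
			bool seen[NR_LOCKS] = { false };

			if (g->edge[a][b] && reachable(g, b, a, seen))
				return true;
		}
	}
	return false;
}

/* Hypothetical helper: encodes the two paths described in this mail. */
static struct order_graph model_reported_scenario(void)
{
	struct order_graph g = { 0 };

	/*
	 * __cgroup_procs_write: holds cgroup_threadgroup_rwsem, then
	 * cpuset_attach wants cpu_hotplug_lock.
	 */
	record_order(&g, CGROUP_THREADGROUP_RWSEM, CPU_HOTPLUG_LOCK);

	/*
	 * _cpu_up: holds cpu_hotplug_lock, then a hotplug callback's
	 * copy_process wants cgroup_threadgroup_rwsem.
	 */
	record_order(&g, CPU_HOTPLUG_LOCK, CGROUP_THREADGROUP_RWSEM);

	return g;
}
```

Calling has_inversion() on the graph returned by model_reported_scenario()
reports the cycle, whereas a graph holding only one of the two orderings does
not. A fix would therefore have to remove one of the two edges, for example by
making both paths take the locks in the same order.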

I am looking for suggestions to fix this deadlock.

Or, if I am missing something in the above analysis and the above-mentioned
scenario can't happen in the latest upstream kernel, then please let me know, as
that would help me in backporting the relevant changes to the 5.4 kernel, where
the issue definitely exists.

Thanks,
-- Imran