Message-ID: <a2fc3448-dd5c-42fe-ac21-c8e1c10e94b4@redhat.com>
Date: Sat, 31 Jan 2026 18:13:09 -0500
From: Waiman Long <llong@...hat.com>
To: Chen Ridong <chenridong@...weicloud.com>, Tejun Heo <tj@...nel.org>,
Johannes Weiner <hannes@...xchg.org>, Michal Koutný
<mkoutny@...e.com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Anna-Maria Behnsen <anna-maria@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>, Shuah Khan <shuah@...nel.org>
Cc: cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-kselftest@...r.kernel.org
Subject: Re: [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level
cpuset_top_mutex

On 1/30/26 9:53 PM, Chen Ridong wrote:
>
> On 2026/1/30 23:42, Waiman Long wrote:
>> The current cpuset partition code is able to dynamically update
>> the sched domains of a running system and the corresponding
>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
>> "isolcpus=domain,..." boot command line feature at run time.
>>
>> The housekeeping cpumask update requires flushing a number of different
>> workqueues which may not be safe with cpus_read_lock() held as the
>> workqueue flushing code may acquire cpus_read_lock() or acquire locks
>> which have a locking dependency with cpus_read_lock() down the chain.
>> Below is an example of such a circular locking problem.
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.18.0-test+ #2 Tainted: G S
>> ------------------------------------------------------
>> test_cpuset_prs/10971 is trying to acquire lock:
>> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
>>
>> but task is already holding lock:
>> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>> -> #4 (cpuset_mutex){+.+.}-{4:4}:
>> -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>> -> #2 (rtnl_mutex){+.+.}-{4:4}:
>> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>
>> Chain exists of:
>> (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>>
>> 5 locks held by test_cpuset_prs/10971:
>> #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>> #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
>> #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
>> #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
>> #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>>
>> Call Trace:
>> <TASK>
>> :
>> touch_wq_lockdep_map+0x93/0x180
>> __flush_workqueue+0x111/0x10b0
>> housekeeping_update+0x12d/0x2d0
>> update_parent_effective_cpumask+0x595/0x2440
>> update_prstate+0x89d/0xce0
>> cpuset_partition_write+0xc5/0x130
>> cgroup_file_write+0x1a5/0x680
>> kernfs_fop_write_iter+0x3df/0x5f0
>> vfs_write+0x525/0xfd0
>> ksys_write+0xf9/0x1d0
>> do_syscall_64+0x95/0x520
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> To avoid such a circular locking dependency problem, we have to
>> call housekeeping_update() without holding cpus_read_lock() or
>> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
>> may not have work functions that call cpus_read_lock() directly,
>> but we are likely to extend the list of wq's that are flushed in the
>> future. Moreover, the current set of work functions may acquire locks
>> that have cpu_hotplug_lock further down their dependency chains.
>>
>> One way to do that is to introduce a new top level cpuset_top_mutex
>> which will be acquired first. This new cpuset_top_mutex will provide
>> the needed mutual exclusion without the need to hold cpus_read_lock().
>>
> Introducing a new global lock warrants careful consideration. I wonder if we
> could make all updates to isolated_cpus asynchronous. If that is feasible, we
> could avoid adding a global lock altogether. If not, we need to clarify which
> updates must remain synchronous and which ones can be handled asynchronously.
Almost all the cpuset code runs with cpuset_mutex held together with
either cpus_read_lock or cpus_write_lock. So there is no concurrent
access/update to any of the cpuset internal data. The new
cpuset_top_mutex is added to resolve the possible deadlock scenarios
with the new housekeeping_update() call without breaking this model.
Allowing parallel concurrent access/update of cpuset data would greatly
complicate the code, and we would likely miss some corner cases that we
would then have to fix in the future. We would only do that if cpuset
were in a critical performance path, but it is not. It is not just
isolated_cpus that we are protecting; all the other cpuset data may be
at risk if we don't have a top level mutex to protect them.
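
Just to illustrate the intended ordering, here is a rough sketch (this
is not the actual patch; the function below and the elided arguments
are placeholders, only the lock ordering is the point):

    static DEFINE_MUTEX(cpuset_top_mutex);

    /*
     * Rough sketch of a partition update path with the proposed top
     * level mutex. The real update logic and error handling are
     * omitted; only the lock ordering matters here.
     */
    static void cpuset_partition_update_sketch(void)
    {
            /* Acquired first, with no other cpuset or hotplug lock held */
            mutex_lock(&cpuset_top_mutex);

            cpus_read_lock();
            mutex_lock(&cpuset_mutex);
            /* ... update sched domains, isolated_cpus, etc. ... */
            mutex_unlock(&cpuset_mutex);
            cpus_read_unlock();

            /*
             * The housekeeping workqueue flush happens only after
             * cpus_read_lock() and cpuset_mutex have been dropped, so
             * it can no longer form a cycle back through
             * cpu_hotplug_lock. Mutual exclusion is still provided by
             * cpuset_top_mutex.
             */
            housekeeping_update(...);       /* arguments elided */

            mutex_unlock(&cpuset_top_mutex);
    }
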
Cheers,
Longman