Message-ID: <xhsmhle62ay5f.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Thu, 28 Mar 2024 21:39:56 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Thomas Gleixner <tglx@...utronix.de>, Frederic Weisbecker
<frederic@...nel.org>, "Paul E. McKenney" <paulmck@...nel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...nel.org>,
Anna-Maria Behnsen <anna-maria@...utronix.de>, Alex Shi
<alexs@...nel.org>, Peter Zijlstra <peterz@...radead.org>, Vincent Guittot
<vincent.guittot@...aro.org>, Barry Song <song.bao.hua@...ilicon.com>,
Tejun Heo <tj@...nel.org>
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re:
[PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in
nohz_full)
On 27/03/24 21:42, Thomas Gleixner wrote:
> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>> Now, on top of the above, there's one more thing worth noting:
>> cpu_up_down_serialize_trainwrecks()
>>
>> This just flushes the cpuset work, so after that the sched_domain topology
>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> around this vs. lock ordering aren't very reassuring however, so I need to
>> look into this more.
>
> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> the change log.
>
> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> trying to flush the work from within the cpu_hotplug_lock write-held
> region is guaranteed to deadlock.
>
> That's why all of this got deferred to a work queue. The flush at the
> end of the hotplug code after dropping the hotplug lock is there to
> prevent that user space sees the CPU hotplug uevent before the work is
> done. But of course between bringing the CPU offline and the flush the
> kernel internal state is inconsistent.
>
Thanks for the summary!
> There was an attempt to make the CPU hotplug path synchronous in commit
> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
> reverted with commit 2b729fe7f3 for the very wrong reason:
>
> https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u
>
> (cpu_hotplug_lock){++++}-{0:0}:
>        lock_acquire+0xe4/0x25c
>        cpus_read_lock+0x50/0x154
>        static_key_slow_inc+0x18/0x30
>        mem_cgroup_css_alloc+0x824/0x8b0
>        cgroup_apply_control_enable+0x1d8/0x56c
>        cgroup_apply_control+0x40/0x344
>        cgroup_subtree_control_write+0x664/0x69c
>        cgroup_file_write+0x130/0x2e8
>        kernfs_fop_write+0x228/0x32c
>        __vfs_write+0x84/0x1d8
>        vfs_write+0x13c/0x1b4
>        ksys_write+0xb0/0x120
>
> Instead of the revert this should have been fixed by doing:
>
> + cpus_read_lock();
> mutex_lock();
> mem_cgroup_css_alloc();
> - static_key_slow_inc();
> + static_key_slow_inc_cpuslocked();
>
So looking at the state of things now, writing to the
cgroup.subtree_control file looks like:
cgroup_file_write()
`\
  cgroup_subtree_control_write()
  `\
    cgroup_kn_lock_live()
    `\
    | cgroup_lock()
    | `\
    |   mutex_lock(&cgroup_mutex);
    |
    cgroup_apply_control()
    `\
      cgroup_apply_control_enable()
      `\
        css_create()
        `\
          ss->css_alloc() [mem_cgroup_css_alloc()]
          `\
            static_branch_inc()
and same with cgroup_mkdir(). So if we want to fix the ordering that caused
the revert, we'd probably want to go for:
  static inline void cgroup_lock(void)
  {
+	cpus_read_lock();
	mutex_lock(&cgroup_mutex);
  }

  static inline void cgroup_unlock(void)
  {
	mutex_unlock(&cgroup_mutex);
+	cpus_read_unlock();
  }

+ a handful of *_cpuslocked() changes.
As for cpuset, it looks like it's following this lock order:
cpus_read_lock();
mutex_lock(&cpuset_mutex);
Which AFAICT is what we want.
> Sorry that I did not notice this back then because I was too focussed on
> fixing that uevent nonsense in a halfways sane way.
>
> After that revert cpuset locking became completely insane.
>
> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> advanced part of that insanity. Seriously, this commit is just a
> tasteless and disgusting demonstration of how to paper over a problem
> which is not
> fully understood.
>
> After staring at it some more (including the history which led up to
> these insanities) I really think that the CPU hotplug path can be made
> truly synchronous and the memory hotplug path should just get the lock
> ordering correct.
>
> Can we please fix this for real and get all of this insanity out of the
> way
Yes please!
> including the nonsensical comments in the cpuset code:
>
> * Call with cpuset_mutex held. Takes cpus_read_lock().
>
> ...
>
> lockdep_assert_cpus_held();
> lockdep_assert_held(&cpuset_mutex);
>
> Oh well...
>
> Thanks,
>
> tglx