Message-ID: <64f4db2e-0a7f-4de2-aa5a-25f32cefa746@redhat.com>
Date: Fri, 29 Mar 2024 13:06:12 -0400
From: Waiman Long <longman@...hat.com>
To: Tejun Heo <tj@...nel.org>, Valentin Schneider <vschneid@...hat.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>,
"Paul E. McKenney" <paulmck@...nel.org>, LKML
<linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...nel.org>,
Anna-Maria Behnsen <anna-maria@...utronix.de>, Alex Shi <alexs@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Barry Song <song.bao.hua@...ilicon.com>
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re:
[PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in
nohz_full)
On 3/28/24 22:08, Tejun Heo wrote:
> (cc'ing Waiman and quoting whole body)
>
> Hello,
>
> On Thu, Mar 28, 2024 at 09:39:56PM +0100, Valentin Schneider wrote:
>> On 27/03/24 21:42, Thomas Gleixner wrote:
>>> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>>>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>>>> Now, on top of the above, there's one more thing worth noting:
>>>> cpu_up_down_serialize_trainwrecks()
>>>>
>>>> This just flushes the cpuset work, so after that the sched_domain topology
>>>> should be sane. However, I see it's invoked at the tail end of _cpu_down(),
>>>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>>>> around this vs. lock ordering aren't very reassuring, though, so I need to
>>>> look into this more.
>>> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
>>> the change log.
>>>
>>> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
>>> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
>>> trying to flush the work from within the cpu_hotplug_lock write-held
>>> region is guaranteed to deadlock.
>>>
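
To make the ordering concrete, the shape of the deadlock is roughly the
following (a simplified sketch, not the exact call chains in the tree):

	/* CPU hotplug control side, holding the hotplug lock for write: */
	cpus_write_lock();
	...
	flush_work(&cpuset_hotplug_work);	/* waits for the worker below */
	cpus_write_unlock();

	/* cpuset hotplug worker side: */
	mutex_lock(&cpuset_mutex);
	cpus_read_lock();			/* blocks while the write lock is held */
	...
	cpus_read_unlock();
	mutex_unlock(&cpuset_mutex);

The flush can never complete because the work it waits for needs the very
lock the flusher is holding for write.
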
>>> That's why all of this got deferred to a work queue. The flush at the
>>> end of the hotplug code, after dropping the hotplug lock, is there to
>>> prevent user space from seeing the CPU hotplug uevent before the work
>>> is done. But of course between bringing the CPU offline and the flush
>>> the kernel's internal state is inconsistent.
>>>
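
For reference, the deferral plus late flush being described looks roughly
like this in the tree (quoted from memory, so details may be slightly off):

	/* kernel/cpu.c */
	static void cpu_up_down_serialize_trainwrecks(bool tasks_frozen)
	{
		/*
		 * cpusets delegate hotplug operations to a worker to "solve"
		 * the lock order problems. Wait for the work to finish, but
		 * only if tasks are not frozen (suspend/hibernate), as the
		 * worker could not run in that case.
		 */
		if (!tasks_frozen)
			cpuset_wait_for_hotplug();
	}

	/* kernel/cgroup/cpuset.c */
	void cpuset_wait_for_hotplug(void)
	{
		flush_work(&cpuset_hotplug_work);
	}

This runs at the tail end of _cpu_down()/_cpu_up(), after the hotplug lock
has been dropped, which is exactly the inconsistency window mentioned above.
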
>> Thanks for the summary!
>>
>>> There was an attempt to make the CPU hotplug path synchronous in commit
>>> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
>>> reverted with commit 2b729fe7f3 for the very wrong reason:
>>>
>>> https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u
>>>
>>> (cpu_hotplug_lock){++++}-{0:0}:
>>> lock_acquire+0xe4/0x25c
>>> cpus_read_lock+0x50/0x154
>>> static_key_slow_inc+0x18/0x30
>>> mem_cgroup_css_alloc+0x824/0x8b0
>>> cgroup_apply_control_enable+0x1d8/0x56c
>>> cgroup_apply_control+0x40/0x344
>>> cgroup_subtree_control_write+0x664/0x69c
>>> cgroup_file_write+0x130/0x2e8
>>> kernfs_fop_write+0x228/0x32c
>>> __vfs_write+0x84/0x1d8
>>> vfs_write+0x13c/0x1b4
>>> ksys_write+0xb0/0x120
>>>
>>> Instead of the revert this should have been fixed by doing:
>>>
>>> +  cpus_read_lock();
>>>    mutex_lock();
>>>      mem_cgroup_css_alloc();
>>> -      static_key_slow_inc();
>>> +      static_key_slow_inc_cpuslocked();
>>>
>> So looking at the state of things now, writing to the
>> cgroup.subtree_control file looks like:
>>
>> cgroup_file_write()
>> `\
>>   cgroup_subtree_control_write()
>>   `\
>>     cgroup_kn_lock_live()
>>     `\
>>     | cgroup_lock()
>>     | `\
>>     |   mutex_lock(&cgroup_mutex);
>>     |
>>     cgroup_apply_control()
>>     `\
>>       cgroup_apply_control_enable()
>>       `\
>>         css_create()
>>         `\
>>           ss->css_alloc() [mem_cgroup_css_alloc()]
>>           `\
>>             static_branch_inc()
>>
>> and same with cgroup_mkdir(). So if we want to fix the ordering that caused
>> the revert, we'd probably want to go for:
>>
>>  static inline void cgroup_lock(void)
>>  {
>> +	cpus_read_lock();
>>  	mutex_lock(&cgroup_mutex);
>>  }
>>
>>  static inline void cgroup_unlock(void)
>>  {
>>  	mutex_unlock(&cgroup_mutex);
>> +	cpus_read_unlock();
>>  }
>>
>> + a handful of +_cpuslocked() changes.
>>
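
As one example of what such a conversion could look like (a sketch only; I
am quoting the memcg socket accounting site from memory): once cgroup_lock()
takes cpus_read_lock(), the jump-label update in the css_alloc path from the
lockdep splat above would switch to the already-locked variant:

	/* mm/memcontrol.c, mem_cgroup_css_alloc(), sketch of one conversion */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
-		static_branch_inc(&memcg_sockets_enabled_key);
+		static_branch_inc_cpuslocked(&memcg_sockets_enabled_key);

static_branch_inc_cpuslocked() expects cpu_hotplug_lock to already be held
by the caller, so it no longer tries to take cpus_read_lock() itself.
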
>> As for cpuset, it looks like it's following this lock order:
>>
>> cpus_read_lock();
>> mutex_lock(&cpuset_mutex);
>>
>> Which AFAICT is what we want.
>>
>>> Sorry that I did not notice this back then because I was too focussed on
>>> fixing that uevent nonsense in a halfway sane way.
>>>
>>> After that revert cpuset locking became completely insane.
>>>
>>> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
>>> advanced part of that insanity. Seriously, this commit is just a tasteless
>>> and disgusting demonstration of how to paper over a problem which is not
>>> fully understood.
>>>
>>> After staring at it some more (including the history which led up to
>>> these insanities) I really think that the CPU hotplug path can be made
>>> truly synchronous and the memory hotplug path should just get the lock
>>> ordering correct.
>>>
>>> Can we please fix this for real and get all of this insanity out of the
>>> way?
>> Yes please!
> Yeah, making that operation synchronous would be nice. Waiman, you're a lot
> more familiar with this part than I am. Can you please take a look and see
> whether we can make the sched domain updates synchronous?
Sure. I will take a look at how to make the CPU hotplug sched domain update
synchronous. A rough sketch of the direction I have in mind is below.
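
Purely as an illustration of the direction (not a tested patch, and
cpuset_handle_hotplug() below is a hypothetical name standing in for the
body of today's cpuset_hotplug_workfn()):

	/* kernel/cgroup/cpuset.c -- sketch only */
	void cpuset_update_active_cpus(void)
	{
		/*
		 * Called from the CPU hotplug path with cpu_hotplug_lock
		 * already held, so run the hotplug processing synchronously
		 * instead of bouncing it to a workqueue. The processing must
		 * then not try to take cpu_hotplug_lock again itself.
		 */
		lockdep_assert_cpus_held();
		cpuset_handle_hotplug();
	}

The late flush in cpu_up_down_serialize_trainwrecks() and the uevent
ordering games should then no longer be needed for the CPU hotplug case.
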
Cheers,
Longman