linux-kernel - Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <64f4db2e-0a7f-4de2-aa5a-25f32cefa746@redhat.com>
Date: Fri, 29 Mar 2024 13:06:12 -0400
From: Waiman Long <longman@...hat.com>
To: Tejun Heo <tj@...nel.org>, Valentin Schneider <vschneid@...hat.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
 Frederic Weisbecker <frederic@...nel.org>,
 "Paul E. McKenney" <paulmck@...nel.org>, LKML
 <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...nel.org>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>, Alex Shi <alexs@...nel.org>,
 Peter Zijlstra <peterz@...radead.org>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Barry Song <song.bao.hua@...ilicon.com>
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re:
 [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in
 nohz_full)

On 3/28/24 22:08, Tejun Heo wrote:
> (cc'ing Waiman and quoting whole body)
>
> Hello,
>
> On Thu, Mar 28, 2024 at 09:39:56PM +0100, Valentin Schneider wrote:
>> On 27/03/24 21:42, Thomas Gleixner wrote:
>>> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>>>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>>>> Now, on top of the above, there's one more thing worth noting:
>>>>    cpu_up_down_serialize_trainwrecks()
>>>>
>>>> This just flushes the cpuset work, so after that the sched_domain topology
>>>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>>>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>>>> around this vs. lock ordering aren't very reassuring however, so I need to
>>>> look into this more.
>>> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
>>> the change log.
>>>
>>> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
>>> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
>>> trying to flush the work from within the cpu hotplug lock write held
>>> region is guaranteed to dead lock.
>>>
>>> That's why all of this got deferred to a work queue. The flush at the
>>> end of the hotplug code after dropping the hotplug lock is there to
>>> prevent that user space sees the CPU hotplug uevent before the work is
>>> done. But of course between bringing the CPU offline and the flush the
>>> kernel internal state is inconsistent.
>>>
>> Thanks for the summary!
>>
>>> There was an attempt to make the CPU hotplug path synchronous in commit
>>> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
>>> reverted with commit 2b729fe7f3 for the very wrong reason:
>>>
>>> https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u
>>>
>>>   (cpu_hotplug_lock){++++}-{0:0}:
>>>    lock_acquire+0xe4/0x25c
>>>    cpus_read_lock+0x50/0x154
>>>    static_key_slow_inc+0x18/0x30
>>>    mem_cgroup_css_alloc+0x824/0x8b0
>>>    cgroup_apply_control_enable+0x1d8/0x56c
>>>    cgroup_apply_control+0x40/0x344
>>>    cgroup_subtree_control_write+0x664/0x69c
>>>    cgroup_file_write+0x130/0x2e8
>>>    kernfs_fop_write+0x228/0x32c
>>>    __vfs_write+0x84/0x1d8
>>>    vfs_write+0x13c/0x1b4
>>>    ksys_write+0xb0/0x120
>>>
>>> Instead of the revert this should have been fixed by doing:
>>>
>>> +  cpus_read_lock();
>>>     mutex_lock();
>>>     mem_cgroup_css_alloc();
>>> -  static_key_slow_inc();
>>> +  static_key_slow_inc_cpuslocked();
>>>
>> So looking at the state of things now, writing to the
>> cgroup.subtree_control file looks like:
>>
>>    cgroup_file_write()
>>    `\
>>      cgroup_subtree_control_write()
>>      `\
>>        cgroup_kn_lock_live()
>>        `\
>>        | cgroup_lock()
>>        | `\
>>        |   mutex_lock(&cgroup_mutex);
>>        |
>>        cgroup_apply_control()
>>        `\
>>          cgroup_apply_control_enable()
>>          `\
>>            css_create()
>>            `\
>>              ss->css_alloc() [mem_cgroup_css_alloc()]
>>              `\
>>                static_branch_inc()
>>
>> and same with cgroup_mkdir(). So if we want to fix the ordering that caused
>> the revert, we'd probably want to go for:
>>
>>    static inline void cgroup_lock(void)
>>    {
>> +       cpus_read_lock();
>> 	mutex_lock(&cgroup_mutex);
>>    }
>>
>>    static inline void cgroup_unlock(void)
>>    {
>> 	mutex_unlock(&cgroup_mutex);
>> -       cpus_read_unlock();
>>    }
>>
>> + a handful of +_cpuslocked() changes.
>>
>> As for cpuset, it looks like it's following this lock order:
>>
>> 	cpus_read_lock();
>> 	mutex_lock(&cpuset_mutex);
>>
>> Which AFAICT is what we want.
>>
>>> Sorry that I did not notice this back then because I was too focussed on
>>> fixing that uevent nonsense in a halfways sane way.
>>>
>>> After that revert cpuset locking became completely insane.
>>>
>>> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
>>> advanced part of that insanity. Seriously this commit is just tasteless
>>> and disgusting demonstration of how to paper over a problem which is not
>>> fully understood.
>>>
>>> After staring at it some more (including the history which led up to
>>> these insanities) I really think that the CPU hotplug path can be made
>>> truly synchronous and the memory hotplug path should just get the lock
>>> ordering correct.
>>>
>>> Can we please fix this for real and get all of this insanity out of the
>>> way
>> Yes please!
> Yeah, making that operation synchronous would be nice. Waiman, you're a lot
> more familiar with this part than I am. Can you please take a look and see
> whether we can turn the sched domain updates synchronous?

Sure. I will take a look on how to make cpu hotplug sched domain update 
synchronous.

Cheers,
Longman