linux-kernel - Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <xhsmhle62ay5f.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Thu, 28 Mar 2024 21:39:56 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Thomas Gleixner <tglx@...utronix.de>, Frederic Weisbecker
 <frederic@...nel.org>, "Paul E. McKenney" <paulmck@...nel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...nel.org>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>, Alex Shi
 <alexs@...nel.org>, Peter Zijlstra <peterz@...radead.org>, Vincent Guittot
 <vincent.guittot@...aro.org>, Barry Song <song.bao.hua@...ilicon.com>,
 Tejun Heo <tj@...nel.org>
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re:
 [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in
 nohz_full)

On 27/03/24 21:42, Thomas Gleixner wrote:
> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>> Now, on top of the above, there's one more thing worth noting:
>>   cpu_up_down_serialize_trainwrecks()
>>
>> This just flushes the cpuset work, so after that the sched_domain topology
>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> around this vs. lock ordering aren't very reassuring however, so I need to
>> look into this more.
>
> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> the change log.
>
> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> trying to flush the work from within the cpu hotplug lock write held
> region is guaranteed to dead lock.
>
> That's why all of this got deferred to a work queue. The flush at the
> end of the hotplug code after dropping the hotplug lock is there to
> prevent that user space sees the CPU hotplug uevent before the work is
> done. But of course between bringing the CPU offline and the flush the
> kernel internal state is inconsistent.
>

Thanks for the summary!

> There was an attempt to make the CPU hotplug path synchronous in commit
> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
> reverted with commit 2b729fe7f3 for the very wrong reason:
>
> https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u
>
>  (cpu_hotplug_lock){++++}-{0:0}:
>   lock_acquire+0xe4/0x25c
>   cpus_read_lock+0x50/0x154
>   static_key_slow_inc+0x18/0x30
>   mem_cgroup_css_alloc+0x824/0x8b0
>   cgroup_apply_control_enable+0x1d8/0x56c
>   cgroup_apply_control+0x40/0x344
>   cgroup_subtree_control_write+0x664/0x69c
>   cgroup_file_write+0x130/0x2e8
>   kernfs_fop_write+0x228/0x32c
>   __vfs_write+0x84/0x1d8
>   vfs_write+0x13c/0x1b4
>   ksys_write+0xb0/0x120
>
> Instead of the revert this should have been fixed by doing:
>
> +  cpus_read_lock();
>    mutex_lock();
>    mem_cgroup_css_alloc();
> -  static_key_slow_inc();
> +  static_key_slow_inc_cpuslocked();
>

So looking at the state of things now, writing to the
cgroup.subtree_control file looks like: 

  cgroup_file_write()
  `\
    cgroup_subtree_control_write()
    `\
      cgroup_kn_lock_live()
      `\
      | cgroup_lock()
      | `\
      |   mutex_lock(&cgroup_mutex);
      |
      cgroup_apply_control()
      `\
        cgroup_apply_control_enable()
        `\
          css_create()
          `\
            ss->css_alloc() [mem_cgroup_css_alloc()]
            `\
              static_branch_inc()        

and same with cgroup_mkdir(). So if we want to fix the ordering that caused
the revert, we'd probably want to go for:

  static inline void cgroup_lock(void)
  {
+       cpus_read_lock();
	mutex_lock(&cgroup_mutex);
  }

  static inline void cgroup_unlock(void)
  {
	mutex_unlock(&cgroup_mutex);
-       cpus_read_unlock();        
  }

+ a handful of +_cpuslocked() changes.

As for cpuset, it looks like it's following this lock order:

	cpus_read_lock();
	mutex_lock(&cpuset_mutex);

Which AFAICT is what we want.

> Sorry that I did not notice this back then because I was too focussed on
> fixing that uevent nonsense in a halfways sane way.
>
> After that revert cpuset locking became completely insane.
>
> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> advanced part of that insanity. Seriously this commit is just tasteless
> and disgusting demonstration of how to paper over a problem which is not
> fully understood.
>
> After staring at it some more (including the history which led up to
> these insanities) I really think that the CPU hotplug path can be made
> truly synchronous and the memory hotplug path should just get the lock
> ordering correct.
>
> Can we please fix this for real and get all of this insanity out of the
> way

Yes please!

> including the nonsensical comments in the cpuset code:
>
>  * Call with cpuset_mutex held.  Takes cpus_read_lock().
>
>  ...
>
>        lockdep_assert_cpus_held();
>        lockdep_assert_held(&cpuset_mutex);
>
> Oh well...
>
> Thanks,
>
>         tglx