linux-kernel - Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87cyrfe7a3.ffs@tglx>
Date: Wed, 27 Mar 2024 21:42:12 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Valentin Schneider <vschneid@...hat.com>, Frederic Weisbecker
 <frederic@...nel.org>, "Paul E. McKenney" <paulmck@...nel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...nel.org>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>, Alex Shi
 <alexs@...nel.org>, Peter Zijlstra <peterz@...radead.org>, Vincent Guittot
 <vincent.guittot@...aro.org>, Barry Song <song.bao.hua@...ilicon.com>,
 Tejun Heo <tj@...nel.org>
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re:
 [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in
 nohz_full)

On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
> On 22/03/24 14:22, Frederic Weisbecker wrote:
>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> Now, on top of the above, there's one more thing worth noting:
>   cpu_up_down_serialize_trainwrecks()
>
> This just flushes the cpuset work, so after that the sched_domain topology
> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> around this vs. lock ordering aren't very reassuring however, so I need to
> look into this more.

commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
the change log.

TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
cpu_hotplug_lock in the hotplug path for whatever silly reason. So
trying to flush the work from within the cpu hotplug lock write held
region is guaranteed to dead lock.

That's why all of this got deferred to a work queue. The flush at the
end of the hotplug code after dropping the hotplug lock is there to
prevent that user space sees the CPU hotplug uevent before the work is
done. But of course between bringing the CPU offline and the flush the
kernel internal state is inconsistent.

There was an attempt to make the CPU hotplug path synchronous in commit
a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
reverted with commit 2b729fe7f3 for the very wrong reason:

https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u

 (cpu_hotplug_lock){++++}-{0:0}:
  lock_acquire+0xe4/0x25c
  cpus_read_lock+0x50/0x154
  static_key_slow_inc+0x18/0x30
  mem_cgroup_css_alloc+0x824/0x8b0
  cgroup_apply_control_enable+0x1d8/0x56c
  cgroup_apply_control+0x40/0x344
  cgroup_subtree_control_write+0x664/0x69c
  cgroup_file_write+0x130/0x2e8
  kernfs_fop_write+0x228/0x32c
  __vfs_write+0x84/0x1d8
  vfs_write+0x13c/0x1b4
  ksys_write+0xb0/0x120

Instead of the revert this should have been fixed by doing:

+  cpus_read_lock();
   mutex_lock();
   mem_cgroup_css_alloc();
-  static_key_slow_inc();
+  static_key_slow_inc_cpuslocked();

Sorry that I did not notice this back then because I was too focussed on
fixing that uevent nonsense in a halfways sane way.

After that revert cpuset locking became completely insane.

cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
advanced part of that insanity. Seriously this commit is just tasteless
and disgusting demonstration of how to paper over a problem which is not
fully understood.

After staring at it some more (including the history which led up to
these insanities) I really think that the CPU hotplug path can be made
truly synchronous and the memory hotplug path should just get the lock
ordering correct.

Can we please fix this for real and get all of this insanity out of the
way including the nonsensical comments in the cpuset code:

 * Call with cpuset_mutex held.  Takes cpus_read_lock().

 ...

       lockdep_assert_cpus_held();
       lockdep_assert_held(&cpuset_mutex);

Oh well...

Thanks,

        tglx