linux-kernel - Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZgYikMb5kZ7rxPp6@slm.duckdns.org>
Date: Thu, 28 Mar 2024 16:08:16 -1000
From: Tejun Heo <tj@...nel.org>
To: Valentin Schneider <vschneid@...hat.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
	Frederic Weisbecker <frederic@...nel.org>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...nel.org>,
	Anna-Maria Behnsen <anna-maria@...utronix.de>,
	Alex Shi <alexs@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Barry Song <song.bao.hua@...ilicon.com>,
	Waiman Long <longman@...hat.com>
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re:
 [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in
 nohz_full)

(cc'ing Waiman and quoting whole body)

Hello,

On Thu, Mar 28, 2024 at 09:39:56PM +0100, Valentin Schneider wrote:
> On 27/03/24 21:42, Thomas Gleixner wrote:
> > On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
> >> On 22/03/24 14:22, Frederic Weisbecker wrote:
> >>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> >> Now, on top of the above, there's one more thing worth noting:
> >>   cpu_up_down_serialize_trainwrecks()
> >>
> >> This just flushes the cpuset work, so after that the sched_domain topology
> >> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> >> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> >> around this vs. lock ordering aren't very reassuring however, so I need to
> >> look into this more.
> >
> > commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> > the change log.
> >
> > TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> > cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> > trying to flush the work from within the cpu hotplug lock write held
> > region is guaranteed to dead lock.
> >
> > That's why all of this got deferred to a work queue. The flush at the
> > end of the hotplug code after dropping the hotplug lock is there to
> > prevent that user space sees the CPU hotplug uevent before the work is
> > done. But of course between bringing the CPU offline and the flush the
> > kernel internal state is inconsistent.
> >
> 
> Thanks for the summary!
> 
> > There was an attempt to make the CPU hotplug path synchronous in commit
> > a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
> > reverted with commit 2b729fe7f3 for the very wrong reason:
> >
> > https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u
> >
> >  (cpu_hotplug_lock){++++}-{0:0}:
> >   lock_acquire+0xe4/0x25c
> >   cpus_read_lock+0x50/0x154
> >   static_key_slow_inc+0x18/0x30
> >   mem_cgroup_css_alloc+0x824/0x8b0
> >   cgroup_apply_control_enable+0x1d8/0x56c
> >   cgroup_apply_control+0x40/0x344
> >   cgroup_subtree_control_write+0x664/0x69c
> >   cgroup_file_write+0x130/0x2e8
> >   kernfs_fop_write+0x228/0x32c
> >   __vfs_write+0x84/0x1d8
> >   vfs_write+0x13c/0x1b4
> >   ksys_write+0xb0/0x120
> >
> > Instead of the revert this should have been fixed by doing:
> >
> > +  cpus_read_lock();
> >    mutex_lock();
> >    mem_cgroup_css_alloc();
> > -  static_key_slow_inc();
> > +  static_key_slow_inc_cpuslocked();
> >
> 
> So looking at the state of things now, writing to the
> cgroup.subtree_control file looks like: 
> 
>   cgroup_file_write()
>   `\
>     cgroup_subtree_control_write()
>     `\
>       cgroup_kn_lock_live()
>       `\
>       | cgroup_lock()
>       | `\
>       |   mutex_lock(&cgroup_mutex);
>       |
>       cgroup_apply_control()
>       `\
>         cgroup_apply_control_enable()
>         `\
>           css_create()
>           `\
>             ss->css_alloc() [mem_cgroup_css_alloc()]
>             `\
>               static_branch_inc()        
> 
> and same with cgroup_mkdir(). So if we want to fix the ordering that caused
> the revert, we'd probably want to go for:
> 
>   static inline void cgroup_lock(void)
>   {
> +       cpus_read_lock();
> 	mutex_lock(&cgroup_mutex);
>   }
> 
>   static inline void cgroup_unlock(void)
>   {
> 	mutex_unlock(&cgroup_mutex);
> -       cpus_read_unlock();        
>   }
> 
> + a handful of +_cpuslocked() changes.
> 
> As for cpuset, it looks like it's following this lock order:
> 
> 	cpus_read_lock();
> 	mutex_lock(&cpuset_mutex);
> 
> Which AFAICT is what we want.
> 
> > Sorry that I did not notice this back then because I was too focussed on
> > fixing that uevent nonsense in a halfways sane way.
> >
> > After that revert cpuset locking became completely insane.
> >
> > cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> > advanced part of that insanity. Seriously this commit is just tasteless
> > and disgusting demonstration of how to paper over a problem which is not
> > fully understood.
> >
> > After staring at it some more (including the history which led up to
> > these insanities) I really think that the CPU hotplug path can be made
> > truly synchronous and the memory hotplug path should just get the lock
> > ordering correct.
> >
> > Can we please fix this for real and get all of this insanity out of the
> > way
> 
> Yes please!

Yeah, making that operation synchronous would be nice. Waiman, you're a lot
more familiar with this part than I am. Can you please take a look and see
whether we can turn the sched domain updates synchronous?

Thanks.

-- 
tejun