Message-ID: <20201211124148.GW2414@hirez.programming.kicks-ass.net>
Date: Fri, 11 Dec 2020 13:41:48 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Vincent Donnefort <vincent.donnefort@....com>
Cc: Valentin Schneider <valentin.schneider@....com>,
linux-kernel@...r.kernel.org, Qian Cai <cai@...hat.com>,
tglx@...utronix.de, mingo@...nel.org, bigeasy@...utronix.de,
qais.yousef@....com, swood@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, tj@...nel.org, ouwen210@...mail.com
Subject: Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug

On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> Hi Valentin,
>
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > Per-CPU kworkers forcefully migrated away by hotplug via
> > workqueue_offline_cpu() can end up spawning more kworkers via
> >
> > manage_workers() -> maybe_create_worker()
> >
> > Workers created at this point will be bound using
> >
> > pool->attrs->cpumask
> >
> > which in this case is wrong, as the hotplug state machine already migrated
> > all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
> > condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
> > dying rq).
> >
> > Special-case workers being attached to DISASSOCIATED pools and bind them to
> > cpu_active_mask, mimicking them being present when workqueue_offline_cpu()
> > was invoked.
> >
> > Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com
> > Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
>
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug") ?
>
> Previously we had:
>
> AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
> ...
> TEARDOWN_CPU -> clear CPU in cpu_online_mask
> |
> |-AP_SCHED_STARTING -> migrate_tasks()
> |
> AP_OFFLINE
>
> worker_attach_to_pool() is "protected" by the cpu_online_mask in
> set_cpus_allowed_ptr(). IIUC, now that the tasks are migrated before the
> cpu_online_mask is actually flipped, there's a window between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can wake up
> a new one for the hotunplugged pool, which wouldn't be caught by the
> hotunplug migration.

Yes, very much so. However, the commit Valentin picked was supposed to
preemptively fix this, so we can consider this a fix for the fix.

But I don't mind an alternative, or perhaps even a second Fixes tag, on
this.
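
For reference, the special case from the quoted changelog (workers attached
to a DISASSOCIATED pool get cpu_active_mask instead of pool->attrs->cpumask)
can be modelled in plain user-space C roughly as below. This is only a
sketch, not the patch itself: struct pool_model, the cpumask_t typedef, the
flag value and worker_attach_mask() are hypothetical stand-ins, and the real
code would go through set_cpus_allowed_ptr() rather than return a mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in: bit n set means CPU n is in the mask. */
typedef uint64_t cpumask_t;

/* Illustrative flag value only; this model does not depend on the kernel's. */
#define POOL_DISASSOCIATED (1u << 2)

/* Hypothetical model of the two worker_pool fields that matter here. */
struct pool_model {
	unsigned int flags;      /* POOL_* flags */
	cpumask_t attrs_cpumask; /* models pool->attrs->cpumask */
};

/*
 * Pick the affinity mask for a worker being attached to @pool.
 *
 * A pool marked DISASSOCIATED has already had its pinned workers pushed
 * off the outgoing CPU, so a late-created worker is given the active mask
 * instead of the pool's own (now stale) per-CPU mask.
 */
static cpumask_t worker_attach_mask(const struct pool_model *pool,
				    cpumask_t active_mask)
{
	assert(pool != NULL);

	if (pool->flags & POOL_DISASSOCIATED)
		return active_mask;

	return pool->attrs_cpumask;
}
```

With a pool pinned to CPU 3 and CPUs 0-2 active, this returns the CPU 3 mask
while the pool is associated and the CPU 0-2 mask once POOL_DISASSOCIATED is
set, which is the behaviour the changelog describes as mimicking the worker
having been present when workqueue_offline_cpu() ran.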