linux-kernel - Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201211124148.GW2414@hirez.programming.kicks-ass.net>
Date:   Fri, 11 Dec 2020 13:41:48 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Vincent Donnefort <vincent.donnefort@....com>
Cc:     Valentin Schneider <valentin.schneider@....com>,
        linux-kernel@...r.kernel.org, Qian Cai <cai@...hat.com>,
        tglx@...utronix.de, mingo@...nel.org, bigeasy@...utronix.de,
        qais.yousef@....com, swood@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, tj@...nel.org, ouwen210@...mail.com
Subject: Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during
 late hotplug

On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> Hi Valentin,
> 
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > Per-CPU kworkers forcefully migrated away by hotplug via
> > workqueue_offline_cpu() can end up spawning more kworkers via
> > 
> >   manage_workers() -> maybe_create_worker()
> > 
> > Workers created at this point will be bound using
> > 
> >   pool->attrs->cpumask
> > 
> > which in this case is wrong, as the hotplug state machine already migrated
> > all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
> > condition is sched_cpu_dying() (i.e. there's a kworker enqueued on the
> > dying rq).
> > 
> > Special-case workers being attached to DISASSOCIATED pools and bind them to
> > cpu_active_mask, mimicking them being present when workqueue_offline_cpu()
> > was invoked.
> > 
> > Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com
> > Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
> 
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug") ?
> 
> Previously we had:
> 
>  AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
>    ...
>  TEARDOWN_CPU -> clear CPU in cpu_online_mask
>    |
>    |-AP_SCHED_STARTING -> migrate_tasks()
>    |
>   AP_OFFLINE
> 
> worker_attach_to_pool(), is "protected" by the cpu_online_mask in
> set_cpus_allowed_ptr(). IIUC, now, the tasks being migrated before the
> cpu_online_mask is actually flipped, there's a window, between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can wake-up
> a new one, for the hotunplugged pool that wouldn't be caught by the
> hotunplug migration.

Yes, very much so, however the commit Valentin picked was supposed to
preemptively fix this. So we can consider this a fix for the fix.

But I don't mind an alternative or perhaps even second Fixes tag on
this.