linux-kernel - Re: [PATCH] workqueue: fix rebind bound workers warning

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CANRm+CyxpKdQvT7evcwuV=QUM1XWpv8WxhKJOyAFHez8eT=YAw@mail.gmail.com>
Date:	Wed, 11 May 2016 16:05:48 +0800
From:	Wanpeng Li <kernellwp@...il.com>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	Tejun Heo <tj@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Wanpeng Li <wanpeng.li@...mail.com>,
	Lai Jiangshan <jiangshanlai@...il.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Frédéric Weisbecker <fweisbec@...il.com>
Subject: Re: [PATCH] workqueue: fix rebind bound workers warning

2016-05-11 15:34 GMT+08:00 Thomas Gleixner <tglx@...utronix.de>:
> On Wed, 11 May 2016, Wanpeng Li wrote:
>> Hi Tejun,
>> 2016-05-10 5:50 GMT+08:00 Wanpeng Li <kernellwp@...il.com>:
>> > Cc Thomas, the new state machine author,
>> > 2016-05-10 1:00 GMT+08:00 Tejun Heo <tj@...nel.org>:
>> >> Hello,
>> >>
>> >> On Thu, May 05, 2016 at 09:41:31AM +0800, Wanpeng Li wrote:
>> >>> The boot CPU handles housekeeping duty(unbound timers, workqueues,
>> >>> timekeeping, ...) on behalf of full dynticks CPUs. It must remain
>> >>> online when nohz full is enabled. There is a priority set to every
>> >>> notifier_blocks:
>> >>>
>> >>> workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
>> >>>
>> >>> So tick_nohz_cpu_down callback failed when down prepare cpu 0, and
>> >>> notifier_blocks behind tick_nohz_cpu_down will not be called any
>> >>> more, which leads to workers are actually not unbound. Then hotplug
>> >>> state machine will fallback to undo and online cpu 0 again. Workers
>> >>> will be rebound unconditionally even if they are not unbound and
>> >>> trigger the warning in this progress.
>> >>
>> >> I'm a bit confused.  Are you saying that the hotplug statemachine may
>> >> invoke CPU_DOWN_FAILED w/o preceding CPU_DOWN on the same callback?
>> >
>> > I think so. CPU_DOWN_FAILED is detected in the process of CPU_DOWN_PREPARE
>
> Well, no. It's not detected.
>
> If a down prepare callback fails, then DOWN_FAILED is invoked for all
> callbacks which have successfully executed DOWN_PREPARE.
>
> But, workqueue has actually two notifiers. One which handles
> UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
>
> Now look at the priorities of those callbacks:
>
>     CPU_PRI_WORKQUEUE_UP        = 5
>     CPU_PRI_WORKQUEUE_DOWN      = -5
>
> So the call order on DOWN_PREPARE is:
>
>    CB 1
>    CB ...
>    CB workqueue_up() -> Ignores DOWN_PREPARE
>    CB ...
>    CB X ---> Fails
>
> So we call up to CB X with DOWN_FAILED
>
>    CB 1
>    CB ...
>    CB workqueue_up() -> Handles DOWN_FAILED
>    CB ...
>    CB X-1
>
> So the problem is that the workqueue stuff handles DOWN_FAILED in the up
> callback, while it should do it in the down callback. Which is not a good idea
> either because it wants to be called early on rollback...
>
> Brilliant stuff, isn't it? The hotplug rework will solve this problem because
> the callbacks become symetric, but for the existing mess, we need some
> workaround in the workqueue code.

Thanks for your reply, Thomas. :-)
Do you think the current version patch is the right fix/workaround for
the existing mess?

Regards,
Wanpeng Li