[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANRm+CyxpKdQvT7evcwuV=QUM1XWpv8WxhKJOyAFHez8eT=YAw@mail.gmail.com>
Date: Wed, 11 May 2016 16:05:48 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Tejun Heo <tj@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Wanpeng Li <wanpeng.li@...mail.com>,
Lai Jiangshan <jiangshanlai@...il.com>,
Peter Zijlstra <peterz@...radead.org>,
Frédéric Weisbecker <fweisbec@...il.com>
Subject: Re: [PATCH] workqueue: fix rebind bound workers warning
2016-05-11 15:34 GMT+08:00 Thomas Gleixner <tglx@...utronix.de>:
> On Wed, 11 May 2016, Wanpeng Li wrote:
>> Hi Tejun,
>> 2016-05-10 5:50 GMT+08:00 Wanpeng Li <kernellwp@...il.com>:
>> > Cc Thomas, the new state machine author,
>> > 2016-05-10 1:00 GMT+08:00 Tejun Heo <tj@...nel.org>:
>> >> Hello,
>> >>
>> >> On Thu, May 05, 2016 at 09:41:31AM +0800, Wanpeng Li wrote:
>> >>> The boot CPU handles housekeeping duty(unbound timers, workqueues,
>> >>> timekeeping, ...) on behalf of full dynticks CPUs. It must remain
>> >>> online when nohz full is enabled. There is a priority set to every
>> >>> notifier_blocks:
>> >>>
>> >>> workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
>> >>>
>> >>> So tick_nohz_cpu_down callback failed when down prepare cpu 0, and
>> >>> notifier_blocks behind tick_nohz_cpu_down will not be called any
>> >>> more, which leads to workers are actually not unbound. Then hotplug
>> >>> state machine will fallback to undo and online cpu 0 again. Workers
>> >>> will be rebound unconditionally even if they are not unbound and
>> >>> trigger the warning in this progress.
>> >>
>> >> I'm a bit confused. Are you saying that the hotplug statemachine may
>> >> invoke CPU_DOWN_FAILED w/o preceding CPU_DOWN on the same callback?
>> >
>> > I think so. CPU_DOWN_FAILED is detected in the process of CPU_DOWN_PREPARE
>
> Well, no. It's not detected.
>
> If a down prepare callback fails, then DOWN_FAILED is invoked for all
> callbacks which have successfully executed DOWN_PREPARE.
>
> But, workqueue has actually two notifiers. One which handles
> UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
>
> Now look at the priorities of those callbacks:
>
> CPU_PRI_WORKQUEUE_UP = 5
> CPU_PRI_WORKQUEUE_DOWN = -5
>
> So the call order on DOWN_PREPARE is:
>
> CB 1
> CB ...
> CB workqueue_up() -> Ignores DOWN_PREPARE
> CB ...
> CB X ---> Fails
>
> So we call up to CB X with DOWN_FAILED
>
> CB 1
> CB ...
> CB workqueue_up() -> Handles DOWN_FAILED
> CB ...
> CB X-1
>
> So the problem is that the workqueue stuff handles DOWN_FAILED in the up
> callback, while it should do it in the down callback. Which is not a good idea
> either because it wants to be called early on rollback...
>
> Brilliant stuff, isn't it? The hotplug rework will solve this problem because
> the callbacks become symetric, but for the existing mess, we need some
> workaround in the workqueue code.
Thanks for your reply, Thomas. :-)
Do you think the current version patch is the right fix/workaround for
the existing mess?
Regards,
Wanpeng Li
Powered by blists - more mailing lists