Message-ID: <YFh4kWFZTw4wSOq3@alley>
Date: Mon, 22 Mar 2021 11:59:29 +0100
From: Petr Mladek <pmladek@...e.com>
To: Wang Qing <wangqing@...o.com>
Cc: Tejun Heo <tj@...nel.org>, Lai Jiangshan <jiangshanlai@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
"Guilherme G. Piccoli" <gpiccoli@...onical.com>,
Andrey Ignatov <rdna@...com>, Vlastimil Babka <vbabka@...e.cz>,
Santosh Sivaraj <santosh@...six.org>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for
unbound lockup checking
On Fri 2021-03-19 16:00:36, Wang Qing wrote:
> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu
> updated, while the unbound worker_pool running on its core uses
> wq_watchdog_touched to determine whether locked up. This may be mischecked.
In other words, unbound workqueues are not aware of the more common
touch_softlockup_watchdog() because it updates only
wq_watchdog_touched_cpu for the affected CPU. As a result,
the workqueue watchdog might report a lockup in an unbound workqueue
even though the work is only blocked by known slow code.
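To make the failure mode concrete, here is a minimal user-space sketch of the
pre-patch behavior. It is only a model under stated assumptions, not kernel
code: NR_CPUS, the plain global jiffies, the array standing in for the per-CPU
variable, and the helpers wq_watchdog_touch_old(), unbound_pool_stalled() and
demo_false_positive() are hypothetical names, and time_after() is a simplified
copy of the kernel macro.

```c
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count for the model */

/* Simplified copy of the kernel macro: true if a is after b. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

static unsigned long jiffies;				/* model of the tick counter */
static unsigned long wq_watchdog_touched;		/* global touch timestamp */
static unsigned long wq_watchdog_touched_cpu[NR_CPUS];	/* stands in for the per-CPU variable */

/* Pre-patch logic: a CPU-specific touch leaves the global timestamp alone. */
static void wq_watchdog_touch_old(int cpu)
{
	if (cpu >= 0)
		wq_watchdog_touched_cpu[cpu] = jiffies;
	else
		wq_watchdog_touched = jiffies;
}

/* Stall check for an unbound pool (pool->cpu < 0): only the global stamp counts. */
static bool unbound_pool_stalled(unsigned long pool_ts, unsigned long thresh)
{
	unsigned long touched = wq_watchdog_touched;
	unsigned long ts = time_after(pool_ts, touched) ? pool_ts : touched;

	return time_after(jiffies, ts + thresh);
}

/* Slow code touches the watchdog on CPU 1, yet the unbound pool is still flagged. */
static bool demo_false_positive(void)
{
	jiffies = 1000;
	wq_watchdog_touch_old(1);		/* updates only wq_watchdog_touched_cpu[1] */
	return unbound_pool_stalled(100, 300);	/* global stamp is stale, so this is true */
}
```

In this model the touch on CPU 1 never reaches wq_watchdog_touched, so the
check for the unbound pool compares jiffies against a stale timestamp.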
> My suggestion is to update both when touch_softlockup_watchdog() is called,
> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched
> to check unbound worker_pool.
>
> Signed-off-by: Wang Qing <wangqing@...o.com>
> ---
> kernel/watchdog.c | 5 +++--
> kernel/workqueue.c | 17 ++++++-----------
> 2 files changed, 9 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 7110906..107bc38
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
> * update as well, the only side effect might be a cycle delay for
> * the softlockup check.
> */
> - for_each_cpu(cpu, &watchdog_allowed_mask)
> + for_each_cpu(cpu, &watchdog_allowed_mask) {
> per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
> - wq_watchdog_touch(-1);
> + wq_watchdog_touch(cpu);
Note that wq_watchdog_touch(cpu) now always updates
wq_watchdog_touched. This loop will write the same jiffies
value to the same variable once for every CPU.
> + }
> }
>
> void touch_softlockup_watchdog_sync(void)
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 0d150da..be08295
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -5787,22 +5787,17 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
> continue;
>
> /* get the latest of pool and touched timestamps */
> + if (pool->cpu >= 0)
> + touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));
> + else
> + touched = READ_ONCE(wq_watchdog_touched);
> pool_ts = READ_ONCE(pool->watchdog_ts);
> - touched = READ_ONCE(wq_watchdog_touched);
>
> if (time_after(pool_ts, touched))
> ts = pool_ts;
> else
> ts = touched;
>
> - if (pool->cpu >= 0) {
> - unsigned long cpu_touched =
> - READ_ONCE(per_cpu(wq_watchdog_touched_cpu,
> - pool->cpu));
> - if (time_after(cpu_touched, ts))
> - ts = cpu_touched;
> - }
> -
> /* did we stall? */
> if (time_after(jiffies, ts + thresh)) {
> lockup_detected = true;
> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
> {
> if (cpu >= 0)
> per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
> - else
> - wq_watchdog_touched = jiffies;
> +
> + wq_watchdog_touched = jiffies;
> }
>
> static void wq_watchdog_set_thresh(unsigned long thresh)
This last hunk is enough to fix the problem. wq_watchdog_touched will
also get updated by the CPU-specific touch_softlockup_watchdog().
The original patch simplified the logic of wq_watchdog_timer_fn(),
but it added unnecessary assignments to
touch_all_softlockup_watchdogs().
I do not have a strong opinion about which solution is better. I slightly
prefer to keep only this last hunk.
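For illustration, here is a user-space sketch of the variant that keeps only
the last hunk: the touch function always refreshes the global timestamp, and
the timer check keeps its existing "latest of three timestamps" logic. It is a
model under stated assumptions, not the kernel implementation: NR_CPUS, the
plain global jiffies, the array standing in for the per-CPU variable, and the
helpers pool_stalled() and demo_unbound_after_cpu_touch() are hypothetical,
and time_after() is a simplified copy of the kernel macro.

```c
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count for the model */

/* Simplified copy of the kernel macro: true if a is after b. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

static unsigned long jiffies;				/* model of the tick counter */
static unsigned long wq_watchdog_touched;		/* global touch timestamp */
static unsigned long wq_watchdog_touched_cpu[NR_CPUS];	/* stands in for the per-CPU variable */

/* Last hunk only: the global timestamp is updated on every touch. */
static void wq_watchdog_touch(int cpu)
{
	if (cpu >= 0)
		wq_watchdog_touched_cpu[cpu] = jiffies;

	wq_watchdog_touched = jiffies;
}

/* Unchanged timer logic: take the latest of pool, global and per-CPU stamps. */
static bool pool_stalled(int pool_cpu, unsigned long pool_ts, unsigned long thresh)
{
	unsigned long touched = wq_watchdog_touched;
	unsigned long ts = time_after(pool_ts, touched) ? pool_ts : touched;

	if (pool_cpu >= 0) {
		unsigned long cpu_touched = wq_watchdog_touched_cpu[pool_cpu];

		if (time_after(cpu_touched, ts))
			ts = cpu_touched;
	}

	return time_after(jiffies, ts + thresh);
}

/* A touch on CPU 1 now also keeps an unbound pool (cpu == -1) from being flagged. */
static bool demo_unbound_after_cpu_touch(void)
{
	jiffies = 1000;
	wq_watchdog_touch(1);
	return pool_stalled(-1, 100, 300);
}
```

With this variant touch_all_softlockup_watchdogs() can keep its single
wq_watchdog_touch(-1) call, so the global timestamp is not rewritten once per
CPU.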
Best Regards,
Petr