linux-kernel - Re:Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ADYALADXDuGCRBjCE77aRKpD.3.1616552206707.Hmail.wangqing@vivo.com>
Date:   Wed, 24 Mar 2021 10:16:46 +0800 (GMT+08:00)
From:   王擎 <wangqing@...o.com>
To:     Petr Mladek <pmladek@...e.com>, Tejun Heo <tj@...nel.org>
Cc:     Lai Jiangshan <jiangshanlai@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Guilherme G. Piccoli" <gpiccoli@...onical.com>,
        Andrey Ignatov <rdna@...com>, Vlastimil Babka <vbabka@...e.cz>,
        Santosh Sivaraj <santosh@...six.org>,
        linux-kernel@...r.kernel.org
Subject: Re:Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking


>On Tue 2021-03-23 20:37:35, 王擎 wrote:
>> 
>> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
>> >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
>> >> updated, while the unbound worker_pool running on its core uses 
>> >> wq_watchdog_touched to determine whether locked up. This may be mischecked.
>> >
>> >By other words, unbound workqueues are not aware of the more common
>> >touch_softlockup_watchdog() because it updates only
>> >wq_watchdog_touched_cpu for the affected CPU. As a result,
>> >the workqueue watchdog might report lockup in unbound workqueue
>> >even though it is blocked by a known slow code.
>> 
>> Yes, this is the problem I'm talking about.
>
>I thought more about it. This patch prevents a false positive.
>Could it bring an opposite problem and hide real problems?
>
>I mean that an unbound workqueue might get blocked on CPU A
>because of a real softlockup. But we might not notice it because
>CPU B is touched. Well, there are other ways how to detect
>this situation, e.g. the softlockup watchdog.
>
>
>> >> My suggestion is to update both when touch_softlockup_watchdog() is called, 
>> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
>> >> to check unbound worker_pool.
>> >> 
>> >> Signed-off-by: Wang Qing <wangqing@...o.com>
>> >> ---
>> >>  kernel/watchdog.c  |  5 +++--
>> >>  kernel/workqueue.c | 17 ++++++-----------
>> >>  2 files changed, 9 insertions(+), 13 deletions(-)
>> >> 
>> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> >> index 7110906..107bc38
>> >> --- a/kernel/watchdog.c
>> >> +++ b/kernel/watchdog.c
>> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
>> >>  	 * update as well, the only side effect might be a cycle delay for
>> >>  	 * the softlockup check.
>> >>  	 */
>> >> -	for_each_cpu(cpu, &watchdog_allowed_mask)
>> >> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
>> >>  		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>> >> -	wq_watchdog_touch(-1);
>> >> +		wq_watchdog_touch(cpu);
>> >
>> >Note that wq_watchdog_touch(cpu) newly always updates
>> >wq_watchdog_touched. This cycle will set the same jiffies
>> >value cpu-times to the same variable.
>> >
>> Although there is a bit of redundancy here, but the most concise way of 
>> implementation, and it is certain that it will not affect performance.
>> 
Another way to implement is wq_watchdog_touch() remain unchanged, but need 
to modify touch_softlockup_watchdog() and touch_all_softlockup_watchdogs():
notrace void touch_softlockup_watchdog(void)
{
	touch_softlockup_watchdog_sched();
	wq_watchdog_touch(raw_smp_processor_id());
+     wq_watchdog_touch(-1);
}
void touch_all_softlockup_watchdogs(void)
 * update as well, the only side effect might be a cycle delay for
 * the softlockup check.
*/
 -	for_each_cpu(cpu, &watchdog_allowed_mask)
+	for_each_cpu(cpu, &watchdog_allowed_mask) {
  		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
 +		wq_watchdog_touch(cpu);
 +	}
 	wq_watchdog_touch(-1);
  }
So wq_watchdog_touched will not get updated many times,  
which do you think is better, Petr?
>> >>  
>> >>  void touch_softlockup_watchdog_sync(void)
>> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> >> index 0d150da..be08295
>> >> --- a/kernel/workqueue.c
>> >> +++ b/kernel/workqueue.c
>> >> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
>> >>  {
>> >>  	if (cpu >= 0)
>> >>  		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
>> >> -	else
>> >> -		wq_watchdog_touched = jiffies;
>> >> +
>> >> +	wq_watchdog_touched = jiffies;
>> >>  }
>> >>  
>> >>  static void wq_watchdog_set_thresh(unsigned long thresh)
>> >
>> >This last hunk is enough to fix the problem. wq_watchdog_touched will
>> >get updated also from cpu-specific touch_softlockup_watchdog().
>> 
>> Not enough in fact, because wq_watchdog_touched was updated in wq_watchdog_touch(), 
>> so wq_watchdog_touched can no longer be used to judge the bound pool, we must update 
>> every wq_watchdog_touched_cpu in touch_all_softlockup_watchdogs() for bound judgment.
>
>I see. Your patch makes sense.
>
>I would just improve the commit message. It should better explain
>why all the twists are needed. I would write something like:
>
><proposal>
>Subject: workqueue/watchdog: Make unbound workqueues aware of
>touch_softlockup_watchdog()
>
>There are two workqueue-specific watchdog timestamps:
>
>    + @wq_watchdog_touched_cpu (per-CPU) updated by
>      touch_softlockup_watchdog()
>
>    + @wq_watchdog_touched (global) updated by
>      touch_all_softlockup_watchdogs()
>
>watchdog_timer_fn() checks only the global @wq_watchdog_touched for
>unbound workqueues. As a result, unbound workqueues are not aware
>of touch_softlockup_watchdog(). The watchdog might report a stall
>even when the unbound workqueues are blocked by a known slow code.
>
>Solution:
>
>touch_softlockup_watchdog() must touch also the global
>@wq_watchdog_touched timestamp.
>
>The global timestamp can not longer be used for bound workqueues
>because it is updated on all CPUs. Instead, bound workqueues
>have to check only @wq_watchdog_touched_cpu and these timestamp
>has to be updated for all CPUs in touch_all_softlockup_watchdogs().
>
>Beware:
>
>The change might cause the opposite problem. An unbound workqueue
>might get blocked on CPU A because of a real softlockup. The workqueue
>watchdog would miss it when the timestamp got touched on CPU B.
>
>It is acceptable because softlockups are detected by softlockup
>watchdog. The workqueue watchdog is there to detect stalls where
>a work never finishes, for example, because of dependencies of works
>queued into the same workqueue.
></proposal>
>
>How does that sound?

It's fine to me, what's your opinion, Tejun?

Thanks,
Qing
>
>Best Regards,
>Petr