Message-ID: <75585b6fe937a23e380b5d61df4932e8e87f3485.camel@mediatek.com>
Date: Tue, 1 Apr 2025 05:27:45 +0000
From: Walter Chang (張維哲) <Walter.Chang@...iatek.com>
To: "frederic@...nel.org" <frederic@...nel.org>
CC: wsd_upstream <wsd_upstream@...iatek.com>, "boqun.feng@...il.com"
	<boqun.feng@...il.com>, "vlad.wing@...il.com" <vlad.wing@...il.com>,
	Cheng-Jui Wang (王正睿)
	<Cheng-Jui.Wang@...iatek.com>, "kernel-team@...a.com" <kernel-team@...a.com>,
	Alex Hoh (賀振坤) <Alex.Hoh@...iatek.com>,
	"usamaarif642@...il.com" <usamaarif642@...il.com>, "anna-maria@...utronix.de"
	<anna-maria@...utronix.de>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "neeraj.upadhyay@....com"
	<neeraj.upadhyay@....com>, "leitao@...ian.org" <leitao@...ian.org>,
	Freddy Hsin (辛恒豐) <Freddy.Hsin@...iatek.com>,
	"urezki@...il.com" <urezki@...il.com>, "tglx@...utronix.de"
	<tglx@...utronix.de>, "qiang.zhang1211@...il.com"
	<qiang.zhang1211@...il.com>, "paulmck@...nel.org" <paulmck@...nel.org>,
	Xinghua Yang (杨兴华) <Xinghua.Yang@...iatek.com>,
	"joel@...lfernandes.org" <joel@...lfernandes.org>, "rcu@...r.kernel.org"
	<rcu@...r.kernel.org>, Chun-Hung Wu (巫駿宏)
	<Chun-hung.Wu@...iatek.com>
Subject: Re: [PATCH v4] hrtimers: Force migrate away hrtimers queued after
 CPUHP_AP_HRTIMERS_DYING

On Wed, 2025-03-26 at 17:44 +0100, Frederic Weisbecker wrote:
> 
> It's not the first time I get such a report on an out of tree
> kernel. The problem is I don't know if the tainted modules are
> involved. But something is probably making an offline CPU visible within
> the hierarchy on get_nohz_timer_target(). And that new warning made
> that visible.
> 
> Can you try this and tell us if the warning fires?
> 
> Thanks.
> 
> diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
> index 6d67e9a5af6b..f49512628269 100644
> --- a/include/linux/sched/nohz.h
> +++ b/include/linux/sched/nohz.h
> @@ -9,6 +9,7 @@
>  #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>  extern void nohz_balance_enter_idle(int cpu);
>  extern int get_nohz_timer_target(void);
> +extern void assert_domain_online(void);
>  #else
>  static inline void nohz_balance_enter_idle(int cpu) { }
>  #endif
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 07455d25329c..98c8f8408403 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -13,6 +13,7 @@
>  #include <linux/sched/isolation.h>
>  #include <linux/sched/task.h>
>  #include <linux/sched/smt.h>
> +#include <linux/sched/nohz.h>
>  #include <linux/unistd.h>
>  #include <linux/cpu.h>
>  #include <linux/oom.h>
> @@ -1277,6 +1278,7 @@ static int take_cpu_down(void *_param)
>         if (err < 0)
>                 return err;
> 
> +       assert_domain_online();
>         /*
>          * Must be called from CPUHP_TEARDOWN_CPU, which means, as we are going
>          * down, that the current state is CPUHP_TEARDOWN_CPU - 1.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 175a5a7ac107..88157b1645cc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1163,6 +1163,20 @@ void resched_cpu(int cpu)
> 
>  #ifdef CONFIG_SMP
>  #ifdef CONFIG_NO_HZ_COMMON
> +void assert_domain_online(void)
> +{
> +       int cpu = smp_processor_id();
> +       int i;
> +       struct sched_domain *sd;
> +
> +       guard(rcu)();
> +
> +       for_each_domain(cpu, sd) {
> +               for_each_cpu(i, sched_domain_span(sd)) {
> +                       WARN_ON_ONCE(cpu_is_offline(i));
> +               }
> +       }
> +}
>  /*
>   * In the semi idle case, use the nearest busy CPU for migrating timers
>   * from an idle CPU.  This is good for power-savings.

Hi Frederic,

Thank you for providing the patch to debug the hrtimer warning issue.

I applied the patch and ran stress tests over the weekend. The warning
added by the patch did not fire during that period.

Additionally, after a thorough review of our internal tainted modules,
I can confirm that your assessment is correct. With our tainted modules
loaded, get_nohz_timer_target() may indeed return an offline CPU, which
leads to the hrtimer warning. We are working on a fix within those
modules.
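
For illustration only, here is a minimal userspace sketch of the kind of
consistency check the debug patch performs: walk every domain span and
count CPUs that are present in a span but not online. It is not kernel
code. The type cpumask_model_t and the function
count_offline_in_domains() are hypothetical names invented for this
sketch, and a plain 64-bit mask stands in for the kernel's cpumask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/*
 * Count the distinct CPUs that appear in at least one domain span
 * but are missing from the online mask. A non-zero result models the
 * situation in which the debug WARN_ON_ONCE() would fire.
 */
static int count_offline_in_domains(const cpumask_model_t *spans,
                                    size_t nr_spans,
                                    cpumask_model_t online)
{
    cpumask_model_t seen = 0;
    int offline = 0;

    assert(spans != NULL || nr_spans == 0);

    for (size_t d = 0; d < nr_spans; d++) {
        /* CPUs in this span that are offline and not yet counted. */
        cpumask_model_t stale = spans[d] & ~online & ~seen;

        seen |= stale;
        while (stale) {
            stale &= stale - 1; /* clear the lowest set bit */
            offline++;
        }
    }
    return offline;
}
```

With CPUs 0-3 online (mask 0x0F), spans { 0x03, 0x0F } yield 0, while
spans { 0x03, 0x1F } yield 1 because CPU 4 is still listed in a span
after going offline.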

Thanks again for your help in debugging this issue.

Best regards,
Walter Chang
