linux-kernel - Re: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long

Message-ID: <ZSlhnu0n9eOfkN-U@redhat.com>
Date:   Fri, 13 Oct 2023 12:26:22 -0300
From:   Leonardo Bras <leobras@...hat.com>
To:     "Paul E. McKenney" <paulmck@...nel.org>
Cc:     Imran Khan <imran.f.khan@...cle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Valentin Schneider <vschneid@...hat.com>,
        Juergen Gross <jgross@...e.com>,
        Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too
 long

On Mon, Oct 09, 2023 at 09:39:38AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote:
> > Hello Paul,
> > 
> > On 6/10/2023 3:48 am, Paul E. McKenney wrote:
> > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> > > temporarily, it usually gets released in a few seconds, and sometimes
> > > up to one or two minutes.
> > > 
> > > If the CSD lock stays stuck for more than several minutes, it never
> > > seems to get unstuck, and gradually more and more things in the system
> > > end up also getting stuck.
> > > 
> > > In the latter case, we should just give up, so the system can dump out
> > > a little more information about what went wrong, and, with panic_on_oops
> > > and a kdump kernel loaded, dump a whole bunch more information about
> > > what might have gone wrong.
> > > 
> > > Question: should this have its own panic_on_ipistall switch in
> > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
> > > way than via BUG_ON?
> > > 
> > panic_on_ipistall (set to 1 by default) looks better option to me. For systems
> > where such delay is acceptable and system can eventually get back to sane state,
> > this option (set to 0 after boot) would prevent crashing the system for
> > apparently benign CSD hangs of long duration.
> 
> Good point!  How about like the following?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 6bcf3786291b86f13b3e13d51e998737a8009ec3
> Author: Rik van Riel <riel@...riel.com>
> Date:   Mon Aug 21 16:04:09 2023 -0400
> 
>     smp,csd: Throw an error if a CSD lock is stuck for too long
>     
>     The CSD lock seems to get stuck in 2 "modes". When it gets stuck
>     temporarily, it usually gets released in a few seconds, and sometimes
>     up to one or two minutes.
>     
>     If the CSD lock stays stuck for more than several minutes, it never
>     seems to get unstuck, and gradually more and more things in the system
>     end up also getting stuck.
>     
>     In the latter case, we should just give up, so the system can dump out
>     a little more information about what went wrong, and, with panic_on_oops
>     and a kdump kernel loaded, dump a whole bunch more information about what
>     might have gone wrong.  In addition, there is an smp.panic_on_ipistall
>     kernel boot parameter that by default retains the old behavior, but when
>     set enables the panic after the CSD lock has been stuck for more than
>     five minutes.
>     
>     [ paulmck: Apply Imran Khan feedback. ]
>     
>     Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/
>     Signed-off-by: Rik van Riel <riel@...riel.com>
>     Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
>     Cc: Peter Zijlstra <peterz@...radead.org>
>     Cc: Valentin Schneider <vschneid@...hat.com>
>     Cc: Juergen Gross <jgross@...e.com>
>     Cc: Jonathan Corbet <corbet@....net>
>     Cc: Randy Dunlap <rdunlap@...radead.org>
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 0a1731a0f0ef..592935267ce2 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5858,6 +5858,11 @@
>  			This feature may be more efficiently disabled
>  			using the csdlock_debug- kernel parameter.
>  
> +	smp.panic_on_ipistall= [KNL]
> +			If a csd_lock_timeout extends for more than
> +			five minutes, panic the system.  By default, let
> +			CSD-lock acquisition take as long as they take.
> +

It could be interesting to make this an s64 parameter (in seconds or
milliseconds) instead of a bool, so that the user can pick how long to wait
before the panic happens. A value of 0 or -1 could mean disabled.
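
As a rough sketch of that idea (a userspace model, not kernel code): the
parameter name panic_on_ipistall_ms, the millisecond unit, and the helper
csd_stall_should_panic() are all hypothetical and not part of the patch.
int64_t/uint64_t stand in for the kernel's s64/u64. The signed delta mirrors
the (s64) cast in the patch, so an out-of-sync clock does not trigger the
panic.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000LL

/*
 * Hypothetical replacement for the bool: timeout in milliseconds.
 * Zero or any negative value (e.g. -1) disables the panic.
 */
static int64_t panic_on_ipistall_ms = 0;

/*
 * Userspace model of the check. In the kernel, the result would be
 * what gets passed to BUG_ON(). ts0 is when the wait started and ts2
 * is the current timestamp, both in nanoseconds.
 */
static bool csd_stall_should_panic(uint64_t ts0, uint64_t ts2)
{
	/* Signed delta: if ts2 < ts0, the wrapped value becomes negative. */
	int64_t ts_delta = (int64_t)(ts2 - ts0);

	if (panic_on_ipistall_ms <= 0)
		return false;

	/* Compare in milliseconds so a large timeout cannot overflow. */
	return ts_delta / NSEC_PER_MSEC > panic_on_ipistall_ms;
}
```

With panic_on_ipistall_ms = 300000 this approximates the five-minute
threshold in the patch, at millisecond rather than nanosecond granularity.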

What do you think?

Other than that,
Reviewed-by: Leonardo Bras <leobras@...hat.com>


>  	smsc-ircc2.nopnp	[HW] Don't use PNP to discover SMC devices
>  	smsc-ircc2.ircc_cfg=	[HW] Device configuration I/O port
>  	smsc-ircc2.ircc_sir=	[HW] SIR base I/O port
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 8455a53465af..b6a0773a7015 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -170,6 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info);
>  
>  static ulong csd_lock_timeout = 5000;  /* CSD lock timeout in milliseconds. */
>  module_param(csd_lock_timeout, ulong, 0444);
> +static bool panic_on_ipistall;
> +module_param(panic_on_ipistall, bool, 0444);
>  
>  static atomic_t csd_bug_count = ATOMIC_INIT(0);
>  
> @@ -230,6 +232,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
>  	}
>  
>  	ts2 = sched_clock();
> +	/* How long since we last checked for a stuck CSD lock.*/
>  	ts_delta = ts2 - *ts1;
>  	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
>  		return false;
> @@ -243,9 +246,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
>  	else
>  		cpux = cpu;
>  	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
> +	/* How long since this CSD lock was stuck. */
> +	ts_delta = ts2 - ts0;
>  	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
> -		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
> +		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
>  		 cpu, csd->func, csd->info);
> +	/*
> +	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
> +	 * to become unstuck. Use a signed comparison to avoid triggering
> +	 * on underflows when the TSC is out of sync between sockets.
> +	 */
> +	BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL);
>  	if (cpu_cur_csd && csd != cpu_cur_csd) {
>  		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
>  			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
>