linux-kernel - Re: [PATCH v2 3/3] locking/csd-lock: Use backoff for repeated reports of same incident


Message-ID: <Zqq0M92zcR1kcuKz@LeoBras>
Date: Wed, 31 Jul 2024 19:01:23 -0300
From: Leonardo Bras <leobras@...hat.com>
To: neeraj.upadhyay@...nel.org
Cc: Leonardo Bras <leobras@...hat.com>,
	linux-kernel@...r.kernel.org,
	rcu@...r.kernel.org,
	kernel-team@...a.com,
	rostedt@...dmis.org,
	mingo@...nel.org,
	peterz@...radead.org,
	paulmck@...nel.org,
	imran.f.khan@...cle.com,
	riel@...riel.com,
	tglx@...utronix.de
Subject: Re: [PATCH v2 3/3] locking/csd-lock: Use backoff for repeated reports of same incident

On Mon, Jul 22, 2024 at 07:07:35PM +0530, neeraj.upadhyay@...nel.org wrote:
> From: "Paul E. McKenney" <paulmck@...nel.org>
> 
> Currently, the CSD-lock diagnostics in CONFIG_CSD_LOCK_WAIT_DEBUG=y
> kernels are emitted at five-second intervals.  Although this has proven
> to be a good time interval for the first diagnostic, if the target CPU
> keeps interrupts disabled for way longer than five seconds, the ratio
> of useful new information to pointless repetition decreases considerably.
> 
> Therefore, back off the time period for repeated reports of the same
> incident, increasing linearly with the number of reports and logarithmically
> with the number of online CPUs.
> 
> [ paulmck: Apply Dan Carpenter feedback. ]
> 
> Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
> Cc: Imran Khan <imran.f.khan@...cle.com>
> Cc: Ingo Molnar <mingo@...nel.org>
> Cc: Leonardo Bras <leobras@...hat.com>
> Cc: "Peter Zijlstra (Intel)" <peterz@...radead.org>
> Cc: Rik van Riel <riel@...riel.com>
> Reviewed-by: Rik van Riel <riel@...riel.com>
> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
> ---
>  kernel/smp.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 9385cc05de53..dfcde438ef63 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -225,7 +225,7 @@ bool csd_lock_is_stuck(void)
>   * the CSD_TYPE_SYNC/ASYNC types provide the destination CPU,
>   * so waiting on other types gets much less information.
>   */
> -static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)
> +static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id, unsigned long *nmessages)
>  {
>  	int cpu = -1;
>  	int cpux;
> @@ -248,7 +248,9 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
>  	ts2 = sched_clock();
>  	/* How long since we last checked for a stuck CSD lock.*/
>  	ts_delta = ts2 - *ts1;
> -	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
> +	if (likely(ts_delta <= csd_lock_timeout_ns * (*nmessages + 1) *
> +			       (!*nmessages ? 1 : (ilog2(num_online_cpus()) / 2 + 1)) ||
> +		   csd_lock_timeout_ns == 0))

I think this is a nice change.

OTOH, the above condition is quite hard to read IMHO.

IIUC, with csd_lock_timeout_ns = 5s and num_online_cpus() = 64, you want:
1st message: after 5s
2nd message: after 5 * 2 * (6 / 2 + 1) = 10 * 4 = 40s
3rd message: after 5 * 3 * 4 = 60s
...
Is that correct?
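
To double-check the arithmetic above, here is a minimal userspace sketch (not kernel code). ilog2_u() is a stand-in I wrote for the kernel's ilog2(), and csd_backoff_timeout() is a hypothetical helper name; neither exists in kernel/smp.c. Units are whatever the caller passes in (seconds in the example above).

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's ilog2(): floor(log2(n)) for n >= 1. */
static unsigned int ilog2_u(unsigned long n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/*
 * Hypothetical helper: how long to wait before the next report, given the
 * base timeout, the number of reports already emitted, and the number of
 * online CPUs. Mirrors the expression in the patch; a base of 0 stays 0.
 */
static uint64_t csd_backoff_timeout(uint64_t base, unsigned long nmessages,
				    unsigned int ncpus)
{
	if (!nmessages)
		return base;	/* first report: plain timeout */
	return base * (nmessages + 1) * (ilog2_u(ncpus) / 2 + 1);
}
```

With base = 5 and ncpus = 64, calling this with nmessages = 0, 1, 2 should give the 5, 40 and 60 listed above, assuming I read the expression correctly.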


I think this could be achieved with:

	/* How long since we last checked for a stuck CSD lock.*/
	ts_delta = ts2 - *ts1;
+	if (*nmessages)
+		csd_lock_timeout_ns *= (*nmessages + 1) * (ilog2(num_online_cpus()) / 2 + 1);
	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
		return false;

Does that look better?
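
For what it's worth, the two forms should be equivalent, including the csd_lock_timeout_ns == 0 case (0 times anything is still 0). A rough userspace sketch of both conditions side by side, assuming csd_lock_timeout_ns is a local copy that is safe to scale; ilog2_sketch(), wait_more_patch() and wait_more_split() are names made up for this illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's ilog2(): floor(log2(n)) for n >= 1. */
static unsigned int ilog2_sketch(unsigned long n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* Condition as written in the patch; true means "keep waiting, no report". */
static bool wait_more_patch(uint64_t ts_delta, uint64_t timeout_ns,
			    unsigned long nmessages, unsigned int ncpus)
{
	return ts_delta <= timeout_ns * (nmessages + 1) *
			   (!nmessages ? 1 : (ilog2_sketch(ncpus) / 2 + 1)) ||
	       timeout_ns == 0;
}

/* Same condition with the scaling split out, as suggested in this reply. */
static bool wait_more_split(uint64_t ts_delta, uint64_t timeout_ns,
			    unsigned long nmessages, unsigned int ncpus)
{
	if (nmessages)
		timeout_ns *= (nmessages + 1) * (ilog2_sketch(ncpus) / 2 + 1);
	return ts_delta <= timeout_ns || timeout_ns == 0;
}
```

Both take the timeout by value, so the scaling never leaks out of the call.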

Thanks!
Leo

>  
>  	firsttime = !*bug_id;
> @@ -265,6 +267,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
>  	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %lld ns for CPU#%02d %pS(%ps).\n",
>  		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), (s64)ts_delta,
>  		 cpu, csd->func, csd->info);
> +	(*nmessages)++;
>  	if (firsttime)
>  		atomic_inc(&n_csd_lock_stuck);
>  	/*
> @@ -305,12 +308,13 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
>   */
>  static void __csd_lock_wait(call_single_data_t *csd)
>  {
> +	unsigned long nmessages = 0;
>  	int bug_id = 0;
>  	u64 ts0, ts1;
>  
>  	ts1 = ts0 = sched_clock();
>  	for (;;) {
> -		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id))
> +		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
>  			break;
>  		cpu_relax();
>  	}
> -- 
> 2.40.1
>