linux-kernel - Re: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long


Message-ID: <a459fe36-3077-48f1-bcd4-63a07f4866f3@paulmck-laptop>
Date:   Fri, 6 Oct 2023 13:26:00 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Jonas Oberhauser <jonas.oberhauser@...weicloud.com>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Valentin Schneider <vschneid@...hat.com>,
        Juergen Gross <jgross@...e.com>,
        Leonardo Bras <leobras@...hat.com>,
        Imran Khan <imran.f.khan@...cle.com>
Subject: Re: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long

On Fri, Oct 06, 2023 at 08:48:23PM +0200, Jonas Oberhauser wrote:
> Is this related to the qspinlock issue you described earlier?

Kind of, in that qspinlock issues sometimes trigger CSD-lock warnings,
but the two are not directly related.

							Thanx, Paul

> jonas
> 
> 
> On 10/5/2023 at 6:48 PM, Paul E. McKenney wrote:
> > The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> > temporarily, it usually gets released in a few seconds, and sometimes
> > up to one or two minutes.
> > 
> > If the CSD lock stays stuck for more than several minutes, it never
> > seems to get unstuck, and gradually more and more things in the system
> > end up also getting stuck.
> > 
> > In the latter case, we should just give up, so the system can dump out
> > a little more information about what went wrong, and, with panic_on_oops
> > and a kdump kernel loaded, dump a whole bunch more information about
> > what might have gone wrong.
> > 
> > Question: should this have its own panic_on_ipistall switch in
> > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
> > way than via BUG_ON?
> > 
> > Signed-off-by: Rik van Riel <riel@...riel.com>
> > Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
> > 
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index 8455a53465af..059f1f53fc6b 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -230,6 +230,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> >   	}
> >   	ts2 = sched_clock();
> > +	/* How long since we last checked for a stuck CSD lock.*/
> >   	ts_delta = ts2 - *ts1;
> >   	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
> >   		return false;
> > @@ -243,9 +244,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> >   	else
> >   		cpux = cpu;
> >   	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
> > +	/* How long since this CSD lock was stuck. */
> > +	ts_delta = ts2 - ts0;
> >   	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
> > -		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
> > +		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
> >   		 cpu, csd->func, csd->info);
> > +	/*
> > +	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
> > +	 * to become unstuck. Use a signed comparison to avoid triggering
> > +	 * on underflows when the TSC is out of sync between sockets.
> > +	 */
> > +	BUG_ON((s64)ts_delta > 300000000000LL);
> >   	if (cpu_cur_csd && csd != cpu_cur_csd) {
> >   		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
> >   			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
>
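
For readers following the signed comparison in the quoted BUG_ON(), here
is a minimal user-space sketch of why the cast matters. It is not kernel
code: the function names and the CSD_STUCK_LIMIT_NS macro are made up for
illustration, and only the 300000000000LL constant and the (s64) cast
come from the patch. It assumes the usual two's-complement conversion
from uint64_t to int64_t, which GCC and Clang provide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the patch's 5-minute limit, in nanoseconds. */
#define CSD_STUCK_LIMIT_NS 300000000000LL

/*
 * Unsigned comparison: if ts2 is slightly earlier than ts0 (clocks out
 * of sync), the subtraction wraps to a value near 2^64 and looks like
 * an enormous wait.
 */
static bool stuck_unsigned(uint64_t ts0, uint64_t ts2)
{
	uint64_t delta = ts2 - ts0;

	return delta > (uint64_t)CSD_STUCK_LIMIT_NS;
}

/*
 * Signed comparison, mirroring the (s64) cast in the patch: the same
 * wrapped delta is reinterpreted as a small negative number, so it
 * does not exceed the limit.
 */
static bool stuck_signed(uint64_t ts0, uint64_t ts2)
{
	uint64_t delta = ts2 - ts0;

	return (int64_t)delta > CSD_STUCK_LIMIT_NS;
}
```

With ts0 = 1000 and ts2 = 900 (the clock appears to have gone backwards
by 100 ns), stuck_unsigned() reports a stuck lock while stuck_signed()
does not. Both agree once the real elapsed time passes five minutes.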