linux-kernel - Re: [PATCH v3 3/4] kgdb: Don't round up a CPU that failed rounding up before

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20181107115953.cnar4u2axi4poaxe@holly.lan>
Date:   Wed, 7 Nov 2018 11:59:53 +0000
From:   Daniel Thompson <daniel.thompson@...aro.org>
To:     Douglas Anderson <dianders@...omium.org>
Cc:     Jason Wessel <jason.wessel@...driver.com>,
        kgdb-bugreport@...ts.sourceforge.net,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 3/4] kgdb: Don't round up a CPU that failed rounding
 up before

On Tue, Nov 06, 2018 at 05:00:27PM -0800, Douglas Anderson wrote:
> If we're using the default implementation of kgdb_roundup_cpus() that
> uses smp_call_function_single_async() we can end up hanging
> kgdb_roundup_cpus() if we try to round up a CPU that failed to round
> up before.
> 
> Specifically smp_call_function_single_async() will try to wait on the
> csd lock for the CPU that we're trying to round up.  If the previous
> round up never finished then that lock could still be held and we'll
> just sit there hanging.
> 
> There's not a lot of use trying to round up a CPU that failed to round
> up before.  Let's keep a flag that indicates whether the CPU started
> but didn't finish to round up before.  If we see that flag set then
> we'll skip the next round up.
> 
> In general we have a few goals here:
> - We never want to end up calling smp_call_function_single_async()
>   when the csd is still locked.  This is accomplished because
>   flush_smp_call_function_queue() unlocks the csd _before_ invoking
>   the callback.  That means that when kgdb_nmicallback() runs we know
>   for sure the the csd is no longer locked.  Thus when we set
>   "rounding_up = false" we know for sure that the csd is unlocked.
> - If there are no timeouts rounding up we should never skip a round
>   up.
> 
> NOTE #1: In general trying to continue running after failing to round
> up CPUs doesn't appear to be supported in the debugger.  When I
> simulate this I find that kdb reports "Catastrophic error detected"
> when I try to continue.  I can overrule and continue anyway, but it
> should be noted that we may be entering the land of dragons here.

It's been quite a while but AFAIR I decided to set the catastrophic
error here *because* the stuck csd lock would make restarting fragile.

So arguably we are now able to remove the code that sets this flag when
a CPU fails to round up.


> NOTE #3: setting 'kgdb_info[cpu].rounding_up = false' is in
> kgdb_nmicallback() instead of kgdb_call_nmi_hook() because some
> implementations override kgdb_call_nmi_hook().  It shouldn't hurt to
> have it in kgdb_nmicallback() in any case.

Slightly icky but I guess this is OK.

> diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> index 23f2b5613afa..324cba8917f1 100644
> --- a/kernel/debug/debug_core.c
> +++ b/kernel/debug/debug_core.c
> @@ -246,6 +246,20 @@ void __weak kgdb_roundup_cpus(void)
>  			continue;
>  
>  		csd = &per_cpu(kgdb_roundup_csd, cpu);
> +
> +		/*
> +		 * If it didn't round up last time, don't try again
> +		 * since smp_call_function_single_async() will block.
> +		 *
> +		 * If rounding_up is false then we know that the
> +		 * previous call must have at least started and that
> +		 * means smp_call_function_single_async() won't block.
> +		 */
> +		smp_mb();

Not commented and I suspect this may have no useful effect. What 
(harmful) orderings does this barrier render impossible?


> +		if (kgdb_info[cpu].rounding_up)
> +			continue;
> +		kgdb_info[cpu].rounding_up = true;
> +
>  		csd->func = kgdb_call_nmi_hook;
>  		smp_call_function_single_async(cpu, csd);
>  	}
> @@ -782,6 +796,9 @@ int kgdb_nmicallback(int cpu, void *regs)
>  	struct kgdb_state kgdb_var;
>  	struct kgdb_state *ks = &kgdb_var;
>  
> +	kgdb_info[cpu].rounding_up = false;
> +	smp_mb();

Also not commented. Here I think the barrier may have a purpose (to
ensure rounding_up gets cleared before we peek at dbg_master_lock) but
if that is the case we need to comment it.
 

Daniel.