linux-kernel - cpu hotplug : was: Re: [PATCH v3] hardlockup: detect hard lockups using secondary (buddy) CPUs

Message-ID: <ZFEqynvf5nqkzEvQ@alley>
Date:   Tue, 2 May 2023 17:23:45 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     Douglas Anderson <dianders@...omium.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Mark Rutland <mark.rutland@....com>,
        Randy Dunlap <rdunlap@...radead.org>,
        Will Deacon <will@...nel.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Sumit Garg <sumit.garg@...aro.org>,
        Daniel Thompson <daniel.thompson@...aro.org>,
        Ian Rogers <irogers@...gle.com>, ravi.v.shankar@...el.com,
        Marc Zyngier <maz@...nel.org>,
        linux-perf-users@...r.kernel.org,
        Stephane Eranian <eranian@...gle.com>,
        kgdb-bugreport@...ts.sourceforge.net, ito-yuichi@...itsu.com,
        linux-arm-kernel@...ts.infradead.org,
        Stephen Boyd <swboyd@...omium.org>,
        Masayoshi Mizuma <msys.mizuma@...il.com>,
        ricardo.neri@...el.com, Lecopzer Chen <lecopzer.chen@...iatek.com>,
        Chen-Yu Tsai <wens@...e.org>, Andi Kleen <ak@...ux.intel.com>,
        Colin Cross <ccross@...roid.com>,
        Matthias Kaehlcke <mka@...omium.org>,
        Guenter Roeck <groeck@...omium.org>,
        Tzung-Bi Shih <tzungbi@...omium.org>,
        Alexander Potapenko <glider@...gle.com>,
        AngeloGioacchino Del Regno 
        <angelogioacchino.delregno@...labora.com>,
        Geert Uytterhoeven <geert+renesas@...der.be>,
        Juergen Gross <jgross@...e.com>,
        Kees Cook <keescook@...omium.org>,
        Laurent Dufour <ldufour@...ux.ibm.com>,
        Liam Howlett <liam.howlett@...cle.com>,
        Masahiro Yamada <masahiroy@...nel.org>,
        Matthias Brugger <matthias.bgg@...il.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Miguel Ojeda <ojeda@...nel.org>,
        Nathan Chancellor <nathan@...nel.org>,
        Nick Desaulniers <ndesaulniers@...gle.com>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Sami Tolvanen <samitolvanen@...gle.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Zhaoyang Huang <zhaoyang.huang@...soc.com>,
        Zhen Lei <thunder.leizhen@...wei.com>,
        linux-kernel@...r.kernel.org, linux-mediatek@...ts.infradead.org
Subject: cpu hotplug : was: Re: [PATCH v3] hardlockup: detect hard lockups
 using secondary (buddy) CPUs

On Mon 2023-05-01 08:24:46, Douglas Anderson wrote:
> From: Colin Cross <ccross@...roid.com>
> 
> Implement a hardlockup detector that doesn't need any extra
> arch-specific support code to detect lockups. Instead of using
> something arch-specific we will use the buddy system, where each CPU
> watches out for another one. Specifically, each CPU will use its
> softlockup hrtimer to check that the next CPU is processing hrtimer
> interrupts by verifying that a counter is increasing.
> 
> --- /dev/null
> +++ b/kernel/watchdog_buddy_cpu.c
> +int watchdog_nmi_enable(unsigned int cpu)
> +{
> +	/*
> +	 * The new CPU will be marked online before the first hrtimer interrupt
> +	 * runs on it.

It does not need to be the first hrtimer interrupt. The CPU might have
been offlined/onlined repeatedly. The counter might have any value.
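
To illustrate why a carried-over counter matters, here is a minimal userspace
sketch. It assumes the buddy check is gated on every third hrtimer interrupt,
which the "3 sampling periods" wording suggests but the quoted hunk does not
show. model_check_due() and model_interrupts_until_first_check() are
hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed gating rule: the buddy check runs on every third interrupt. */
static bool model_check_due(unsigned long hrtimer_interrupts)
{
	return hrtimer_interrupts % 3 == 0;
}

/*
 * Count how many hrtimer interrupts pass after onlining before the first
 * check is due, given the counter value left over from an earlier online
 * period. The counter is incremented first and tested afterwards (assumed
 * ordering).
 */
static unsigned int model_interrupts_until_first_check(unsigned long carried_over)
{
	unsigned long count = carried_over;
	unsigned int ticks = 0;

	do {
		count++;
		ticks++;
	} while (!model_check_due(count));

	/* With a modulo-3 gate the answer is always between 1 and 3. */
	assert(ticks >= 1 && ticks <= 3);
	return ticks;
}
```

With a counter that starts at 0, the first check is due after three
interrupts. With a carried-over value of 5, it is due on the very first
interrupt after the CPU comes back online.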

> +	 * If another CPU tests for a hardlockup on the new CPU
> +	 * before it has run its first hrtimer, it will get a false positive.
> +	 * Touch the watchdog on the new CPU to delay the first check for at
> +	 * least 3 sampling periods to guarantee one hrtimer has run on the new
> +	 * CPU.
> +	 */
> +	per_cpu(watchdog_touch, cpu) = true;

We should also touch next_cpu:

	/*
	 * We are going to check the next CPU. Our watchdog_hrtimer
	 * need not be zero if the CPU has already been online earlier.
	 * Touch the watchdog on the next CPU to avoid a false positive
	 * if we try to check it in fewer than 3 interrupts.
	 */
	next_cpu = watchdog_next_cpu(cpu);
	if (next_cpu < nr_cpu_ids)
		per_cpu(watchdog_touch, next_cpu) = true;

An alternative would be to clear watchdog_hrtimer, but that would also
affect the softlockup detector.
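
The handover can be modelled in userspace as a rough sketch. The array names
mirror the per-CPU variables in the patch, but all model_*() helpers are
hypothetical, and the check logic (skip once if touched, otherwise compare
the counter with its saved copy) is an assumption about
watchdog_check_hardlockup(), which is not quoted here.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Userspace stand-ins for watchdog_cpus and the per-CPU variables. */
static bool cpu_watched[MODEL_NR_CPUS];
static unsigned long hrtimer_interrupts[MODEL_NR_CPUS];
static unsigned long hrtimer_interrupts_saved[MODEL_NR_CPUS];
static bool watchdog_touch[MODEL_NR_CPUS];

/* Next watched CPU after 'cpu', wrapping around; MODEL_NR_CPUS if none. */
static unsigned int model_next_cpu(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	for (unsigned int i = 1; i < MODEL_NR_CPUS; i++) {
		unsigned int next = (cpu + i) % MODEL_NR_CPUS;

		if (cpu_watched[next])
			return next;
	}
	return MODEL_NR_CPUS;
}

/* One buddy check run on 'cpu'; true means a hard lockup would be reported. */
static bool model_check_buddy(unsigned int cpu)
{
	unsigned int next = model_next_cpu(cpu);

	if (next >= MODEL_NR_CPUS)
		return false;
	if (watchdog_touch[next]) {
		/* A touched CPU is skipped once. */
		watchdog_touch[next] = false;
		return false;
	}
	if (hrtimer_interrupts[next] == hrtimer_interrupts_saved[next])
		return true;
	hrtimer_interrupts_saved[next] = hrtimer_interrupts[next];
	return false;
}

/* Bring 'cpu' online; optionally touch its buddy as well. */
static void model_enable(unsigned int cpu, bool touch_next)
{
	watchdog_touch[cpu] = true;
	if (touch_next) {
		unsigned int next = model_next_cpu(cpu);

		if (next < MODEL_NR_CPUS)
			watchdog_touch[next] = true;
	}
	cpu_watched[cpu] = true;
}

/* Returns true if onlining CPU 1 between CPU 0 and CPU 2 misfires. */
static bool model_handover_false_positive(bool touch_next)
{
	for (unsigned int i = 0; i < MODEL_NR_CPUS; i++) {
		cpu_watched[i] = false;
		watchdog_touch[i] = false;
		hrtimer_interrupts[i] = 0;
		hrtimer_interrupts_saved[i] = 0;
	}
	cpu_watched[0] = true;
	cpu_watched[2] = true;
	hrtimer_interrupts[2] = 7;	/* CPU 2 has been ticking for a while */

	model_check_buddy(0);		/* CPU 0 samples CPU 2 and saves 7 */
	model_enable(1, touch_next);	/* CPU 1 takes over watching CPU 2 */
	return model_check_buddy(1);	/* checked before CPU 2 ticks again */
}
```

In this model, model_handover_false_positive(false) reports a lockup because
CPU 1 sees the value that CPU 0 has just saved, while
model_handover_false_positive(true) does not, since the touch makes CPU 1
skip its first check of CPU 2.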


> +	/* Match with smp_rmb() in watchdog_check_hardlockup() */
> +	smp_wmb();
> +	cpumask_set_cpu(cpu, &watchdog_cpus);
> +	return 0;
> +}
> +
> +void watchdog_nmi_disable(unsigned int cpu)
> +{
> +	unsigned int next_cpu = watchdog_next_cpu(cpu);
> +
> +	/*
> +	 * Offlining this CPU will cause the CPU before this one to start
> +	 * checking the one after this one. If this CPU just finished checking
> +	 * the next CPU and updating hrtimer_interrupts_saved, and then the
> +	 * previous CPU checks it within one sample period, it will trigger a
> +	 * false positive. Touch the watchdog on the next CPU to prevent it.
> +	 */
> +	if (next_cpu < nr_cpu_ids)
> +		per_cpu(watchdog_touch, next_cpu) = true;
> +	/* Match with smp_rmb() in watchdog_check_hardlockup() */
> +	smp_wmb();
> +	cpumask_clear_cpu(cpu, &watchdog_cpus);
> +}
> +

Best Regards,
Petr