linux-kernel - Re: [PATCH] watchdog/hardlockup: Avoid large stack frames in watchdog_hardlockup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZMthXBpLzbbysTe5@alley>
Date:   Thu, 3 Aug 2023 10:12:12 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     Doug Anderson <dianders@...omium.org>
Cc:     Michal Hocko <mhocko@...e.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        kernel test robot <lkp@...el.com>,
        Lecopzer Chen <lecopzer.chen@...iatek.com>,
        Pingfan Liu <kernelfans@...il.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] watchdog/hardlockup: Avoid large stack frames in
 watchdog_hardlockup_check()

On Wed 2023-08-02 07:12:29, Doug Anderson wrote:
> Hi,
> 
> On Wed, Aug 2, 2023 at 12:27 AM Michal Hocko <mhocko@...e.com> wrote:
> >
> > On Tue 01-08-23 08:41:49, Doug Anderson wrote:
> > [...]
> > > Ah, I see what you mean. The one issue I have with your solution is
> > > that the ordering of the stack crawls is less ideal in the "dump all"
> > > case when cpu != this_cpu. We really want to see the stack crawl of
> > > the locked up CPU first and _then_ see the stack crawls of other CPUs.
> > > With your solution the locked up CPU will be interspersed with all the
> > > others and will be harder to find in the output (you've got to match
> > > it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> > > While that's probably not a huge deal, it's nicer to make the output
> > > easy to understand for someone trying to parse it...
> >
> > Is it worth to waste memory for this arguably nicer output? Identifying
> > the stack of the locked up CPU is trivial.
> 
> I guess it's debatable, but as someone who has spent time staring at
> trawling through reports generated like this, I'd say "yes", it's
> super helpful in understanding the problem to have the hung CPU first.
> Putting the memory usage in perspective:

nmi_trigger_cpumask_backtrace() has its own copy of the cpu mask.
What about changing the @exclude_self parameter to @exclude_cpu
and do:

	if (exclude_cpu >= 0)
		cpumask_clear_cpu(exclude_cpu, to_cpumask(backtrace_mask));


It would require changing also arch_trigger_cpumask_backtrace() to

	void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
				    int exclude_cpu);

but it looks doable.


It might theoretically change the behavior because
nmi_trigger_cpumask_backtrace() does

	this_cpu = get_cpu();

I guess that it was used to make sure that smp_processor_id()
the same. But it is too late. It should ignore the callers CPU.
So, this might actually fix a possible race.

Note that get_cpu() also disabled preemption. IMHO, it is not strictly needed.
But it might make sense to keep it because the backtraces are printed
when something goes wrong and it should print backtraces from the current state.

Best Regards,
Petr