Message-ID: <ZMthXBpLzbbysTe5@alley>
Date:   Thu, 3 Aug 2023 10:12:12 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     Doug Anderson <dianders@...omium.org>
Cc:     Michal Hocko <mhocko@...e.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        kernel test robot <lkp@...el.com>,
        Lecopzer Chen <lecopzer.chen@...iatek.com>,
        Pingfan Liu <kernelfans@...il.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] watchdog/hardlockup: Avoid large stack frames in
 watchdog_hardlockup_check()

On Wed 2023-08-02 07:12:29, Doug Anderson wrote:
> Hi,
> 
> On Wed, Aug 2, 2023 at 12:27 AM Michal Hocko <mhocko@...e.com> wrote:
> >
> > On Tue 01-08-23 08:41:49, Doug Anderson wrote:
> > [...]
> > > Ah, I see what you mean. The one issue I have with your solution is
> > > that the ordering of the stack crawls is less ideal in the "dump all"
> > > case when cpu != this_cpu. We really want to see the stack crawl of
> > > the locked up CPU first and _then_ see the stack crawls of other CPUs.
> > > With your solution the locked up CPU will be interspersed with all the
> > > others and will be harder to find in the output (you've got to match
> > > it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> > > While that's probably not a huge deal, it's nicer to make the output
> > > easy to understand for someone trying to parse it...
> >
> > Is it worth wasting memory for this arguably nicer output? Identifying
> > the stack of the locked up CPU is trivial.
> 
> I guess it's debatable, but as someone who has spent time trawling
> through reports generated like this, I'd say "yes", it's super
> helpful in understanding the problem to have the hung CPU first.
> Putting the memory usage in perspective:

nmi_trigger_cpumask_backtrace() has its own copy of the cpu mask.
What about changing the @exclude_self parameter to @exclude_cpu
and doing:

	if (exclude_cpu >= 0)
		cpumask_clear_cpu(exclude_cpu, to_cpumask(backtrace_mask));


It would also require changing arch_trigger_cpumask_backtrace() to

	void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
				    int exclude_cpu);

but it looks doable.


It might theoretically change the behavior because
nmi_trigger_cpumask_backtrace() does

	this_cpu = get_cpu();

I guess that it was used to make sure that smp_processor_id() stays
the same inside the function. But at that point it is already too late:
the function should exclude the caller's CPU as it was when the caller
made the decision. So, this might actually fix a possible race.

Note that get_cpu() also disables preemption. IMHO, it is not strictly
needed, but it might make sense to keep it: the backtraces are printed
when something goes wrong, and they should reflect the current state
of the system.

Best Regards,
Petr
