Message-ID: <c33acfb8-0078-46e2-b3a3-f753909749c9@paulmck-laptop>
Date: Tue, 13 May 2025 10:09:51 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Petr Mladek <pmladek@...e.com>
Cc: Lance Yang <lance.yang@...ux.dev>,
	Feng Tang <feng.tang@...ux.alibaba.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Steven Rostedt <rostedt@...dmis.org>, linux-kernel@...r.kernel.org,
	mhiramat@...nel.org, llong@...hat.com,
	John Ogness <john.ogness@...utronix.de>,
	Sergey Senozhatsky <senozhatsky@...omium.org>,
	Tomasz Figa <tfiga@...omium.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Thomas Gleixner <tglx@...utronix.de>,
	Michal Hocko <mhocko@...e.com>, Tejun Heo <tj@...nel.org>,
	Douglas Anderson <dianders@...omium.org>
Subject: Re: [PATCH v1 0/3] generalize panic_print's dump function to be used
 by other kernel parts

On Tue, May 13, 2025 at 03:27:33PM +0200, Petr Mladek wrote:
> On Mon 2025-05-12 16:23:30, Lance Yang wrote:
> > 
> > 
> > On 2025/5/12 11:14, Feng Tang wrote:
> > > Hi Andrew,
> > > 
> > > Thanks for the review!
> > > 
> > > On Sun, May 11, 2025 at 06:46:17PM -0700, Andrew Morton wrote:
> > > > On Sun, 11 May 2025 16:52:51 +0800 Feng Tang <feng.tang@...ux.alibaba.com> wrote:
> > > > 
> > > > > > When working on kernel stability issues, panics, hung tasks, and
> > > > > > software/hardware lockups are frequently encountered. To debug them, users
> > > > > > may need a lot of system information from that moment, such as task call
> > > > > > stacks, lock info, memory info, etc.
> > > > > 
> > > > > > The panic case already has panic_print_sys_info() for this purpose, and has
> > > > > > a 'panic_print' bitmask to control which kinds of information are printed.
> > > > > > The same information is also helpful for debugging hung-task and lockup cases.
> > > > > 
> > > > > > So this patchset extracts the function and makes it usable for other
> > > > > > cases which also need system info for debugging.
> > > > > 
> > > > > > Locally, these have been used in our bug chasing for stability issues
> > > > > > and were helpful.
> > > > 
> > > > Truth.  Our responses to panics, oopses, WARNs, BUGs, OOMs etc seem
> > > > quite poorly organized.  Some effort to clean up (and document!) all of
> > > > this sounds good.
> > > > 
> > > > My vote is to permit the display of every scrap of information we can
> > > > think of in all situations.  And then to permit users to select which of
> > > > that information is to be displayed under each situation.
> > 
> > Completely agreed. The tricky part is making a global knob that works for
> > all situations without breaking userspace, but it's a better system-wide
> > approach ;)
> > 
> > > 
> > > > Good point! Maybe one future todo is to add a global system info dump
> > > function with ONE global knob for selecting different kinds of information,
> > > which could be embedded into some cases you mentioned above.
> > 
> > IMHO, for features with their own knobs, we need:
> > a) The global knob (if enabled) turns on all related feature-level knobs,
> > b) while still allowing users to manually override individual knobs.
> > 
> > Something like:
> > 
> > If SYS_PRINT_ALL_CPU_BT (global knob) is on, it enables
> > hung_task_all_cpu_backtrace
> > for hung-task situation automatically. But users can still disable it via
> > hung_task_all_cpu_backtrace.
> 
> I am all for unifying the options for printing debug information
> in various emergency situations. I am just not sure whether we really
> want to do the same in all situations.
> 
> Some lockup detectors try to be more clever, for example:
> 
>   + RCU stall detector prints backtraces only from CPUs which are
>     involved in the stall, see print_other_cpu_stall().
> 
>   + Workqueues watchdog shows backtraces from tasks which are
>     preventing forward progress, see show_cpu_pool_hog().
> 
> And stalls are about scheduling (disabled preemption, disabled IRQ,
> deadlocks, too long uninterruptible sleep). OOM is about memory
> usage. Oops is about an invalid memory access. WARNs() are
> completely random stuff.
> 
> Also I am afraid of printing too much information when the system
> is supposed to continue running. It would make sense to print it in
> nbcon_cpu_emergency_enter()/exit() context which disables
> preemption. And it might cause softlockups on its own.

And we did do some of the cleverness that Petr points out because of
problems caused by flooding the console log.  We first ran into this
sort of thing on embedded systems with slow serial consoles (where 115K
baud is now way slow), but it also shows up in other environments, for
example, those committing large numbers of console logs to stable storage,
multiplexing large numbers of logs across networks that sometimes get
congested, and so on.

So I second the call for individual knobs, either in addition to or
instead of the global knob.
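
For illustration only, the "global knob plus per-feature override"
semantics that Lance describes above could be modeled in plain userspace
C roughly as follows.  The names knob_state and knob_effective() are
made up for this sketch and are not existing kernel interfaces, and the
tri-state "unset" value is an assumption about how an explicit user
choice would be told apart from the default.

```c
#include <stdbool.h>

/* Hypothetical tri-state for a per-feature knob; not an existing kernel type. */
enum knob_state {
	KNOB_UNSET = -1,	/* user never touched the per-feature knob */
	KNOB_OFF   = 0,		/* user explicitly disabled the feature */
	KNOB_ON    = 1,		/* user explicitly enabled the feature */
};

/*
 * Resolve the effective setting: an explicit per-feature value always
 * wins, otherwise the global knob supplies the default.
 */
static bool knob_effective(bool global_on, enum knob_state feature)
{
	if (feature != KNOB_UNSET)
		return feature == KNOB_ON;
	return global_on;
}
```

With this rule, a per-feature knob left unset follows the global knob,
while one explicitly set to off stays off even when the global knob is
on, which is the override behavior asked for in (b) above.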

> Finally, I wonder whether ftrace_dump() might cause a livelock when ftrace
> is adding new messages in parallel.

It definitely can cause problems, and me learning this the hard way is
why rcutorture calls tracing_off() before calling ftrace_dump().
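
As a toy illustration of why that ordering matters (a userspace model,
not the real ftrace ring buffer), consider a dumper that prints one
entry per step while a writer may add one entry per step.  The names
toy_trace and toy_dump() are hypothetical, and the one-for-one writer
rate is an assumption chosen to make the effect obvious.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a trace buffer; only the count of pending entries matters. */
struct toy_trace {
	size_t pending;		/* entries not yet dumped */
	bool tracing_on;	/* writer adds entries only while this is true */
};

/*
 * Dump entries one at a time.  While tracing is on, the hypothetical
 * writer adds one new entry for each entry dumped, so the backlog never
 * shrinks.  Returns true if the buffer was drained within max_steps.
 */
static bool toy_dump(struct toy_trace *t, size_t max_steps)
{
	for (size_t step = 0; step < max_steps; step++) {
		if (t->pending == 0)
			return true;
		t->pending--;			/* print one entry */
		if (t->tracing_on)
			t->pending++;		/* writer races in a new one */
	}
	return t->pending == 0;
}
```

In this model, clearing tracing_on before calling toy_dump() plays the
role of tracing_off(): the dump then finishes after as many steps as
there are pending entries, whereas with tracing left on it never
catches up.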

> The situation is much easier during panic() because the system is
> going to die() anyway, non-panic CPUs are stopped, ...
> 
> That said, I could understand that people might want to see as much
> information as possible when the console is fast and the range of
> possible problems is big.

No argument here.

							Thanx, Paul

> Anyway, I have added a few more people to Cc who are interested in
> the various watchdogs.
> 
> And there is a parallel initiative which tries to unify the loglevel or
> somehow make the filtering easier, see
> https://lore.kernel.org/r/20250424070436.2380215-1-senozhatsky@chromium.org
> 
> Best Regards,
> Petr
