Message-ID: <aCNIxWaPVHywfek2@pathway.suse.cz>
Date: Tue, 13 May 2025 15:27:33 +0200
From: Petr Mladek <pmladek@...e.com>
To: Lance Yang <lance.yang@...ux.dev>
Cc: Feng Tang <feng.tang@...ux.alibaba.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Steven Rostedt <rostedt@...dmis.org>, linux-kernel@...r.kernel.org,
mhiramat@...nel.org, llong@...hat.com,
"Paul E. McKenney" <paulmck@...nel.org>,
John Ogness <john.ogness@...utronix.de>,
Sergey Senozhatsky <senozhatsky@...omium.org>,
Tomasz Figa <tfiga@...omium.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Mel Gorman <mgorman@...e.de>,
Thomas Gleixner <tglx@...utronix.de>,
Michal Hocko <mhocko@...e.com>, Tejun Heo <tj@...nel.org>,
Douglas Anderson <dianders@...omium.org>
Subject: Re: [PATCH v1 0/3] generalize panic_print's dump function to be used
by other kernel parts

On Mon 2025-05-12 16:23:30, Lance Yang wrote:
>
>
> On 2025/5/12 11:14, Feng Tang wrote:
> > Hi Andrew,
> >
> > Thanks for the review!
> >
> > On Sun, May 11, 2025 at 06:46:17PM -0700, Andrew Morton wrote:
> > > On Sun, 11 May 2025 16:52:51 +0800 Feng Tang <feng.tang@...ux.alibaba.com> wrote:
> > >
> > > > When working on kernel stability issues, panic, task-hung and
> > > > software/hardware lockup are frequently met. And to debug them, user
> > > > may need lots of system information at that time, like task call stacks,
> > > > lock info, memory info etc.
> > > >
> > > > panic case already has panic_print_sys_info() for this purpose, and has
> > > > a 'panic_print' bitmask to control what kinds of information is needed,
> > > > which is also helpful to debug other task-hung and lockup cases.
> > > >
> > > > So this patchset extracts the function and makes it usable for other
> > > > cases which also need system info for debugging.
> > > >
> > > > Locally these have been used in our bug chasing for stability issues
> > > > and were helpful.
> > >
> > > Truth. Our responses to panics, oopses, WARNs, BUGs, OOMs etc seem
> > > quite poorly organized. Some effort to clean up (and document!) all of
> > > this sounds good.
> > >
> > > My vote is to permit the display of every scrap of information we can
> > > think of in all situations. And then to permit users to select which of
> > > that information is to be displayed under each situation.
>
> Completely agreed. The tricky part is making a global knob that works for
> all situations without breaking userspace, but it's a better system-wide
> approach ;)
>
> >
> > Good point! Maybe one future todo is to add a global system info dump
> > function with ONE global knob for selecting different kinds of information,
> > which could be embedded into some cases you mentioned above.
>
> IMHO, for features with their own knobs, we need:
> a) The global knob (if enabled) turns on all related feature-level knobs,
> b) while still allowing users to manually override individual knobs.
>
> Something like:
>
> If SYS_PRINT_ALL_CPU_BT (global knob) is on, it automatically enables
> hung_task_all_cpu_backtrace for the hung-task situation. But users can
> still disable it via hung_task_all_cpu_backtrace.
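
The override semantics proposed in the quoted text can be sketched in
userspace C as follows. This is only an illustration, not kernel code:
the names knob_state and knob_effective() are hypothetical, and it
assumes the feature-level knob can record "never set by the user",
which a plain 0/1 value cannot express.

```c
#include <stdbool.h>

/* Hypothetical tri-state for a feature-level knob; not a kernel API. */
enum knob_state {
	KNOB_UNSET = -1,	/* user never wrote the feature knob */
	KNOB_OFF = 0,		/* user explicitly disabled it */
	KNOB_ON = 1,		/* user explicitly enabled it */
};

/*
 * Resolve the effective setting: an explicit feature-level value
 * always wins; otherwise the global knob provides the default.
 */
static bool knob_effective(bool global_on, enum knob_state feature)
{
	if (feature != KNOB_UNSET)
		return feature == KNOB_ON;
	return global_on;
}
```

With the global knob on, knob_effective(true, KNOB_UNSET) is true while
knob_effective(true, KNOB_OFF) stays false, which is the "enabled
globally, disabled for one feature" case from the quote.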

I am all for unifying the options for printing debug information
in various emergency situations. I am just not sure whether we really
want to do the same in all situations.

Some lockup detectors try to be more clever, for example:

  + The RCU stall detector prints backtraces only from CPUs which are
    involved in the stall, see print_other_cpu_stall().

  + The workqueue watchdog shows backtraces from tasks which are
    preventing forward progress, see show_cpu_pool_hog().
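
The idea of a targeted dump can be sketched in userspace C as below.
This is not the code in print_other_cpu_stall() or show_cpu_pool_hog();
the names dump_cpu_sketch() and dump_involved_cpus() and the 64-CPU
limit are assumptions made only for this illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS_SKETCH 64	/* hypothetical limit for this sketch */

/* Stand-in for a per-CPU backtrace; a real detector would trigger one. */
static void dump_cpu_sketch(unsigned int cpu)
{
	printf("backtrace requested for CPU %u\n", cpu);
}

/*
 * Dump only the CPUs whose bit is set in @involved and return how
 * many were dumped. CPUs that are not part of the problem are skipped.
 */
static unsigned int dump_involved_cpus(uint64_t involved)
{
	unsigned int cpu, dumped = 0;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (involved & (UINT64_C(1) << cpu)) {
			dump_cpu_sketch(cpu);
			dumped++;
		}
	}
	return dumped;
}
```

A generic "dump everything" helper would amount to calling this with
all bits set, which is exactly the extra output the targeted detectors
avoid.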

And the situations differ. Stalls are about scheduling (disabled
preemption, disabled IRQs, deadlocks, overly long uninterruptible
sleep). OOM is about memory usage. An Oops is about an invalid memory
access. WARN()s are completely random stuff.

Also I am afraid of printing too much information when the system
is supposed to continue running. It would make sense to print it in
nbcon_cpu_emergency_enter()/exit() context, which disables
preemption, and that might cause softlockups on its own.
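
A rough way to see the softlockup risk is to estimate how long a slow
console needs to emit a dump. The sketch below is userspace arithmetic
only. It assumes a serial console with 8N1 framing (10 bits on the wire
per byte), assumes the whole dump is flushed synchronously with
preemption disabled, and uses hypothetical function names.

```c
#include <stdint.h>

/*
 * Estimate how many whole seconds a serial console needs to emit
 * @bytes of output. With 8N1 framing each byte costs 10 bits on the
 * wire (start + 8 data + stop).
 */
static uint64_t console_flush_secs(uint64_t bytes, uint64_t baud)
{
	uint64_t bytes_per_sec = baud / 10;

	if (bytes_per_sec == 0)
		return UINT64_MAX;	/* unusable rate: never finishes */
	return (bytes + bytes_per_sec - 1) / bytes_per_sec;	/* round up */
}

/* True when the estimated flush time reaches @threshold_secs. */
static int dump_exceeds_threshold(uint64_t bytes, uint64_t baud,
				  uint64_t threshold_secs)
{
	return console_flush_secs(bytes, baud) >= threshold_secs;
}
```

Under these assumptions, 256 KiB of output at 115200 baud takes about
23 seconds, which is already longer than a 20-second softlockup
threshold, while 64 KiB takes about 6 seconds.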

Finally, I wonder whether ftrace_dump() might cause a livelock when ftrace
is adding new messages in parallel.

The situation is much easier during panic() because the system is
going to die() anyway, non-panic CPUs are stopped, ...

That said, I could understand that people might want to see as much
information as possible when the console is fast and the range of
possible problems is big.

Anyway, I have added a few more people to Cc who are interested in
the various watchdogs.

And there is a parallel initiative which tries to unify the loglevel or
somehow make the filtering easier, see
https://lore.kernel.org/r/20250424070436.2380215-1-senozhatsky@chromium.org

Best Regards,
Petr