linux-kernel - Re: [GIT PULL] printk for 6.11


Message-ID: <CAHk-=whU_woFnFN-3Jv2hNCmwLg_fkrT42AWwxm-=Ha5BmNX4w@mail.gmail.com>
Date: Tue, 23 Jul 2024 11:04:58 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Petr Mladek <pmladek@...e.com>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>, Steven Rostedt <rostedt@...dmis.org>, 
	John Ogness <john.ogness@...utronix.de>, 
	Andy Shevchenko <andriy.shevchenko@...ux.intel.com>, 
	Rasmus Villemoes <linux@...musvillemoes.dk>, 
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Thomas Gleixner <tglx@...utronix.de>, Jan Kara <jack@...e.cz>, 
	Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Subject: Re: [GIT PULL] printk for 6.11

On Tue, 23 Jul 2024 at 07:38, Petr Mladek <pmladek@...e.com> wrote:
>
>   - In an emergency section, directly from nbcon_cpu_emergency_exit()
>     or nbcon_cpu_emergency_flush(). It allows seeing the messages
>     when the system is in an unexpected state and might not be
>     able to continue properly.
>
>     The messages are flushed at the end of the emergency section
>     to allow storing the full log (backtrace) first.

What? No.

One of the historically problematic situations is when a recursive
oops or a deadlock occurs *during* the first oops.

The "recursive oops" may be simple to sort out by forcing a flush at
that point, in that hopefully the machine is "alive", but what about
random deadlocks or other situations where the printk machinery simply
is never ever entered again?

And we most definitely have had exactly that happen due to the call
trace code etc.

At that point, it's ok if the machine is dead (this is obviously a
very catastrophic situation - nobody should worry about how to
continue), but it's really important that the first problem report
makes it out.

The whole notion of "to allow storing the full log (backtrace) first"
is completely crazy. It's entirely secondary whether you have a full
log or not, when the primary goal MUST BE that you have any output at
all!

How can this have _continued_ to be unclear, when it was my one hard
requirement for this whole thing from day one? My *ONE* requirement
has always been that the printk code ALWAYS does its absolute best to
print out problem reports.

Because when an oops happens, all other rules go out the window.

We no longer care about "pretty printouts", and we should strive
to always try to just get at least *some* basic print out. The kernel
is known to not be in a great state, and maybe the printout will fail
due to where the problem happened, but the kernel NEEDS TO TRY.

           Linus
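
Illustration (not part of the original message): the difference between
flushing only at the end of an emergency section and trying to print each
line as soon as it is stored can be sketched with a toy userspace model.
Every name below (toy_log, toy_flush, toy_emergency_report, the flush_policy
values) is hypothetical and does not exist in the kernel. The model also
assumes that a flush always succeeds, which is not guaranteed on a real
machine that is already in trouble.

```c
#include <stddef.h>

/* Toy userspace model. None of these names exist in the kernel. */

enum flush_policy {
	FLUSH_AT_SECTION_END,	/* store the whole report, flush once at exit */
	FLUSH_EACH_MESSAGE,	/* try to emit every line right after storing it */
};

struct toy_log {
	size_t stored;		/* lines written to the log buffer */
	size_t emitted;		/* lines that actually reached the console */
};

/* Assumption: a flush always succeeds and emits everything stored so far. */
static void toy_flush(struct toy_log *log)
{
	log->emitted = log->stored;
}

/*
 * Simulate an emergency section that wants to print `total` lines.
 * The machine "dies" (deadlock, recursive fault, ...) once `dies_after`
 * lines have been stored, and the print path is never entered again.
 * Pass dies_after >= total to model a section that completes normally.
 *
 * Returns the number of lines that reached the console.
 */
size_t toy_emergency_report(enum flush_policy policy, size_t total,
			    size_t dies_after)
{
	struct toy_log log = { 0, 0 };

	for (size_t i = 0; i < total; i++) {
		if (i == dies_after)
			return log.emitted;	/* dead: nothing more comes out */

		log.stored++;
		if (policy == FLUSH_EACH_MESSAGE)
			toy_flush(&log);
	}

	toy_flush(&log);	/* normal end of the emergency section */
	return log.emitted;
}
```

With a 10-line report and a hang after 3 stored lines,
toy_emergency_report(FLUSH_AT_SECTION_END, 10, 3) returns 0, while
toy_emergency_report(FLUSH_EACH_MESSAGE, 10, 3) returns 3. When the section
completes normally, both policies emit all 10 lines. The deferred policy
only loses output in exactly the case where the output matters most.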