linux-kernel - Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush

Message-ID: <ZiEWaA3CeQsccEdj@pathway.suse.cz>
Date: Thu, 18 Apr 2024 14:47:36 +0200
From: Petr Mladek <pmladek@...e.com>
To: John Ogness <john.ogness@...utronix.de>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in
 console_flush_all()

On Thu 2024-04-18 01:11:59, John Ogness wrote:
> On 2024-04-11, Petr Mladek <pmladek@...e.com> wrote:
> > I am trying to make a full picture when and how the nbcon consoles
> > will get flushed. My current understanding and view is the following,
> > starting from the easiest priority:
> >
> >
> >   1. NBCON_PRIO_PANIC messages will be flushed by calling
> >      nbcon_atomic_flush_pending() directly in vprintk_emit()
> >
> >      This will take care of any previously added messages.
> >
> >      Non-panic CPUs are not allowed to add messages anymore
> >      when there is a panic in progress.
> >
> >      [ALL OK]
> 
> OK, because the end of panic will perform unsafe takeovers if necessary.
> 
> >   2. NBCON_PRIO_EMERGENCY messages will be flushed by calling
> >      nbcon_atomic_flush_pending() directly in nbcon_cpu_emergency_exit().
> >
> >      This would cover all previously added messages, including
> >      the ones printed by the code between
> >      nbcon_cpu_emergency_enter()/exit().
> 
> This is best effort. If the console is owned by another context and is
> marked unsafe, nbcon_atomic_flush_pending() will do nothing.
> 
> [ PROBLEM: In this case, who will flush the emergency messages? ]

They should get flushed by the current owner while the system is still
alive. Otherwise, the system is about to panic() and the messages will
be emitted by the final unsafe flush.

IMPORTANT: The optimistic scenario would work only when the current
	   owner really flushes everything. More on this below.
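
To illustrate why the emergency flush is only best effort, here is a
userspace toy model of the acquire rule. It is a sketch, not the kernel
code: struct toy_con, toy_try_acquire() and toy_flush_pending() are
made-up names, and the real nbcon state machine (handover requests, the
final unsafe takeover in panic) is left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy priorities, ordered like the nbcon priorities named in this thread. */
enum toy_prio { TOY_PRIO_NONE, TOY_PRIO_NORMAL, TOY_PRIO_EMERGENCY, TOY_PRIO_PANIC };

/* Hypothetical, simplified console state; not the kernel's nbcon state. */
struct toy_con {
	enum toy_prio owner;	/* TOY_PRIO_NONE when nobody owns the console */
	bool unsafe;		/* current owner is inside an unsafe section */
	unsigned long next_seq;	/* next record the console has to print */
};

/* Try to become the owner without ever forcing an unsafe takeover. */
bool toy_try_acquire(struct toy_con *con, enum toy_prio prio)
{
	assert(prio != TOY_PRIO_NONE);

	if (con->owner == TOY_PRIO_NONE) {
		con->owner = prio;
		return true;
	}
	/* A higher priority may take over, but only from a safe owner. */
	if (prio > con->owner && !con->unsafe) {
		con->owner = prio;
		return true;
	}
	return false;
}

/*
 * Best-effort flush up to stop_seq. Returns the number of records
 * printed, which is 0 when the console could not be acquired.
 */
unsigned long toy_flush_pending(struct toy_con *con, enum toy_prio prio,
				unsigned long stop_seq)
{
	unsigned long printed = 0;

	if (!toy_try_acquire(con, prio))
		return 0;

	while (con->next_seq < stop_seq) {
		con->next_seq++;	/* stands in for emitting one record */
		printed++;
	}
	con->owner = TOY_PRIO_NONE;	/* release */
	return printed;
}
```

With a NORMAL owner inside an unsafe section, an EMERGENCY caller prints
nothing and next_seq does not move. In this model, whoever owns the
console at that point is the only one who can finish the job.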


> >      This won't cover later added messages which might be
> >      a problem. Let's look at this closer. Later added
> >      messages with:
> >
> > 	+ NBCON_PRIO_PANIC will be handled in vprintk_emit()
> > 	  as explained above [OK]
> >
> > 	+ NBCON_PRIO_EMERGENCY() will be handled in the
> > 	  related nbcon_cpu_emergency_exit() as described here.
> > 	  [OK]
> >
> > 	+ NBCON_PRIO_NORMAL will be handled, see below. [?]
> >
> >      [ PROBLEM: later added NBCON_PRIO_NORMAL messages, see below. ]
> 
> Yes, this is also an issue, although the solution may be the same for
> this and the above problem.

This is a bit different. There was an existing console owner in the
above scenario. In this case, the code relies on the printk kthread.
But we need a solution for situations when the kthread is not working,
e.g. early boot.


> >   3. NBCON_PRIO_NORMAL messages will be flushed by:
> >
> >        + the printk kthread when it is available
> >
> >        + the legacy loop via
> >
> > 	 + console_unlock()
> > 	    + console_flush_all()
> > 	      + console nbcon_legacy_emit_next_record() [PROBLEM]
> >
> > PROBLEM: console_flush_all() does not guarantee progress with
> > 	 nbcon consoles as explained above (previous mail).
> 
> Not only this. If there is no kthread available, no printing will occur
> until the _next_ printk(), whenever that is.

> Above we have listed 3 problems:
> 
> - emergency messages will not flush if owned by another context and
>   unsafe
>
> - normal messages will not flush if owned by another context
> 
> - for the above 2 problems, if there is no kthread, nobody will flush
>   the messages

All this boils down to the question of who will flush the "ignored"
messages when the system continues working in "normal" mode.


> My question: Is this really a problem?

IMHO, it is. For example, early boot consoles exist for a reason.
People want to debug early boot problems using printk.
We should not break the reliability too much by introducing kthreads.

Later update: It is basically only about early boot debugging.

	The kthreads should always be created later. And
	we assume that they work, except for the emergency
	and panic context.

	So, this is not a problem as long as the boot consoles
	are using the legacy code paths.

	Well, I guess that they might use the "atomic_write()"
	callback in the future. And then this "bug" might hurt.


> The main idea behind the rework is that printing is deferred. The
> kthreads exist for this. If the kthreads are not available (early boot
> or shutdown) or the kthreads are not reliable enough (emergency
> messages), a best-safe-effort is made to print in the caller
> context. Only the panic situation is designed to force output (unsafely,
> if necessary). Is that not enough?

Simple answer: No, primarily because of the early boot behavior.

Longer answer: I tried to separate all the variants and point out
	a particular problem. The above paragraph mixes everything
	into a "wave this away" proposal.


> > My proposal:
> >
> > 	1. console_flush_all() will flush nbcon consoles only
> > 	   in NBCON_PRIO_NORMAL and when the kthreads are not
> > 	   available.
> >
> > 	   It will make it clear that this is the flusher in
> > 	   this situation.
> 
> This is the current PREEMPT_RT implementation.
> 
> > 	2. Allow to skip nbcon consoles in console_flush_all() when
> > 	   it can't take the context (as suggested in my previous
> > 	   reply).
> >
> > 	   This won't guarantee flushing NORMAL messages added
> > 	   while nbcon_cpu_emergency_exit() calls
> > 	   nbcon_atomic_flush_pending().
> 
> This was the previous version. And I agree that we need to go back to
> that.
> 
> > 	   Solve this problem by introducing[*] nbcon_atomic_flush_all()
> > 	   which would flush even newly added messages and
> > 	   call this in nbcon_cpu_emergency_exit() when the printk
> > 	   kthread does not work. It should bail out when there
> > 	   is a panic in progress.
> >
> > 	   Motivation: It does not matter which "atomic" context
> > 		flushes NORMAL/EMERGENCY messages when
> > 		the printk kthread is not available.
> 
> I do not think that solves the problem. If the console is in an unsafe
> section, nothing can be printed.

IMHO, it solves the problem.

The idea is simple:

  "The current nbcon console owner will be responsible for flushing
   all messages when the printk kthread does not exist."

The proof is more complicated:

   1. Let's put aside panic. We already do the best effort there.

   2. Emergency mode currently violates the rule because
      nbcon_atomic_flush_pending() does not flush messages
      that were added after it was called.

      => FIX: improve nbcon_cpu_emergency_exit() to flush
	      all messages when kthreads are not ready.


   3. Normal mode flushes nbcon consoles via
      nbcon_legacy_emit_next_record() from console_unlock()
      before the kthreads are started.

      It is not reliable when nbcon_try_acquire() fails.
      But it would fail only when there is another user.
      The other owner might be:

	+ panic: will handle everything

	+ emergency: should flush everything [*]

	+ normal: can't happen because of con->device_lock().

=> The only remaining problem is to fix nbcon_atomic_flush_pending()
   to flush everything when the kthreads are not started yet.
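
The difference between "flush what was pending on entry" and "flush
everything" can be sketched with two counters. This is again a
userspace toy with hypothetical names (struct toy_log, toy_flush_all(),
...), not a proposed patch. The "late" field only simulates records
that other CPUs store while the flush is running.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy log: only sequence counters, no message text. */
struct toy_log {
	unsigned long head_seq;	/* seq that the next stored record will get */
	unsigned long con_seq;	/* next record the console has to print */
	unsigned long late;	/* records that arrive while a flush is running */
};

/* Emit one record; pretend that another CPU stores a late record meanwhile. */
void toy_emit_one(struct toy_log *log)
{
	assert(log->con_seq < log->head_seq);
	log->con_seq++;
	if (log->late) {
		log->late--;
		log->head_seq++;
	}
}

/* Model of the "pending" flush: stop at a snapshot taken on entry. */
void toy_flush_pending(struct toy_log *log)
{
	unsigned long stop_seq = log->head_seq;

	while (log->con_seq < stop_seq)
		toy_emit_one(log);
}

/* Model of the rule above: without a kthread, the owner flushes it all. */
void toy_flush_all(struct toy_log *log, bool kthread_running)
{
	toy_flush_pending(log);
	if (kthread_running)
		return;		/* the kthread is expected to print the rest */

	while (log->con_seq < log->head_seq)
		toy_emit_one(log);
}

/* Number of stored records that the console has not printed yet. */
unsigned long toy_backlog(const struct toy_log *log)
{
	return log->head_seq - log->con_seq;
}
```

Starting with three stored records and two late arrivals,
toy_flush_pending() leaves a backlog of two records that nobody is
responsible for, while toy_flush_all() without a kthread leaves none.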


> > 	  [*] Alternatively we could modify nbcon_atomic_flush_pending()
> > 	      to flush even newly added messages when the kthread is
> > 	      not working. But it might create another mess.
> 
> This discussion is about when kthreads are not available. If this is a
> concern, I wonder if maybe in this situation an irq_work should be
> triggered upon release of the console.

Triggering an irq_work would solve the problem as well. It would move
flushing to a context with "normal" priority which is serialized
by con->device_lock(). It works for me.
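
A rough userspace sketch of that variant, only to show the control
flow. The names (struct toy_console, toy_release(),
toy_run_deferred_work()) are hypothetical, and the work_pending flag
merely stands in for a queued irq_work. Real code would use the
irq_work API and take con->device_lock() in the deferred context.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy console; not the kernel's struct console. */
struct toy_console {
	unsigned long head_seq;	/* seq that the next stored record will get */
	unsigned long con_seq;	/* next record the console has to print */
	bool kthread_running;	/* false during early boot in this model */
	bool work_pending;	/* stands in for a queued irq_work */
};

/* Called when a context gives up console ownership. */
void toy_release(struct toy_console *con)
{
	assert(con->con_seq <= con->head_seq);

	/* Without a kthread, leftover records need another flusher. */
	if (!con->kthread_running && con->con_seq < con->head_seq)
		con->work_pending = true;
}

/*
 * Deferred flush in "normal" context. Returns true when work was
 * pending and has been handled.
 */
bool toy_run_deferred_work(struct toy_console *con)
{
	if (!con->work_pending)
		return false;
	con->work_pending = false;

	while (con->con_seq < con->head_seq)
		con->con_seq++;		/* stands in for emitting one record */

	/* Releasing again requeues the work if new records showed up. */
	toy_release(con);
	return true;
}
```

The point of the sketch is that the release path, not the next
printk(), decides whether somebody still has to flush.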

Does this make any sense?

It is possible that you already knew all this. And it is possible
that you did not see it as a problem because there was no plan
to convert boot consoles to the nbcon variant. Maybe it does
not even make sense because boot consoles could not use
kthreads. The only motivation would be code sharing and
simplification of the legacy loop, but this is a far-away dream.

Sigh, all this is so complicated. I wonder how to document
this so that other people do not have to discover these
dependencies again and again. Is it even possible?

Well, I still think that it makes sense to improve
nbcon_cpu_emergency_exit() to fill the potential hole.
And ideally mention all these details somewhere
(commit message, comments, Documentation/...)

Best Regards,
Petr