linux-kernel - Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZQ3R4Lz1LHQYsylw@alley>
Date:   Fri, 22 Sep 2023 19:41:52 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     John Ogness <john.ogness@...utronix.de>
Cc:     Sergey Senozhatsky <senozhatsky@...omium.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console
 atomic flushing

On Wed 2023-09-20 01:14:51, John Ogness wrote:
> From: Thomas Gleixner <tglx@...utronix.de>
> 
> Perform nbcon console atomic flushing at key call sites:
> 
> nbcon_atomic_exit() - When exiting from the outermost atomic
> 	printing section.

IMHO, it would not help because it all depends
whether nbcon_context_try_acquire() succeeds or now, see below.

> 	If the priority was NBCON_PRIO_PANIC,
> 	attempt a second flush allowing unsafe hostile
> 	takeovers.

I was first scared that this would be called by any printk()
in panic(). But it seems that nbcon_atomic_exit() is called
only one after disabling printk(). It might deserve a comment
where it is supposed to be used. See a proposal below.


> console_flush_on_panic() - Called from several call sites to
> 	trigger ringbuffer dumping in an urgent situation.
> 
> console_flush_on_panic() - Typically the panic() function will

This is a second description of console_flush_of_panic() which
looks like a mistake.

> 	take care of atomic flushing the nbcon consoles on
> 	panic. However, there are several users of
> 	console_flush_on_panic() outside of panic().

The generic panic() seems to use console_flush_on_panic() correctly
at the very end.

Hmm, I see that console_flush_on_panic() is called also in
arch/loongarch/kernel/reset.c and arch/powerpc/kernel/traps.c.

The loongarch code uses it in halt(). IMHO, it would be
better to use the normal flush. But maybe they accept the risk.
I know nothing about this architecture and who uses it.

The function defined on powerpc is then used in:

  + arch/powerpc/platforms/powernv/opal.c:

     IMHO, it should be replaced there by normal flush. It seems
     to call the generic panic() later.

  + arch/powerpc/platforms/ps3/setup.c:

      This seems to be used as a panic notifier ppc_panic_platform_handler().
      I am not completely sure but it does not look like the final flush.

  + arch/powerpc/platforms/pseries/setup.c:

      Same as ps3/setup.c. Also "just" a panic notifier.

Anyway, we should make clear that console_flush_on_panic() might break
the system and should be called as the last attempt to flush consoles.
The above arch-specific users are worth review.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -1092,6 +1092,11 @@ void nbcon_atomic_flush_all(void)
>   * Return:	The previous priority that needs to be fed into
>   *		the corresponding nbcon_atomic_exit()
>   * Context:	Any context. Disables migration.
> + *
> + * When within an atomic printing section, no atomic printing occurs. This
> + * is to allow all emergency messages to be dumped into the ringbuffer before
> + * flushing the ringbuffer.

The comment sounds like it is an advantage. But I think that it would be
a disadvantage.

Fortunately, this is not true. Even the atomic context tries
to flush the messages immediately when it is able to get
the per-cpu lock. It happes via console_flush_all() called
from console_unlock().

> + * The actual atomic printing occurs when exiting
> + * the outermost atomic printing section.
>   */
>  enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio)
>  {
> @@ -1130,8 +1135,13 @@ void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
>  {
>  	struct nbcon_cpu_state *cpu_state;
>  
> +	__nbcon_atomic_flush_all(false);

IMHO, this would not help. Either this context was able to acquire the
lock and flush each message directly. Or it would fail to get the lock
even here.

> +
>  	cpu_state = nbcon_get_cpu_state();
>  
> +	if (cpu_state->prio == NBCON_PRIO_PANIC)
> +		__nbcon_atomic_flush_all(true);

It would deserve a comment that nbcon_atomic_exit() is used
in panic() to calm down the consoles completely, similar effect
as setting the variable "suppress_printk".

> +
>  	/*
>  	 * Undo the nesting of nbcon_atomic_enter() at the CPU state
>  	 * priority.
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 6ef33cefa4d0..419c0fabc481 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3159,6 +3159,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
>  		console_srcu_read_unlock(cookie);
>  	}
>  
> +	nbcon_atomic_flush_all();
> +
>  	console_flush_all(false, &next_seq, &handover);

It seems that console_flush_all() tries to flush nbcon conosoles
as well. And nbcon_atomic_flush_all() is explicitely called
close above this flush_on_panic(). This is from panic()
after this patchset is applied.


void panic(const char *fmt, ...)
{
[...]
	nbcon_atomic_flush_all();

	console_unblank();

	/*
	 * We may have ended up stopping the CPU holding the lock (in
	 * smp_send_stop()) while still having some valuable data in the console
	 * buffer.  Try to acquire the lock then release it regardless of the
	 * result.  The release will also print the buffers out.  Locks debug
	 * should be disabled to avoid reporting bad unlock balance when
	 * panic() is not being callled from OOPS.
	 */
	debug_locks_off();
	console_flush_on_panic(CONSOLE_FLUSH_PENDING);


By other words, we try to flush nbcon consoles 3 times almost
immediately after each other. I believe that there is a motivation
to do so. Anyway, I want to make sure that it was on purpose.

It would deserve some comment what is the purpose. Otherwise, people
would tend to remove the "redundant" calls.

>  }
>  
> @@ -3903,6 +3905,10 @@ void defer_console_output(void)
>  
>  void printk_trigger_flush(void)
>  {
> +	migrate_disable();
> +	nbcon_atomic_flush_all();
> +	migrate_enable();

Can this be actually called in NMI?

> +
>  	defer_console_output();
>  }

This function would deserve some description because it looks strange.
There are three names which are contradicting each other:

  + trigger_flush:    it sounds like it tells someone to do the flush

  + nbcon_atomic_flush_all: does the flush immediately

  + defer_console_output: sounds like it prevents the current context
	to flush the consoles immediately

We should either choose better names and/or add comments:

/**
 * console_flush_or_trigger() - Make sure that the consoles will get flushed.
 *
 * Try to flush consoles when possible or queue flushing consoles like
 * in the deferred printk.
 *
 * Context: Can be used in any context (including NMI?)
 */
void printk_flush_or_trigger(void)
{
	/*
	 * Try to flush consoles which do not depend on console_lock()
	 * and support safe atomic_write()
	 */
	migrate_disable();
	nbcon_atomic_flush_all();
	migrate_enable();

	/* Try flushing legacy consoles or queue the deferred handling. */
	if (!in_nmi() && console_trylock())
		console_unlock();
	else
		defer_console_output();
}


All in all. I feel a bit overwhelmed by the many *flush*() functions at
the moment. Some flush only nbcon consoles. Some flush all. Some
are using only the safe takeover/handover and some allow also
harsh takeover. Some ignore port->lock because of bust_spinlock().
Some are ignoring console_lock. They are called on different
locations. The nbcon variants are called explicitly and also
inside the generic *flush*() functions.

It is partly because it is Friday evening. Anyway, I need to think
more about it.

Best Regards,
Petr