linux-kernel - Re: [PATCH v4] panic: add an option to replay all the printk message in buffer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190426141426.h7hpvhr3rqp7umbk@pathway.suse.cz>
Date:   Fri, 26 Apr 2019 16:14:26 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     Sergey Senozhatsky <sergey.senozhatsky@...il.com>
Cc:     Feng Tang <feng.tang@...el.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
        linux-kernel@...r.kernel.org,
        Aaro Koskinen <aaro.koskinen@...ia.com>,
        Kees Cook <keescook@...omium.org>, Borislav Petkov <bp@...e.de>
Subject: Re: [PATCH v4] panic: add an option to replay all the printk message
 in buffer

On Fri 2019-04-26 22:53:16, Sergey Senozhatsky wrote:
> On (04/26/19 09:49), Petr Mladek wrote:
> > On Thu 2019-04-25 21:32:17, Feng Tang wrote:
> > > Currently on panic, kernel will lower the loglevel and print out
> > > pending printk msg only with console_flush_on_panic().
> > > 
> > > Add an option for users to configure the "panic_print" to replay
> > > all dmesg in buffer, some of which they may have never seen due
> > > to the loglevel setting, which will help panic debugging .
> > > 
> > > @@ -2539,6 +2540,11 @@ void console_flush_on_panic(void)
> > >  	 */
> > >  	console_trylock();
> > >  	console_may_schedule = 0;
> > > +
> > > +	if (mode == CONSOLE_REPLAY_ALL) {
> > > +		console_seq = log_first_seq;
> > > +		console_idx = log_first_idx;
> > 
> > Ah, log_first_seq and log_first_idx are synchronized by
> > logbuf_log.
> > 
> > console_flush_on_panic(CONSOLE_REPLAY_ALL) is called
> > when only one CPU is running but it is not guaranteed.
> > 
> > Therefore we should use:
> > 
> > 	if (mode == CONSOLE_REPLAY_ALL) {
> > 		unsigned long flags;
> > 
> > 		logbuf_lock_irqsave(flags);
> > 		console_seq = log_first_seq;
> > 		console_idx = log_first_idx;
> > 		logbuf_unlock_irqrestore(flags);
> > 	}
> 
> I thought about it, and I don't think I see how we can race with
> anything here.
> 
> Suppose we have panic on CPUA and cactive CPUB in console_unlock():
> 
> - if it's not in atomic context, then the moment it does
> 
> 	call_console_drivers();
> 	printk_safe_exit_irqrestore(flags);  << IPI
> 
>   IPI will take it down.
> 
> - If IPI doesn't take it down, then NMI will.

Then I wonder why, for example, native_stop_other_cpus() waits
10ms at maximum after sending the NMIs. What is the state
of the CPUs that miss this deadline?


> - But, more importantly, if that CPUB is in atomic context, then panic
>   CPUA will spin, waiting for that CPUB to handoff printing, before
>   panic CPU will even try to stop all CPUs.
> 
> 	pr_emerg("Kernel panic - not syncing: %s\n", buf)
> 
>   is the point of 'synchronization' - panic CPU will wait for
>   current console owner.

"Synchronization point" is too strong formulation.
The console waiter logic is effective but it does not always
work. The current console owner must be calling the console
drivers.


>   Hmm, we might have a bit of a problem here, maybe.

Hmm, the printk() might wait forever when NMI stopped
the current console owner in the console driver code
or with the logbuf_lock taken.

The console waiter logic might get solved by clearing
the console_owner in console_flush_on_panic(). It can't
be much worse, we already ignore console_lock() there, ...

The best solution for the logbuf_lock scenario would be
the lock less log buffer.


Anyway, do we really need to have length discussion about
whether the locks are needed? They will not break anything.
And we could ignore one scenario in case of problems.
Also we will not need to think about it when the
call is moved anywhere else.

Finally, I could imagine that people would want to replay
the log also from other locations when debugging nasty bugs.

Best Regards,
Petr