Message-ID: <ZshiIdUFQs4CKW2t@pathway.suse.cz>
Date: Fri, 23 Aug 2024 12:19:13 +0200
From: Petr Mladek <pmladek@...e.com>
To: Derek Barbosa <debarbos@...hat.com>
Cc: pmaldek@...e.com, williams@...hat.com, john.ogness@...utronix.de,
	tglx@...utronix.de, linux-rt-users@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: test 1: was: Re: A Comparison of printk between upstream and
 linux-rt-devel

On Thu 2024-08-22 12:32:15, Derek Barbosa wrote:
> Hi,
> 
> TLDR: plain, vanilla 6.11.0-0.rc3 is slower on flush and 
> does not print traces in panic/crash context consistently.
> 
> 
> The purpose of this email is to share some findings with regards to the latest
> available printk changes, in comparison to what is currently available in the
> "mainline" upstream torvalds tree.
> 
> Specifically, there was concern regarding flushing, flushing speed, and ensuring
> that viable information can be displayed to the user in critical context. This
> email also assumes that [0] (and the rest of the thread) has been previously read.
> 
> Moving on, I've been testing the printk code present in the linux-rt-devel tree
> for some time, and have been honing in on comparing behaviors/interactions
> between a stock, regular kernel and the linux-rt-devel tree. 
> 
> The kernels in question are the following:
> 
> 1. a stock torvalds kernel, 6.11.0-0.rc3 
> 2. a linux-rt-devel kernel, 6.11.0-0.rc3-rt2, which has the "newer" printk code
> 
> As a note, 6.11.0-0.rc3-rt2 DOES NOT HAVE CONFIG_PREEMPT_RT ENABLED.
> 
> I will refer to these kernels as "new printk" vs "stock printk".
> 
> I've also attached the configs for these kernels.

Could you please also share the kernel command line? I can't find it
anywhere.

I am especially interested in whether the system:

  + was asked to show backtraces on all CPUs via the "panic_print" parameter,
  + did a crash dump or a reboot,
  + also used another (graphics) console.
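
For illustration only, here is a minimal sketch of how the relevant bits
could be pulled out of a saved command line. It is not part of the original
test setup. The helper name summarize_cmdline() is hypothetical. The
parameters it looks for (panic_print=, crashkernel=, panic=, console=) are
standard kernel boot parameters, and the panic_print bitmask is reported raw
rather than decoded.

```python
def summarize_cmdline(cmdline: str) -> dict:
    """Collect the boot parameters relevant to the questions above.

    Hypothetical helper: it only splits the command line on whitespace and
    does not handle quoted values containing spaces.
    """
    params = cmdline.split()

    def values(key):
        # All values given for "key=...", in command-line order.
        return [p.split("=", 1)[1] for p in params if p.startswith(key + "=")]

    panic_print = values("panic_print")
    return {
        # Bitmask; base 0 accepts both decimal and 0x-prefixed values.
        "panic_print": int(panic_print[-1], 0) if panic_print else None,
        # A crashkernel= reservation suggests a crash dump was configured.
        "crashkernel": values("crashkernel"),
        # panic=<seconds> controls whether and when the system reboots.
        "panic": values("panic"),
        # Every console= entry, e.g. a serial and a graphics console.
        "consoles": values("console"),
    }

# Typical use on the test machine:
#   with open("/proc/cmdline") as f:
#       print(summarize_cmdline(f.read()))
```

More than one entry under "consoles" would mean that messages were also
flushed to a second console, which matters for the panic path discussed
further below.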

> --- Test 1: John Ogness' Console Blast. ---
> 
> This test uses a script which calls itself to create a pinned process for each CPU. Those
> child processes will run in infinite loops of show-task-states via
> /proc/sysrq-trigger. This generates lots of contention on the console. After
> some time, we use the sysrq-trigger to crash the machine. 
> 
> The success condition would be to be able to view the full crash backtrace via
> the serial console. 
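
For readers without the script, the test described above can be sketched
roughly as follows. This is NOT John's original script, only an
approximation under stated assumptions: the names blast() and run() are
hypothetical, the loops are bounded instead of infinite, and the crash
happens after the children finish rather than while they are still
printing. Writing 't' to /proc/sysrq-trigger requests show-task-states and
writing 'c' crashes the machine; both need root.

```python
import os

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # writing here requires root


def blast(cpu, iterations, trigger_path=SYSRQ_TRIGGER):
    """Pin the calling process to one CPU and request show-task-states."""
    os.sched_setaffinity(0, {cpu})
    for _ in range(iterations):
        with open(trigger_path, "w") as f:
            f.write("t")  # sysrq 't': dump task states to the console


def run(iterations, trigger_path=SYSRQ_TRIGGER, crash=False):
    """Fork one pinned child per allowed CPU, wait, then optionally crash.

    Returns the number of children that were started.
    """
    cpus = sorted(os.sched_getaffinity(0))
    children = []
    for cpu in cpus:
        pid = os.fork()
        if pid == 0:
            # Child: generate console load, then exit without cleanup.
            try:
                blast(cpu, iterations, trigger_path)
            finally:
                os._exit(0)
        children.append(pid)
    for pid in children:
        os.waitpid(pid, 0)
    if crash:
        with open(trigger_path, "w") as f:
            f.write("c")  # sysrq 'c': deliberately crash the kernel
    return len(cpus)
```

Passing a scratch file as trigger_path allows a dry run. With the default
path and crash=True the machine really goes down, so this should only be run
on a dedicated test box.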
> 
> For each of the kernels, 10 back-to-back trials were performed. 
> 
> In the 6.11.0-0.rc3 stock kernel, we did *not* observe a trace on crash. There were various
> other traces scattered/nested throughout the show-task-state noise, but no full
> crash backtrace. At times, there were upwards of 13k dropped messages.

Is the backtrace missing from the panic CPU, from the non-panic CPUs,
or from both?

The dump of the backtraces on non-panic-CPUs might have been affected
by the regression fixed earlier this week via
https://lore.kernel.org/r/20240812072703.339690-1-takakura@valinux.co.jp

Did the system reboot in the end?
Or did it get stuck somewhere?

> In the 6.11.0-0.rc3-rt2 "new printk" kernel, we observed the success condition on each run. At
> the "end" of the test (the crash), the full call trace was visible and presented
> to us via the serial console.

I guess that the problem is not the regression on the non-panic CPUs
because v6.11-rc3-rt2 in rt/linux-rt-devel.git seems to have the same
regression.

It is great to see that the serial console driver transformed into
the new nbcon console is so reliable.

Still, it is strange that the stock kernel is so bad in this test.
console_flush_on_panic() ignores both console_lock and port->lock.
There should be a good chance to see the messages. It might break
"only" when the console driver has been stopped on a non-panic
CPU in a state which would prevent the panic CPU from using the
driver even when the locks are ignored. Well, the chance of
a breakage is likely bigger when the messages are also flushed
to the graphics console.

Anyway, thanks a lot for the testing and sharing the results.

Best Regards,
Petr

PS: I still have to think about the other results. But they seem to
    be less surprising. I am most curious about such bad behavior
    of the stock kernel in the first test. I hope that we did not
    break something in the patch handling the legacy consoles.
