lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87v8t5l39z.fsf@jogness.linutronix.de>
Date:   Mon, 13 Jun 2022 04:18:56 +0206
From:   John Ogness <john.ogness@...utronix.de>
To:     paulmck@...nel.org
Cc:     linux-kernel@...r.kernel.org, frederic@...nel.org,
        pmladek@...e.com, Sergey Senozhatsky <senozhatsky@...omium.org>,
        Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [BUG] 8e274732115f ("printk: extend console_lock for
 per-console locking")

On 2022-06-10, "Paul E. McKenney" <paulmck@...nel.org> wrote:
> I have started getting rcutorture shutdown-time hangs when running
> against recent mainline, and bisected back to the 8e274732115f ("printk:
> extend console_lock for per-console locking") commit.  These hangs go
> away (or at least their probability drops dramatically) if I build with
> CONFIG_PREEMPTION=n -and- CONFIG_NO_HZ=y (not n!), at least assuming
> that I also boot with "nohz_full=0-N".
>
> Attempts to debug using rcutorture's "--gdb" argument result in
> "[Inferior 1 (process 1) exited normally]", but with the same truncated
> console-log output as when running without "--gdb".  This suggests
> that the kernel really did run to completion and halt as expected,
> but that the shutdown-time printk() output was lost.  Furthermore, if I
> use the gdb "hbreak" command to set a breakpoint at kernel_power_off(),
> it really does hit that breakpoint.  This would not happen in the case
> of a kernel hang.
>
> So, given that I can hit that breakpoint, what should I ask gdb to
> show me?

If you also compile with CONFIG_GDB_SCRIPTS=y then you can use the
defined "lx-dmesg" gdb command to see if the messages are in the
ringbuffer.

(You may need to add:

add-auto-load-safe-path /path/to/linux/scripts/gdb/vmlinux-gdb.py

to your $HOME/.gdbinit)

But since you are hitting the breakpoint, the messages will be there.

> Alternatively, this reproduces on a variety of x86 platforms, so you
> should be able reproduce it as follows [1]:
>
> 	git checkout v5.19-rc1
> 	tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs "TREE01" --gdb --kconfig "CONFIG_DEBUG_INFO_NONE=n CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y" --trust-make
>
> This builds a kernel, boots it, and then tells you how to launch gdb
> (presumably in some other window).  Once you give launch gdb and give
> it the suggested commands, the kernel runs for two minutes under qemu,
> then shuts down.  I used the following gdb commands to set the breakpoint
> and run the kernel:
>
> 	target remote :1234  # suggested by the rcutorture script
> 	hbreak kernel_power_off  # added by me
> 	continue  # suggested by the rcutorture script

Thanks. Nice helper scripts. With this I could easily reproduce the
issue.

As I suspected, the final printk's cannot direct print because the
kthread was printing. Using the below patch did seem to address your
problem. But this is probably not the way forward.

What I have not figured out is why this problem does not exist when only
the kthread patch (but not the "extend console_lock" patch) is
applied. Somehow the console_lock is magically providing some sort of
synchronization in the end. Or maybe it is just the increased lock
contention that helps out.

It seems we need some sort of console_flush_on_panic() for non-panic
situations that is not as violent as console_flush_on_panic().

@Petr, does it make sense to add the below patch to mainline? It is only
marginally helpful because it performs the direct printing in the wrong
context. Really we need that final pr_emerg("Power down\n") to be a
successful direct print so that the buffer is fully flushed before
hitting machine_power_off(). In fact, the below patch could make things
worse because the printing kthread could take over printing from that
final pr_emerg().

John Ogness

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ea3dd55709e7..5950586008fa 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3841,6 +3841,18 @@ static int printk_kthread_func(void *data)
 		console_kthread_printing_exit();
 
 		mutex_unlock(&con->lock);
+
+		/*
+		 * The kernel may have transitioned to a direct printing
+		 * state, but any printk calls may not have direct printed
+		 * because this thread was printing its message. Grab and
+		 * release the console lock to flush out any pending
+		 * messages on all consoles.
+		 */
+		if (allow_direct_printing()) {
+			console_lock();
+			console_unlock();
+		}
 	}
 
 	con_printk(KERN_INFO, con, "printing thread stopped\n");

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ