linux-kernel - Re: [BUG] 8e274732115f ("printk: extend console

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220613042937.GZ1790663@paulmck-ThinkPad-P17-Gen-1>
Date:   Sun, 12 Jun 2022 21:29:37 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     John Ogness <john.ogness@...utronix.de>
Cc:     linux-kernel@...r.kernel.org, frederic@...nel.org,
        pmladek@...e.com, Sergey Senozhatsky <senozhatsky@...omium.org>,
        Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [BUG] 8e274732115f ("printk: extend console_lock for per-console
 locking")

On Mon, Jun 13, 2022 at 04:18:56AM +0206, John Ogness wrote:
> On 2022-06-10, "Paul E. McKenney" <paulmck@...nel.org> wrote:
> > I have started getting rcutorture shutdown-time hangs when running
> > against recent mainline, and bisected back to the 8e274732115f ("printk:
> > extend console_lock for per-console locking") commit.  These hangs go
> > away (or at least their probability drops dramatically) if I build with
> > CONFIG_PREEMPTION=n -and- CONFIG_NO_HZ=y (not n!), at least assuming
> > that I also boot with "nohz_full=0-N".
> >
> > Attempts to debug using rcutorture's "--gdb" argument result in
> > "[Inferior 1 (process 1) exited normally]", but with the same truncated
> > console-log output as when running without "--gdb".  This suggests
> > that the kernel really did run to completion and halt as expected,
> > but that the shutdown-time printk() output was lost.  Furthermore, if I
> > use the gdb "hbreak" command to set a breakpoint at kernel_power_off(),
> > it really does hit that breakpoint.  This would not happen in the case
> > of a kernel hang.
> >
> > So, given that I can hit that breakpoint, what should I ask gdb to
> > show me?
> 
> If you also compile with CONFIG_GDB_SCRIPTS=y then you can use the
> defined "lx-dmesg" gdb command to see if the messages are in the
> ringbuffer.
> 
> (You may need to add:
> 
> add-auto-load-safe-path /path/to/linux/scripts/gdb/vmlinux-gdb.py
> 
> to your $HOME/.gdbinit)
> 
> But since you are hitting the breakpoint, the messages will be there.
> 
> > Alternatively, this reproduces on a variety of x86 platforms, so you
> > should be able reproduce it as follows [1]:
> >
> > 	git checkout v5.19-rc1
> > 	tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs "TREE01" --gdb --kconfig "CONFIG_DEBUG_INFO_NONE=n CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y" --trust-make
> >
> > This builds a kernel, boots it, and then tells you how to launch gdb
> > (presumably in some other window).  Once you give launch gdb and give
> > it the suggested commands, the kernel runs for two minutes under qemu,
> > then shuts down.  I used the following gdb commands to set the breakpoint
> > and run the kernel:
> >
> > 	target remote :1234  # suggested by the rcutorture script
> > 	hbreak kernel_power_off  # added by me
> > 	continue  # suggested by the rcutorture script
> 
> Thanks. Nice helper scripts. With this I could easily reproduce the
> issue.

Glad that they worked nicely for you!

> As I suspected, the final printk's cannot direct print because the
> kthread was printing. Using the below patch did seem to address your
> problem. But this is probably not the way forward.

When I apply it, I still lose output, perhaps due to different timing?
Doing the pr_flush(1000, true) just before the call to kernel_power_off()
has been working quite well thus far, though.

							Thanx, Paul

> What I have not figured out is why this problem does not exist when only
> the kthread patch (but not the "extend console_lock" patch) is
> applied. Somehow the console_lock is magically providing some sort of
> synchronization in the end. Or maybe it is just the increased lock
> contention that helps out.
> 
> It seems we need some sort of console_flush_on_panic() for non-panic
> situations that is not as violent as console_flush_on_panic().
> 
> @Petr, does it make sense to add the below patch to mainline? It is only
> marginally helpful because it performs the direct printing in the wrong
> context. Really we need that final pr_emerg("Power down\n") to be a
> successful direct print so that the buffer is fully flushed before
> hitting machine_power_off(). In fact, the below patch could make things
> worse because the printing kthread could take over printing from that
> final pr_emerg().
> 
> John Ogness
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index ea3dd55709e7..5950586008fa 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3841,6 +3841,18 @@ static int printk_kthread_func(void *data)
>  		console_kthread_printing_exit();
>  
>  		mutex_unlock(&con->lock);
> +
> +		/*
> +		 * The kernel may have transitioned to a direct printing
> +		 * state, but any printk calls may not have direct printed
> +		 * because this thread was printing its message. Grab and
> +		 * release the console lock to flush out any pending
> +		 * messages on all consoles.
> +		 */
> +		if (allow_direct_printing()) {
> +			console_lock();
> +			console_unlock();
> +		}
>  	}
>  
>  	con_printk(KERN_INFO, con, "printing thread stopped\n");