[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220613042937.GZ1790663@paulmck-ThinkPad-P17-Gen-1>
Date: Sun, 12 Jun 2022 21:29:37 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: John Ogness <john.ogness@...utronix.de>
Cc: linux-kernel@...r.kernel.org, frederic@...nel.org,
pmladek@...e.com, Sergey Senozhatsky <senozhatsky@...omium.org>,
Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [BUG] 8e274732115f ("printk: extend console_lock for per-console
locking")
On Mon, Jun 13, 2022 at 04:18:56AM +0206, John Ogness wrote:
> On 2022-06-10, "Paul E. McKenney" <paulmck@...nel.org> wrote:
> > I have started getting rcutorture shutdown-time hangs when running
> > against recent mainline, and bisected back to the 8e274732115f ("printk:
> > extend console_lock for per-console locking") commit. These hangs go
> > away (or at least their probability drops dramatically) if I build with
> > CONFIG_PREEMPTION=n -and- CONFIG_NO_HZ=y (not n!), at least assuming
> > that I also boot with "nohz_full=0-N".
> >
> > Attempts to debug using rcutorture's "--gdb" argument result in
> > "[Inferior 1 (process 1) exited normally]", but with the same truncated
> > console-log output as when running without "--gdb". This suggests
> > that the kernel really did run to completion and halt as expected,
> > but that the shutdown-time printk() output was lost. Furthermore, if I
> > use the gdb "hbreak" command to set a breakpoint at kernel_power_off(),
> > it really does hit that breakpoint. This would not happen in the case
> > of a kernel hang.
> >
> > So, given that I can hit that breakpoint, what should I ask gdb to
> > show me?
>
> If you also compile with CONFIG_GDB_SCRIPTS=y then you can use the
> defined "lx-dmesg" gdb command to see if the messages are in the
> ringbuffer.
>
> (You may need to add:
>
> add-auto-load-safe-path /path/to/linux/scripts/gdb/vmlinux-gdb.py
>
> to your $HOME/.gdbinit)
>
> But since you are hitting the breakpoint, the messages will be there.
>
> > Alternatively, this reproduces on a variety of x86 platforms, so you
> > should be able reproduce it as follows [1]:
> >
> > git checkout v5.19-rc1
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs "TREE01" --gdb --kconfig "CONFIG_DEBUG_INFO_NONE=n CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y" --trust-make
> >
> > This builds a kernel, boots it, and then tells you how to launch gdb
> > (presumably in some other window). Once you give launch gdb and give
> > it the suggested commands, the kernel runs for two minutes under qemu,
> > then shuts down. I used the following gdb commands to set the breakpoint
> > and run the kernel:
> >
> > target remote :1234 # suggested by the rcutorture script
> > hbreak kernel_power_off # added by me
> > continue # suggested by the rcutorture script
>
> Thanks. Nice helper scripts. With this I could easily reproduce the
> issue.
Glad that they worked nicely for you!
> As I suspected, the final printk's cannot direct print because the
> kthread was printing. Using the below patch did seem to address your
> problem. But this is probably not the way forward.
When I apply it, I still lose output, perhaps due to different timing?
Doing the pr_flush(1000, true) just before the call to kernel_power_off()
has been working quite well thus far, though.
Thanx, Paul
> What I have not figured out is why this problem does not exist when only
> the kthread patch (but not the "extend console_lock" patch) is
> applied. Somehow the console_lock is magically providing some sort of
> synchronization in the end. Or maybe it is just the increased lock
> contention that helps out.
>
> It seems we need some sort of console_flush_on_panic() for non-panic
> situations that is not as violent as console_flush_on_panic().
>
> @Petr, does it make sense to add the below patch to mainline? It is only
> marginally helpful because it performs the direct printing in the wrong
> context. Really we need that final pr_emerg("Power down\n") to be a
> successful direct print so that the buffer is fully flushed before
> hitting machine_power_off(). In fact, the below patch could make things
> worse because the printing kthread could take over printing from that
> final pr_emerg().
>
> John Ogness
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index ea3dd55709e7..5950586008fa 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3841,6 +3841,18 @@ static int printk_kthread_func(void *data)
> console_kthread_printing_exit();
>
> mutex_unlock(&con->lock);
> +
> + /*
> + * The kernel may have transitioned to a direct printing
> + * state, but any printk calls may not have direct printed
> + * because this thread was printing its message. Grab and
> + * release the console lock to flush out any pending
> + * messages on all consoles.
> + */
> + if (allow_direct_printing()) {
> + console_lock();
> + console_unlock();
> + }
> }
>
> con_printk(KERN_INFO, con, "printing thread stopped\n");
Powered by blists - more mailing lists