Message-ID: <87y2owwo2o.fsf@rub.de>
Date:   Tue, 09 Jun 2020 12:06:23 +0200
From:   Stephen Berman <stephen.berman@....net>
To:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org
Subject: Re: power-off delay/hang due to commit 6d25be57 (mainline)

On Fri, 22 May 2020 18:40:12 +0200 Sebastian Andrzej Siewior <bigeasy@...utronix.de> wrote:

> Sorry for the late reply.

No problem, since as it turned out, I didn't have time till now to
follow up on your latest suggestions.  Details below.

> On 2020-05-14 23:39:40 [+0200], Stephen Berman wrote:
>> >> How will I know if that happens, is there a specific message in the tty?
>> >
>> > On the tty console where you see the "timing out command, waited"
>> > message, there should be something starting with
>> > |BUG: workqueue lockup - pool
>> >
>> > following with the pool information that got stuck. That code checks the
>> > workqueues every 30secs by default. So if you waited >= 60secs then
>> > system is not detecting a stall.
>>
>> As you can see in the photo, there was no message about a workqueue
>> lockup, only "task halt:5320 blocked for more than <XXX> seconds" every
>> two minutes.  I suppose that comes from one of the other options I
>> enabled.  Does it reveal anything about the problem?
>
> From the picture, you are on your way to level 0, which would issue the
> final shutdown command, but you are not quite there yet.
>
> If you add a printk() to the reboot syscall, then I wouldn't expect you
> to see it. (something like that):
>
> diff --git a/kernel/reboot.c b/kernel/reboot.c
> index c4d472b7f1b42..19bc35bc0cda0 100644
> --- a/kernel/reboot.c
> +++ b/kernel/reboot.c
> @@ -314,6 +314,7 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
>  	char buffer[256];
>  	int ret = 0;
>
> +	pr_err("%s(%d)CMD: %x\n", __func__, __LINE__, cmd);
>  	/* We only trust the superuser with rebooting the system. */
>  	if (!ns_capable(pid_ns->user_ns, CAP_SYS_BOOT))
>  		return -EPERM;
>
> If you add "ignore_loglevel initcall_debug" to the command line then you
> should see the init callbacks of each driver. But there will be nothing
> on your shutdown (as I expect it).
>
> The "task X blocked for more than 120 secs" is part of the hung task
> detector. With the "ignore_loglevel" above you should be able to see the
> callchain of the task. I suspect that the task poked the cd-drive which
> isn't answering. So from detector's point of view, the task issued a
> system call which appears to hang and makes no progress.
>
>> > Could
>> > you please check if the stall-detector says something?
>>
>> Is that the message I repeated above or do you mean the workqueue?
>
> The hung message is not workqueue related. It is the task `halt' that
> makes no progress. There is no stall of the workqueue as far as the
> system can tell.
> The two boot options I suggested above may reveal additional
> information that is printed but suppressed due to the loglevel.
>
> My guess now is that maybe shutting down wifi also paused the AHCI
> controller which makes no progress now. So booting without cdrom/disk
> should not cause any problems.
>
> Could you please:
> - try booting with "ignore_loglevel initcall_debug" and see if
>   additional information is printed on the console.
>
> - Remove cd / ATA-disk to check if anything else causes a stall. In
>   your report you only mentioned those two (and if I see it correctly,
>   your rootFS is on nvme so removing the disk might be doable).
>
> - Could you remove the Wifi (just the driver, not the physical hw) to see
>   if it makes any difference?

I recompiled kernel 5.6.4 with the printk() call you suggested, then
booted the kernel with "ignore_loglevel initcall_debug" (but leaving the
CDROM and wifi intact for now).  After working as I normally do, I
called `shutdown -h now', again as usual.  After the "Bringing down the
loopback interface" message there were these two messages:

reboot: __do_sys_reboot(317)CMD: 89abcdef
reboot: __do_sys_reboot(317)CMD: 4321fedc
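
For reference, the two CMD values can be matched against the reboot
command constants in include/uapi/linux/reboot.h.  The following is only
a minimal sketch: the table is copied by hand from that header so that
it builds outside the kernel tree, and reboot_cmd_name() is a
hypothetical helper for illustration, not a kernel function.

```c
#include <stddef.h>

/* One reboot(2) command value and its symbolic name. */
struct reboot_cmd {
	unsigned int value;
	const char *name;
};

/*
 * Assumption: these values mirror the LINUX_REBOOT_CMD_* definitions in
 * include/uapi/linux/reboot.h; they are repeated here so the sketch does
 * not depend on kernel headers being installed.
 */
static const struct reboot_cmd reboot_cmds[] = {
	{ 0x01234567u, "LINUX_REBOOT_CMD_RESTART" },
	{ 0xCDEF0123u, "LINUX_REBOOT_CMD_HALT" },
	{ 0x89ABCDEFu, "LINUX_REBOOT_CMD_CAD_ON" },
	{ 0x00000000u, "LINUX_REBOOT_CMD_CAD_OFF" },
	{ 0x4321FEDCu, "LINUX_REBOOT_CMD_POWER_OFF" },
	{ 0xA1B2C3D4u, "LINUX_REBOOT_CMD_RESTART2" },
	{ 0xD000FCE2u, "LINUX_REBOOT_CMD_SW_SUSPEND" },
	{ 0x45584543u, "LINUX_REBOOT_CMD_KEXEC" },
};

/*
 * Hypothetical helper: return the symbolic name for a cmd value as
 * printed by the pr_err() debug line, or "unknown" if it is not listed.
 */
const char *reboot_cmd_name(unsigned int cmd)
{
	size_t i;

	for (i = 0; i < sizeof(reboot_cmds) / sizeof(reboot_cmds[0]); i++) {
		if (reboot_cmds[i].value == cmd)
			return reboot_cmds[i].name;
	}
	return "unknown";
}
```

Read against this table, 89abcdef is LINUX_REBOOT_CMD_CAD_ON and
4321fedc is LINUX_REBOOT_CMD_POWER_OFF, so the power-off request did
reach the reboot syscall before the hang.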

Then nothing more for two minutes, then, as previously:

sr 5:0:0:0: tag#10 timing out command, waited 120 seconds.

Then I did a hard reboot.

This morning I detached the cables to the CDROM (but left the disk in
the box) and again booted 5.6.4 with "ignore_loglevel initcall_debug".
After working for a bit, I called `shutdown -h now', and now there were
quite a few more messages, but again the machine did not power off.
Here is the output, which I transcribed carefully, hopefully without
error (I omitted the fractional parts of the timestamps and some of the
usb1-portX messages):


Download attachment "no-cdrom-shutdown-5.6.4" of type "application/text/plain" (4495 bytes)


After the last message here, the "echo", Call Trace, RIP and register
messages were repeated exactly as above.  At this point I did a hard
reboot.  A web search found something similar to the above output
("System hangs (bad RIP value) when disk used in pool is removed"), but
didn't otherwise seem to be related to the problem I'm having.

Do you still want me to check whether removing the iwlwifi driver makes
a difference?  And with the CDROM still detached, or does that not
matter?

Steve Berman
