lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87v9l65d2y.fsf@gmx.net>
Date:   Fri, 08 May 2020 23:30:45 +0200
From:   Stephen Berman <stephen.berman@....net>
To:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org
Subject: Re: power-off delay/hang due to commit 6d25be57 (mainline)

On Wed, 6 May 2020 23:57:13 +0200 Sebastian Andrzej Siewior <bigeasy@...utronix.de> wrote:

> On 2020-05-01 17:46:48 [+0200], Stephen Berman wrote:
>> I'm experiencing a delay or hang in powering off my computer after `halt
>> -d -f -i -p' and I've bisected it to this commit in the mainline tree:
>
> You refer to a normal "poweroff" or is this some kind of "shutdown now"
> kind of thing? Unless I'm mistaken, the `halt' command above will turn
> off the machine without stopping user tasks (as in "reaching shutdown
> level").

What I actually do is `shutdown -h now' but my understanding is that (at
least in LFS) the last thing that happens when `shutdown -h now' runs is
effectively the same as `halt -d -f -i -p'.  The reason I wrote that
above is because I did not start timing immediately when I entered
`shutdown -h now', since it displays messages of the actions it takes,
like unmounting filesystems and so on.  I start timing after the last
message before powering off (or before the hang/delay), which in LFS is
"Bringing down the loopback interface".  When powering off succeeds
normally, it takes ~4 seconds after that message appears.  I've since
tested directly entering `halt -d -f -i -p' with kernel 5.1.0: there
were no messages like with `shutdown -h now' and after ~4 seconds, the
machine powered off, which seems to confirm my understanding.  I also
tested `halt -d -f -i -p' with kernel 5.6.4, and here, too, there were
no messages like with `shutdown -h now' but after ~7 seconds this
appeared in the tty:

  sd 4:0:0:0: [sda] tag#13 timing out command, waited 7s

and after two minutes, this:

  sr 5:0:0:0: tag#14 timing out command, waited 120s

and then I pressed the reset button.

> …
>> The delays have varied in length from ~20 seconds to seeminingly
>> indefinitely long (the longest I've waited before pressing the start
>> button on the machine is 6 minutes).  With kernels prior to this commit,
>> my machine powers off within 4 seconds after the halt invocation.
>
> So you say that normally the machine shuts down immediately but now it
> takes 20secs+ if it shuts down at all?

Yes.

> …
>>   sr 5:0:0:0: tag#21 timing out command, waited 120s
>
> The CD-drive is polled regularly so it notices when a CD us inserted /
> removed. It seems that one of the requests wasn't answered and it ended
> up in timeout.

But the drive is always not in use and empty when I run `shutdown -h
now' (in fact I rarely use that drive at all).  And the message I
referred to above "sd 4:0:0:0: [sda] tag#13 timing out command, waited
7s" seems to indicate that other polling requests also aren't answered.

> …
>> minutes.  But after running emacs in a tty (and then killing it), the
>> delay was seemingly indefinite, as it is with my usual workflow with X,
>> emacs, firefox, etc. running (but I always kill them before invoking
>> `shutdown -h now' from a tty).  During the bisection I tested each
>> kernel build by running startx, emacs, firefox, killing these, then
>> `shutdown -h now'.  I've attached the git log of the bisection augmented
>> with power-off timing notes.  (FWIW, when I reboot the machine with
>> `shutdown -r now' there is no delay with the problematic kernels.)
>
> So reboot is not affected, just the halt. 

Yes.

>                                           Could you please check if the
> "poweroff" variant is also affected?

I have tried `poweroff' and also `shutdown -hP now' and these make no
difference: with kernel 5.1.0 and earlier, the machine powers off
promptly, with 5.2.0 and later it doesn't.

> …
>> Another pointer I got was to the recent commit 62849a96, which fixes a
>> bug due to the commit that causes my problem.  I applied 62849a96 to
>> 5.6.4, but it did not prevent the delay/hang.
>
> You should see a warning if you were facing the problem described in the
> commit.

I didn't see any warning, so that doesn't seem related to my issue.

> …
>>                                     Please let me know if there's any
>> other information I can provide or anything else I can do to help with
>> debugging this issue.  I hope you can fix it.
>
> Can you send dmesg output of the system / lspci -k? I'm mainly
> interested what drivers are bound to storage devices (you can send it
> offlist if you want).

I've attached them (I included dmesg both from the "good" kernel 5.1.0
and the "bad" kernel 5.6.4; they differ somewhat but I don't know if the
differences are significant).

> Can you log the output on the serial console?

How do I do that?

> If the commit you cited is really the problem then it would mean that a
> worker isn't scheduled for some reason. Could you please enable
> CONFIG_WQ_WATCHDOG to see if workqueue core code notices that a worker
> isn't making progress?

How will I know if that happens, is there a specific message in the tty?

Thanks for your reply.

Steve Berman


Download attachment "dmesg-5.1.0" of type "application/octet-stream" (58875 bytes)

Download attachment "dmesg-5.6.4" of type "application/octet-stream" (60735 bytes)

Download attachment "lspci-k-5.6.4" of type "application/octet-stream" (3480 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ