linux-kernel - Machine crashes right *after* ~successful resume

Message-ID: <54347520.3050109@gaast.net>
Date:	Wed, 08 Oct 2014 00:20:00 +0100
From:	Wilmer van der Gaast <wilmer@...st.net>
To:	rafael.j.wysocki@...el.com, linux-kernel@...r.kernel.org
Subject: Machine crashes right *after* ~successful resume

Hello,

Rafael, including you on this since 
http://linuxconcloudopenna2013.sched.org/event/d708f47d07cd44b9669610778c024708#.VDRzTDS_EUF 
mentions you as the maintainer for Linux + power management. I hope this 
is still accurate.

Since Linux 3.12 (Debian version 3.12.9-1~bpo70+1) and all the way up to 
3.16 (Debian version 3.16.3-2), I've been having suspend-resume issues on 
my machine (Intel Z68, i7-3770K) that are somewhat hard to pin down.

After every boot, I get two successful suspend+resume cycles, but after 
the third suspend, it won't resume successfully. I've never had anything 
useful logged on the VGA console, but I've had more luck over the serial 
console. I seem to get as far as:

[  153.787678] PM: resume of devices complete after 3797.737 msecs
[  153.787775] PM: resume devices took 3.796 seconds
[  154.238612] Restarting tasks ... done.

And indeed, while testing I was running a "ping -i0.01" to a host on my 
network, and it managed to get a few packets out. Timing already seems 
quite off though:

22:11:49.515489 IP 192.168.44.101 > 192.168.44.100: ICMP echo request, id 3074, seq 894, length 64
22:11:49.982265 IP 192.168.44.101 > 192.168.44.100: ICMP echo request, id 3074, seq 895, length 64
22:11:50.986779 IP 192.168.44.101 > 192.168.44.100: ICMP echo request, id 3074, seq 896, length 64

Note the gaps that are 0.4-1.0s instead of the 0.01s they should've 
been. To me these pings going *out* sound like userland's definitely 
waking up for a while, or at least some processes are. Also, for several 
seconds even during earlier stages of the resume, the machine is already 
responding to echo requests.
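
As a rough illustration, the gaps can be computed from the three tcpdump 
timestamps above with a few lines of Python. The function name 
gaps_seconds is made up for this sketch; only the timestamps come from 
the capture.

```python
from datetime import datetime

# Timestamps copied from the tcpdump output above.
STAMPS = ["22:11:49.515489", "22:11:49.982265", "22:11:50.986779"]


def gaps_seconds(stamps):
    """Return the gap in seconds between each consecutive pair of timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in gaps_seconds(STAMPS):
        # ping was started with -i0.01, so each gap should be about 0.010s.
        print(f"{gap:.3f}s (expected about 0.010s)")
```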

Sadly, after this message on my serial console and these few ICMP 
packets, the machine locks up quite hard, to the point that SysRq 
doesn't respond anymore. :-(

This has been happening for a while already and makes suspend+resume 
mostly useless on my machine. What other debugging info can I provide to 
help with getting this fixed?

I've found out about pm_trace, which always points at the same line (and 
no device):

/var/log/syslog.1:Oct 10 16:43:58 ruby kernel: [    0.780503]   Magic number: 0:52:740
/var/log/syslog.1:Oct 10 16:43:58 ruby kernel: [    0.780599]   hash matches /tmp/linux-3.16.3/drivers/base/power/main.c:812

In my source tree that line is:

         TRACE_RESUME(error);

This is right at the end of device_resume(), under the Complete: label. 
Note that I might have to redo this though, as I now realise I had only 
recompiled my *kernel* with the PM_TRACE_RTC flag set, not all my 
modules, which I assume is not enough. (I'm thinking of filing a Debian 
bug requesting this flag to be enabled by default.) However, since the 
kernel seems to declare the resume as complete, I'm not sure whether 
pm_trace is still of any use.
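
For anyone wanting to pull these pm_trace lines out of a log 
automatically, here is a minimal Python sketch. The function name 
parse_pm_trace and the two regular expressions are assumptions based only 
on the two syslog lines quoted above, not on a specification of the 
kernel's output format.

```python
import re

# Patterns modelled on the two syslog lines quoted above (an assumption,
# not a guaranteed kernel log format).
MAGIC_RE = re.compile(r"Magic number:\s*(\d+):(\d+):(\d+)")
HASH_RE = re.compile(r"hash matches\s+(\S+):(\d+)")

SAMPLE = [
    "Oct 10 16:43:58 ruby kernel: [    0.780503]   Magic number: 0:52:740",
    "Oct 10 16:43:58 ruby kernel: [    0.780599]   hash matches "
    "/tmp/linux-3.16.3/drivers/base/power/main.c:812",
]


def parse_pm_trace(lines):
    """Collect the magic number and any 'hash matches file:line' locations."""
    result = {"magic": None, "matches": []}
    for line in lines:
        magic = MAGIC_RE.search(line)
        if magic:
            result["magic"] = tuple(int(x) for x in magic.groups())
        hit = HASH_RE.search(line)
        if hit:
            result["matches"].append((hit.group(1), int(hit.group(2))))
    return result


if __name__ == "__main__":
    print(parse_pm_trace(SAMPLE))
```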

With kernels 3.10 and older I have no such problems; I can 
suspend+resume as often as I want.

I've already tried to skip the NVidia + VMware modules at boot time (as 
you can see from the logs they're not loaded at any point), but it 
didn't help. I could try omitting more modules.

I'm attaching a full dmesg of boot + a few suspend+resume cycles in 3.10 
and 3.16, and a dump of the serial console showing the last resume cycle 
(which I couldn't get from dmesg of course).

You might notice the message about s2ram segfaulting. I've looked at 
that and it seems to be in VBE-related code, but this problem occurs even 
when I just echo mem to /sys/power/state directly without using s2ram, so 
I assume it's not related.
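
A minimal sketch of that direct method, for anyone trying to reproduce 
this without s2ram: it writes the suspend-to-RAM keyword (mem) to 
/sys/power/state after checking that the kernel lists it. The helper name 
request_suspend and its parameters are made up for illustration; it needs 
root on a real system.

```python
from pathlib import Path


def request_suspend(state_file="/sys/power/state", keyword="mem"):
    """Write a sleep-state keyword to the sysfs file if the kernel lists it."""
    path = Path(state_file)
    # Reading the file returns the states the kernel supports,
    # e.g. "freeze standby mem disk".
    supported = path.read_text().split()
    if keyword not in supported:
        raise ValueError(f"{keyword!r} not in supported states {supported}")
    # On a real system this write only returns once the machine has resumed.
    path.write_text(keyword)


if __name__ == "__main__":
    request_suspend()
```

The state file is a parameter so the helper can be pointed at a scratch 
file for a dry run instead of the real sysfs entry.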

Sorry for the long message. I'd love some ideas for troubleshooting an 
issue like this.

"Attachments" in http://roy.gaast.net/~wilmer/.lkml/ since I just 
realised >200KB of attachments might not be appreciated. :-)

Cheers,

Wilmer van der Gaast.

-- 
+-------- .''`.     - -- ---+  +        - -- --- ---- ----- ------+
| wilmer : :'  :  gaast.net |  | OSS Programmer   www.bitlbee.org |
| lintux `. `~'  debian.org |  | Full-time geek  wilmer.gaast.net |
+--- -- -  ` ---------------+  +------ ----- ---- --- -- -        +