linux-kernel - Re: 4.4: INFO: rcu_sched self-detected stall on CPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56FABF17.7090608@crc.id.au>
Date:	Wed, 30 Mar 2016 04:44:55 +1100
From:	Steven Haigh <netwiz@....id.au>
To:	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	xen-devel <xen-devel@...ts.xenproject.org>,
	linux-kernel@...r.kernel.org
Cc:	"gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>
Subject: Re: 4.4: INFO: rcu_sched self-detected stall on CPU

On 30/03/2016 1:14 AM, Boris Ostrovsky wrote:
> On 03/29/2016 04:56 AM, Steven Haigh wrote:
>>
>> Interestingly enough, this just happened again - but on a different
>> virtual machine. I'm starting to wonder if this may have something to do
>> with the uptime of the machine - as the system that this seems to happen
>> to is always different.
>>
>> Destroying it and monitoring it again has so far come up blank.
>>
>> I've thrown the latest lot of kernel messages here:
>>      http://paste.fedoraproject.org/346802/59241532
> 
> Would be good to see full console log. The one that you posted starts
> with an error so I wonder what was before that.

Agreed. It started off with me observing this on one VM - but since
trying to get details on that VM - others have started showing issues as
well. It frustrating as it seems I've been playing whack-a-mole to get
more debug on what is going on.

So, I've changed the kernel command line to the following on ALL VMs on
this system:
enforcemodulesig=1 selinux=0 fsck.repair=yes loglevel=7 console=tty0
console=ttyS0,38400n8

In the Dom0 (which runs the same kernel package), I've started a screen
sessions with a screen for each of the DomUs running attached to the
console via 'xl console blah' - so hopefully the next one that goes down
(whichever one that is) will get caught in the console.

> Have you tried this on bare metal, BTW? And you said this is only
> observed on 4.4, not 4.5, right?

I use the same kernel package as the Dom0 kernel - and so far haven't
seen any issues running this as the Dom0. I haven't used it on baremetal
as a non-xen kernel as yet.

The kernel package I'm currently running is for CentOS / Scientific
Linux / RHEL at:
    http://au1.mirror.crc.id.au/repo/el7-testing/x86_64/

I'm using 4.4.6-3 at the moment - which has CONFIG_PREEMPT_VOLUNTARY set
- which *MAY* have increased the time between this happening - or may
have no effect at all. I'm not convinced either way as yet.

With respect to 4.5, I have had reports from another user of my packages
that they haven't seen the same crash using the same Xen packages but
with kernel 4.5. I have not verified this myself as yet as I haven't
gone down the path of making 4.5 packages for testing. As such, I
wouldn't treat this as a conclusive test case as yet.

I'm hoping that the steps I've taken above may give some more
information in which we can drill down into exactly what is going on -
or at least give more pointers into the root cause.

>>
>> Interestingly, around the same time, /var/log/messages on the remote
>> syslog server shows:
>> Mar 29 17:00:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Started Session 1567 of user root.
>> Mar 29 17:00:01 zeus systemd: Starting Session 1567 of user root.
>> Mar 29 17:00:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Started Session 1568 of user root.
>> Mar 29 17:01:01 zeus systemd: Starting Session 1568 of user root.
>> Mar 29 17:08:34 zeus ntpdate[18569]: adjust time server 203.56.246.94
>> offset -0.002247 sec
>> Mar 29 17:08:34 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:08:34 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Started Session 1569 of user root.
>> Mar 29 17:10:01 zeus systemd: Starting Session 1569 of user root.
>> Mar 29 17:10:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Started Session 1570 of user root.
>> Mar 29 17:20:01 zeus systemd: Starting Session 1570 of user root.
>> Mar 29 17:20:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout
>> (limit 1min)!
>> Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed
>> out. Terminating.
>> Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed
>> out. Killing.
>> Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after
>> SIGKILL. Ignoring.
>> Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm
>> timed out. Killing.
>> Mar 29 17:38:26 zeus systemd: systemd-logind.service still around after
>> final SIGKILL. Entering failed mode.
>> Mar 29 17:38:26 zeus systemd: Unit systemd-logind.service entered failed
>> state.
>> Mar 29 17:38:26 zeus systemd: systemd-logind.service failed.
> 
> 
> These may be result of your system not feeling well, which is not
> surprising.
> 
> -boris

-- 
Steven Haigh

Email: netwiz@....id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



Download attachment "signature.asc" of type "application/pgp-signature" (820 bytes)