linux-kernel - 4.9.130: CPU soft lockups and other weird memory errors

Message-ID: <cf45a4ee7a201cc56ac9bd0764e4336c93fa6f37.camel@lmu.de>
Date:   Tue, 09 Apr 2019 16:14:09 +0200
From:   Christoph Anton Mitterer <christoph.anton.mitterer@....de>
To:     linux-kernel@...r.kernel.org
Subject: 4.9.130: CPU soft lockups and other weird memory errors

Hey.

Perhaps someone can help with the following problem, which affects a
mass storage cluster at the physics faculty here:

The cluster consists of 40 nodes all running Debian stable with a
4.9.130 kernel serving some ~3 PiB storage via 10GbE networking.
Some of the nodes are Dell PowerEdges/PowerVaults, the others are
HP ProLiant DL380 Gen9.
All of them have basically the same configuration (except of course
obvious things like IP addresses, etc.) and all should have plenty of
memory (the HPs 64 GiB, the Dells 32 GiB).

The following two(?) problems occur only on the HP nodes (which is IMO
some indication that it's a hardware/kernel problem):

HP nodes regularly get stuck, with either some strange memory errors
or CPU soft lockup errors being printed endlessly to the serial console
(see attached files for some examples).

When this starts to happen, the system may come back a few times for
a few seconds, but then it usually ends up in an endless loop of these
errors, from which only a hard reset helps (everything else, like the
serial console and ssh, no longer responds).

The problem seems to occur whenever system load goes up; "higher"
network load in particular seems to trigger it.
I say "higher" because it doesn't seem to take that much. One
example: a node that crashed today had a 1/5/15 min load of ~60 and
received something between 40 and 60 MB/s (and sent basically
nothing).
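
To correlate the lockups with load more precisely, something like the
following could log the load averages and the receive rate right up to
the point where a node hangs. This is only a rough sketch: the interface
name "eth0", the 5 second interval and all function names are
assumptions for illustration, not what is actually deployed on the
cluster. It only reads /proc/loadavg and /proc/net/dev.

```python
import time


def parse_loadavg(text):
    """Return the (1 min, 5 min, 15 min) load averages from /proc/loadavg content."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def parse_rx_bytes(text, iface):
    """Return the received-bytes counter for iface from /proc/net/dev content."""
    for line in text.splitlines():
        # Interface lines look like "  eth0: <rx bytes> <rx packets> ...";
        # the two header lines contain no colon and are skipped.
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            return int(rest.split()[0])
    raise KeyError(iface)


def rx_rate_mb_s(prev_bytes, cur_bytes, seconds):
    """Receive rate in MB/s (10^6 bytes per second) between two counter samples."""
    return (cur_bytes - prev_bytes) / seconds / 1e6


def read_file(path):
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    IFACE = "eth0"   # assumption: replace with the real 10GbE interface name
    INTERVAL = 5.0   # assumption: sampling interval in seconds

    prev = parse_rx_bytes(read_file("/proc/net/dev"), IFACE)
    while True:
        time.sleep(INTERVAL)
        cur = parse_rx_bytes(read_file("/proc/net/dev"), IFACE)
        load = parse_loadavg(read_file("/proc/loadavg"))
        # flush=True so the last samples are not lost in a buffer on a hang
        print(time.strftime("%F %T"), load,
              "rx %.1f MB/s" % rx_rate_mb_s(prev, cur, INTERVAL), flush=True)
        prev = cur
```

Redirecting the output to another host (or to the serial console) would
be preferable to a local file, since the affected nodes need a hard
reset and the last lines might otherwise not reach the disk.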

Any ideas on how to fix this, or how to trace it down further, would be
highly appreciated.

Cheers,
Chris.

Download attachment "mem1.log.xz" of type "application/x-xz" (4680 bytes)

Download attachment "mem2.log.xz" of type "application/x-xz" (3456 bytes)

Download attachment "mem3.log.xz" of type "application/x-xz" (3104 bytes)

Download attachment "mem-followed-by-softlockup.log.xz" of type "application/x-xz" (10304 bytes)

Download attachment "soft-lockup1.log.xz" of type "application/x-xz" (4068 bytes)

Download attachment "soft-lockup2.log.xz" of type "application/x-xz" (3788 bytes)

Download attachment "soft-lockup3.log.xz" of type "application/x-xz" (4092 bytes)