Message-ID: <cf45a4ee7a201cc56ac9bd0764e4336c93fa6f37.camel@lmu.de>
Date:   Tue, 09 Apr 2019 16:14:09 +0200
From:   Christoph Anton Mitterer <christoph.anton.mitterer@....de>
To:     linux-kernel@...r.kernel.org
Subject: 4.9.130: CPU soft lockups and other weird memory errors

Hey.

Perhaps someone can help with the following problem, which affects a
mass storage cluster at the physics faculty here:

The cluster consists of 40 nodes all running Debian stable with a
4.9.130 kernel serving some ~3 PiB storage via 10GbE networking.
Some of the nodes are Dell PowerEdges/PowerVaults, the others are
HP ProLiant DL380 Gen9.
All of them have basically the same configuration (except of course
obvious things like IP addresses, etc.) and all should have plenty of
memory (the HPs 64 GiB, the Dells 32 GiB).

The following two(?) problems occur only on the HP nodes (which is IMO
some indication that it's a hardware/kernel problem):



The HP nodes regularly get stuck, with either strange memory errors or
CPU soft lockup errors being printed endlessly to the serial console
(see the attached files for some examples).

When this starts to happen, the system may come back a few times for
a few seconds, but it usually ends up in an endless loop of these
errors that only a hard reset can break (nothing else, such as the
serial console or ssh, responds any longer).
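
To get a quick overview of which CPUs are affected in captured console
logs, a small script along the following lines may help. This is only a
sketch: it assumes the logs contain the kernel watchdog's usual
"BUG: soft lockup - CPU#N stuck for Ms!" line, and the function names
(count_soft_lockups, count_in_xz) are made up for illustration.

```python
import lzma
import re
from collections import Counter

# Assumed message format of the kernel soft lockup watchdog;
# adjust the pattern if the captured logs look different.
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")


def count_soft_lockups(lines):
    """Count soft lockup reports per CPU number in an iterable of log lines."""
    per_cpu = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            per_cpu[int(match.group(1))] += 1
    return per_cpu


def count_in_xz(path):
    """Same as count_soft_lockups, but reads an xz-compressed log file."""
    # errors="replace" tolerates stray bytes from a serial console capture.
    with lzma.open(path, "rt", errors="replace") as log:
        return count_soft_lockups(log)
```

Calling count_in_xz() on one of the compressed console logs returns a
Counter mapping CPU number to the number of reports, which shows whether
the lockups cluster on particular CPUs or are spread evenly.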

The problem seems to occur whenever system load goes up; "higher"
network load in particular seems to trigger it.
I say "higher" because it doesn't seem to take much. One node that
crashed today, for example, had a 1/5/15 min load of ~60 and
between 40 and 60 MB/s of received traffic (and basically nothing
sent).
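
To correlate the hangs with load and received traffic, the two figures
can be logged periodically until a node locks up. The following is a
minimal sketch, assuming a Linux /proc filesystem; the interface name
"eth0", the 5 second interval and the function names are placeholders,
not values taken from the cluster.

```python
import time


def parse_loadavg(text):
    """Return the 1/5/15-minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_rx_bytes(text, iface):
    """Return the received-bytes counter for iface from /proc/net/dev content."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        # Header lines contain no colon; interface lines are "name: counters".
        if sep and name.strip() == iface:
            return int(rest.split()[0])
    raise KeyError(iface)


def sample(iface="eth0", interval=5.0):
    """Print load averages and RX throughput in MB/s once per interval."""
    with open("/proc/net/dev") as f:
        prev = parse_rx_bytes(f.read(), iface)
    while True:
        time.sleep(interval)
        with open("/proc/net/dev") as f:
            cur = parse_rx_bytes(f.read(), iface)
        with open("/proc/loadavg") as f:
            load = parse_loadavg(f.read())
        rate = (cur - prev) / interval / 1e6
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        # flush so the last lines survive if the node hangs
        print(f"{stamp} load={load} rx={rate:.1f} MB/s", flush=True)
        prev = cur
```

Running sample() with its output redirected to a remote host or the
serial console keeps the last values before a hang available after the
hard reset.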


Any ideas on how to fix this, or on how to track it down further, would
be highly appreciated.


Cheers,
Chris.

Download attachment "mem1.log.xz" of type "application/x-xz" (4680 bytes)

Download attachment "mem2.log.xz" of type "application/x-xz" (3456 bytes)

Download attachment "mem3.log.xz" of type "application/x-xz" (3104 bytes)

Download attachment "mem-followed-by-softlockup.log.xz" of type "application/x-xz" (10304 bytes)

Download attachment "soft-lockup1.log.xz" of type "application/x-xz" (4068 bytes)

Download attachment "soft-lockup2.log.xz" of type "application/x-xz" (3788 bytes)

Download attachment "soft-lockup3.log.xz" of type "application/x-xz" (4092 bytes)
