linux-kernel - Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...

Message-ID: <Pine.LNX.4.64.1403192239560.5202@axis700.grange>
Date:	Wed, 19 Mar 2014 22:47:56 +0100 (CET)
From:	Guennadi Liakhovetski <g.liakhovetski@....de>
To:	dafreedm@...il.com
cc:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...

Hi

On Tue, 18 Mar 2014, dafreedm@...il.com wrote:

> First-time poster to LKML, though I've been a Linux user for the past
> 15+ years.  Thanks to you all for your collective efforts at creating
> such a great (useful, stable, etc) kernel...
> 
> Problem at hand: I'm getting consistent kernel oopses (at times,
> hard crashes) on two of my identical servers (they are much more
> common on one of the servers than the other, but I see them on both).
> Please reference the kernel log messages appended to this email [1].

Unfortunately I won't be able to help directly; I'm mostly just CC-ing the 
x86 maintainers. Personally, I would first of all not report any Oopses or 
warnings that occur after the kernel has already been tainted, probably by 
a previous Oops. Secondly, I would try removing modules from the 
configuration to see whether the Oopses still occur. E.g., is dm-crypt 
always in use when you get Oopses, or can you reproduce them without 
encryption?
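
As a rough illustration of the taint check mentioned above, here is a 
minimal Python sketch that decodes the value in /proc/sys/kernel/tainted. 
The bit table is a partial subset taken from the kernel's tainted-kernels 
documentation; the function names (decode_taint, read_taint, 
is_first_failure) are hypothetical helpers made up for this example, not 
an existing tool.

```python
from pathlib import Path

# Partial subset of taint bits, as listed in the kernel's
# tainted-kernels documentation (bit position -> letter, meaning).
TAINT_FLAGS = {
    0: ("P", "proprietary module loaded"),
    1: ("F", "module force-loaded"),
    4: ("M", "machine check exception reported"),
    5: ("B", "bad page referenced"),
    7: ("D", "kernel died recently (Oops or BUG)"),
    9: ("W", "kernel issued a warning"),
    12: ("O", "out-of-tree module loaded"),
}


def decode_taint(value):
    """Return (letter, meaning) pairs for the known bits set in value."""
    return [
        flag
        for bit, flag in sorted(TAINT_FLAGS.items())
        if value & (1 << bit)
    ]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint value of the running kernel."""
    return int(Path(path).read_text().strip())


def is_first_failure(value):
    """True if no earlier Oops (D) or warning (W) has tainted the kernel."""
    return not value & ((1 << 7) | (1 << 9))


if __name__ == "__main__":
    taint = read_taint()
    for letter, meaning in decode_taint(taint):
        print(f"{letter}: {meaning}")
    if is_first_failure(taint):
        print("No prior Oops or warning recorded; the next Oops is worth reporting.")
    else:
        print("Kernel already tainted by an Oops or warning; reboot before collecting logs.")
```

Running this right after boot and again after a crash gives a quick way 
to tell whether a given Oops is the first failure or a follow-on one.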

Thanks
Guennadi

> Though at times the oopses occur even when the system is largely idle,
> they seem to be exacerbated by md5sum'ing all files on a large
> partition as part of archive verification --- say 1 million files
> corresponding to 1 TByte of storage.  If I perform this repeatedly,
> the machines seem to lock up about once a week.  Strangely, other
> typical high-load/high-stress scenarios don't seem to provoke the oops
> nearly so much (see below).
> 
> Naturally, such md5sum usage is putting heavy load on the processor,
> memory, and even power supply, and my initial inclination is generally
> that I must have some faulty components.  Even after otherwise
> ambiguous diagnostics (described below), I'm highly skeptical that
> there's anything here inherent to the md5sum codebase, in particular.
> However, I have started to wonder whether this might be a kernel
> regression...
> 
> For reference, here's my setup:
> 
>   Mainboard:  Supermicro X10SLQ
>   Processor:  (Single-Socket) Intel Haswell i7-4770S (65W max TDP)
>   Memory:     32GB Kingston DDR3 RAM (4x KVR16N11/8)
>   PSU:        SeaSonic SS-400FL2 400W PSU
>   O/S:        Debian v7.4 Wheezy (amd64)
>   Filesystem: Ext4 (with default settings upon creation) over LUKS
>   Kernel:     Using both:
>                 Linux 3.11.10 ('3.11-0.bpo.2-amd64' via wheezy-backports)
>                 Linux 3.12.9 ('3.12-0.bpo.2-amd64' via wheezy-backports)
> 
> To summarize where I am now: I've been very extensively testing all of
> the likely culprits among hardware components on both of my servers
> --- running memtest86 upon boot for 3+ days, memtester in userspace
> for 24 hours, repeated kernel compiles with various '-j' values, and
> the 'stress' and 'stressapptest' load generators (see [2] for full
> details) --- and I have never seen even a hiccup in server operation
> under such "artificial" environments.  However, the oops consistently
> occurs with heavy md5sum operation, and randomly at other times.
> 
> At least from my past experiences (with scientific HPC clusters), such
> diagnostic results would normally seem to largely rule out most
> problems with the processor, memory, and mainboard subsystems.  The PSU is
> often a little harder to rule out, but the 400W Seasonic PSUs are
> rated at 2--3 times the wattage I should really need, even under peak
> load (given each server's single-socket CPU is 65W at max TDP, there
> are only a few HDs and one SSD, and no discrete graphics at all, of
> course).
> 
> I'm further surprised to see the exact same kernel-crash behavior on
> two separate, but identical, servers, which leads me to wonder if
> there's possibly some regression between the hardware (given that it's
> relatively new Haswell microcode / silicon) and the (kernel?)
> software.
> 
> Any thoughts on what might be occurring here?  Or what I should focus
> on?  Thanks in advance.
> 
> 
> 
> [1] Attached 'KernelLogs' file.
> [2] Attached 'SystemStressTesting' file.
> 

---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/