linux-kernel - Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20140327103525.GF426@ofan>
Date:	Thu, 27 Mar 2014 06:35:25 -0400
From:	dafreedm@...il.com
To:	Jan Kara <jack@...e.cz>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	Guennadi Liakhovetski <g.liakhovetski@....de>,
	LKML <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, Theodore Ts'o <tytso@....edu>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	Jens Axboe <axboe@...nel.dk>, dafreedm@...il.com
Subject: Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...

Hi,

I've attached another oops (initial one from untainted kernel, and
then successive ones) on the same machine.

Please see the HW stress-testing I've already done below (without
seeing such an oops).  Any further suggestions?

Also, how can I tell from the registers you decoded (below) that it's
a bit-flip?  (That way I can look at this stuff more myself,
perhaps)...

Thanks.



On Sun, Mar 23, 2014, Daniel Freedman wrote:
> >   Hum, so decodecode shows:
> > ...
> >   26: 48 85 c0                test   %rax,%rax
> >   29: 74 10                   je     0x3b
> >   2b:*        0f b7 80 ac 05 00 00    movzwl 0x5ac(%rax),%eax         <-- trapping instruction
> >   32: 66 85 c0                test   %ax,%ax
> > ...
> >
> >   And the register has:
> > RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> >
> >   So that looks like a bitbflip the upper byte.
> 
> Just for my own knowledge / growth --- how can you tell there's a
> "bitbflip" on the upper byte?
> 
> > So I'd check the hardware first...
> 
> 
> Yes, I absolutely did check the HW first --- and repeatedly (over a
> couple of weeks) --- before reaching out to LKML.
> 
> As described in my original email below, here's what I've done so far:
> 
>   I've been very extensively testing all of the likely culprits among
>   hardware components on both of my servers --- running memtest86 upon
>   boot for 3+ days, memtester in userspace for 24 hours, repeated
>   kernel compiles with various '-j' values, and the 'stress' and
>   'stressapptest' load generators (see below for full details) --- and
>   I have never seen even a hiccup in server operation under such
>   "artificial" environments --- however, it consistently occurs with
>   heavy md5sum operation, and randomly at other times.
> 
> More specifically, here are the exact stept I took to try to implicate
> the HW:
> 
>   aptitude install memtest86+  # reboot and run for 3+ days
> 
>   aptitude install memtester
>   memtester 30G
> 
>   aptitude install linux-source
>   cp /usr/src/linux-source-3.2.tar.bz2 /root/
>   tar xvfj linux-source-3.2.tar.bz2
>   cd linux-source-3.2/
>   make defconfig
>   time make 1>LOG 2>ERR
>   make mrproper
>   make defconfig
>   time make -j16 1>LOG 2>ERR
> 
>   aptitude install stress
>   stress --cpu 8 --io 4 --vm 2 --timeout 10s --dry-run
>   stress --cpu 8 --io 4 --vm 2 --hdd 3 --timeout 60s
>   stress --cpu 8 --io 8 --vm 8 --hdd 4 --timeout 5m
> 
>   aptitude install stressapptest
>   stressapptest -m 8 -i 4 -C 4 -W -s 30
>   stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1gb -s 30
>   stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -s 30
>   stressapptest -m 8 -i 4 -C 4 -W --cc_test -s 30
>   stressapptest -m 8 -i 4 -C 4 -W --local_numa -s 30
>   stressapptest -m 8 -i 4 -C 4 -W -n 127.0.0.1 --listen -s 30
>   stressapptest -m 12 -i 6 -C 8 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -n 127.0.0.1 --listen -s 300
> 
> 
> As mentioned earlier --- I just could not make it oops doing the
> above! (or get any errors in the standalone memtest86+ procedure).
> 
> What do you think?  Should I just keep on stress-testing it somewhat
> indefinitely?  Also, please recall that I have two of the identical
> machines, and I suffer the same problems with both of them (and they
> both pass the above artificial stress-testing).
> 
> Thoughts or suggestions, please, for me to explore further...
> 
> Thanks again!

View attachment "KernelOops" of type "text/plain" (23903 bytes)