Message-Id: <20140327103525.GF426@ofan>
Date: Thu, 27 Mar 2014 06:35:25 -0400
From: dafreedm@...il.com
To: Jan Kara <jack@...e.cz>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Guennadi Liakhovetski <g.liakhovetski@....de>,
LKML <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, Theodore Ts'o <tytso@....edu>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
Jens Axboe <axboe@...nel.dk>, dafreedm@...il.com
Subject: Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...
Hi,
I've attached another oops (initial one from untainted kernel, and
then successive ones) on the same machine.
Please see the HW stress-testing I've already done below (without
seeing such an oops). Any further suggestions?
Also, how can I tell from the registers you decoded (below) that it's
a bit-flip? (That way I can look at this stuff more myself,
perhaps)...
Thanks.
On Sun, Mar 23, 2014, Daniel Freedman wrote:
> > Hum, so decodecode shows:
> > ...
> > 26: 48 85 c0 test %rax,%rax
> > 29: 74 10 je 0x3b
> > 2b:* 0f b7 80 ac 05 00 00 movzwl 0x5ac(%rax),%eax <-- trapping instruction
> > 32: 66 85 c0 test %ax,%ax
> > ...
> >
> > And the register has:
> > RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> >
> > So that looks like a bitflip in the upper byte.
>
> Just for my own knowledge / growth --- how can you tell there's a
> "bitflip" in the upper byte?
>
> > So I'd check the hardware first...
>
>
> Yes, I absolutely did check the HW first --- and repeatedly (over a
> couple of weeks) --- before reaching out to LKML.
>
> As described in my original email below, here's what I've done so far:
>
> I've been very extensively testing all of the likely culprits among
> hardware components on both of my servers --- running memtest86 upon
> boot for 3+ days, memtester in userspace for 24 hours, repeated
> kernel compiles with various '-j' values, and the 'stress' and
> 'stressapptest' load generators (see below for full details) --- and
> I have never seen even a hiccup in server operation under such
> "artificial" environments --- however, it consistently occurs with
> heavy md5sum operation, and randomly at other times.
>
> More specifically, here are the exact steps I took to try to implicate
> the HW:
>
> aptitude install memtest86+ # reboot and run for 3+ days
>
> aptitude install memtester
> memtester 30G
>
> aptitude install linux-source
> cp /usr/src/linux-source-3.2.tar.bz2 /root/
> tar xvfj linux-source-3.2.tar.bz2
> cd linux-source-3.2/
> make defconfig
> time make 1>LOG 2>ERR
> make mrproper
> make defconfig
> time make -j16 1>LOG 2>ERR
>
> aptitude install stress
> stress --cpu 8 --io 4 --vm 2 --timeout 10s --dry-run
> stress --cpu 8 --io 4 --vm 2 --hdd 3 --timeout 60s
> stress --cpu 8 --io 8 --vm 8 --hdd 4 --timeout 5m
>
> aptitude install stressapptest
> stressapptest -m 8 -i 4 -C 4 -W -s 30
> stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1gb -s 30
> stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -s 30
> stressapptest -m 8 -i 4 -C 4 -W --cc_test -s 30
> stressapptest -m 8 -i 4 -C 4 -W --local_numa -s 30
> stressapptest -m 8 -i 4 -C 4 -W -n 127.0.0.1 --listen -s 30
> stressapptest -m 12 -i 6 -C 8 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -n 127.0.0.1 --listen -s 300
>
>
> As mentioned earlier --- I just could not make it oops doing the
> above! (or get any errors in the standalone memtest86+ procedure).
>
> What do you think? Should I just keep on stress-testing it somewhat
> indefinitely? Also, please recall that I have two of the identical
> machines, and I suffer the same problems with both of them (and they
> both pass the above artificial stress-testing).
>
> Thoughts or suggestions, please, for me to explore further...
>
> Thanks again!
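
On the bit-flip question quoted above, here is a minimal sketch of how the
RAX value could be checked by hand. It rests on two assumptions that are not
stated in the thread: that the CPU uses 48-bit virtual addresses (so bits
63..47 of a valid pointer must all be equal), and that the intended pointer
was in the 0xffff8800... kernel direct-mapping range, which is what 3.x
kernels on x86_64 normally use. The helper names are made up for
illustration.

```python
# Sketch only: checks whether the RAX value from the oops is a canonical
# x86_64 address and which bits differ from the assumed intended pointer.

def is_canonical(addr):
    """True if bits 63..47 are all 0 or all 1 (48-bit virtual addressing)."""
    top = addr >> 47          # the upper 17 bits of the 64-bit value
    return top == 0 or top == 0x1FFFF

def flipped_bits(observed, expected):
    """Return the bit positions (0 = least significant) that differ."""
    diff = observed ^ expected
    return [bit for bit in range(64) if (diff >> bit) & 1]

observed = 0xF7FF880037267140  # RAX as reported in the oops
expected = 0xFFFF880037267140  # ASSUMPTION: same pointer with upper bits all set

print(is_canonical(observed))            # False: the CPU faults on such a pointer
print(is_canonical(expected))            # True
print(flipped_bits(observed, expected))  # [59]: one bit differs, in the top byte
```

Under those assumptions the observed value differs from a plausible kernel
pointer in exactly one bit (f7 vs ff in the top byte), which is the pattern
usually read as a single-bit flip rather than a wild or uninitialized
pointer.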
View attachment "KernelOops" of type "text/plain" (23903 bytes)