linux-kernel - Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...

Message-ID: <20140327152618.GD18118@quack.suse.cz>
Date:	Thu, 27 Mar 2014 16:26:18 +0100
From:	Jan Kara <jack@...e.cz>
To:	dafreedm@...il.com
Cc:	Jan Kara <jack@...e.cz>, Thomas Gleixner <tglx@...utronix.de>,
	Guennadi Liakhovetski <g.liakhovetski@....de>,
	LKML <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, Theodore Ts'o <tytso@....edu>,
	linux-ext4@...r.kernel.org, Jens Axboe <axboe@...nel.dk>
Subject: Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...

  Sorry for the late reply. I'm at a conference this week...

On Sun 23-03-14 10:26:09, dafreedm@...il.com wrote:
> On Sun, Mar 23, 2014, Jan Kara wrote:
> > On Sat 22-03-14 14:18:39, Thomas Gleixner wrote:
> > > On Sat, 22 Mar 2014, dafreedm@...il.com wrote:
> > > 
> > > > Ahh.  Good call.  I wasn't sophisticated enough with these things to
> > > > ascertain the difference.  I knew to avoid reporting oops/panics with
> > > > kernels tainted with out-of-tree (non-free) modules, but I guess I
> > > > grabbed the wrong lines from the dmesg (namely, subsequent oops after
> > > > the initial one).  Here's a more recent kernel oops (from this
> > > > morning) --- it's the first oops after a fresh reboot:
> > > 
> > > Cc'ing ext4 and block folks.
> >   Hum, so decodecode shows:
> > ...
> >   26:	48 85 c0             	test   %rax,%rax
> >   29:	74 10                	je     0x3b
> >   2b:*	0f b7 80 ac 05 00 00 	movzwl 0x5ac(%rax),%eax		<-- trapping instruction
> >   32:	66 85 c0             	test   %ax,%ax
> > ...
> > 
> >   And the register has:
> > RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> > 
> >   So that looks like a bitflip in the upper byte.
> 
> Just for my own knowledge / growth --- how can you tell there's a
> "bitflip" on the upper byte?
  Kernel addresses start at ffff880000000000. Here RAX should hold a struct
block_device pointer, which is a kernel pointer. But the upper byte is 0xf7
instead of 0xff, so very likely a single bit (0x0800000000000000) got flipped
from 1 to 0.
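
  As a rough illustration of that check (a sketch only, not the tooling used
for the analysis): XOR the observed RAX value with the value it would have
had with its top byte intact. EXPECTED is an assumption, obtained by
restoring the top byte to 0xff, and flipped_bits() is a made-up helper name.

```python
# Sketch: which bits differ between the observed RAX and the presumed
# correct kernel pointer? EXPECTED is an assumption: the observed value
# with its top byte restored to 0xff.
OBSERVED = 0xF7FF880037267140
EXPECTED = 0xFFFF880037267140


def flipped_bits(observed, expected):
    """Return the XOR mask and the positions of all differing bits."""
    mask = observed ^ expected
    return mask, [i for i in range(64) if (mask >> i) & 1]


mask, bits = flipped_bits(OBSERVED, EXPECTED)
# Prints: mask = 0x0800000000000000, bits = [59]
print(f"mask = {mask:#018x}, bits = {bits}")
```

A single set bit in the mask is what makes this look like a bit flip rather
than a stray write, which would normally disturb more than one bit.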

> > So I'd check the hardware first...
> 
> Yes, I absolutely did check the HW first --- and repeatedly (over a
> couple of weeks) --- before reaching out to LKML.
> 
> As described in my original email below, here's what I've done so far:
> 
>   I've been very extensively testing all of the likely culprits among
>   hardware components on both of my servers --- running memtest86 upon
>   boot for 3+ days, memtester in userspace for 24 hours, repeated
>   kernel compiles with various '-j' values, and the 'stress' and
>   'stressapptest' load generators (see below for full details) --- and
>   I have never seen even a hiccup in server operation under such
>   "artificial" environments --- however, it consistently occurs with
>   heavy md5sum operation, and randomly at other times.
  Heh, that's strange. So that makes the faulty hw theory less likely -
especially since you see it on two different machines, as you mention
below. OTOH the next oops you've posted is at a completely different
place. So that could point to some generic problem where we corrupt
memory.
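
  Since the md5sum runs are what seems to trigger the problem, one possible
userspace check is to hash the same unchanged files twice and compare the
digests; a mismatch between two passes would suggest that data got corrupted
somewhere in memory or on the read path. A minimal sketch, assuming the files
are not modified between passes (all function names here are hypothetical):

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_tree(root):
    """Map every regular (non-symlink) file under root to its MD5 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                digests[path] = md5_of_file(path)
    return digests


def mismatches(first, second):
    """Return the sorted paths whose digests differ between two passes."""
    return sorted(p for p in first if p in second and first[p] != second[p])
```

Calling hash_tree() twice on the archive directory and passing both results
to mismatches() gives the list of files to look at. Note that a second pass
may be served from the page cache, so a clean result does not prove the data
was read correctly the first time.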

> More specifically, here are the exact steps I took to try to implicate
> the HW:
> 
>   aptitude install memtest86+  # reboot and run for 3+ days
> 
>   aptitude install memtester
>   memtester 30G
> 
>   aptitude install linux-source
>   cp /usr/src/linux-source-3.2.tar.bz2 /root/
>   tar xvfj linux-source-3.2.tar.bz2
>   cd linux-source-3.2/
>   make defconfig
>   time make 1>LOG 2>ERR
>   make mrproper
>   make defconfig
>   time make -j16 1>LOG 2>ERR
> 
>   aptitude install stress
>   stress --cpu 8 --io 4 --vm 2 --timeout 10s --dry-run
>   stress --cpu 8 --io 4 --vm 2 --hdd 3 --timeout 60s
>   stress --cpu 8 --io 8 --vm 8 --hdd 4 --timeout 5m
> 
>   aptitude install stressapptest
>   stressapptest -m 8 -i 4 -C 4 -W -s 30
>   stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1gb -s 30
>   stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -s 30
>   stressapptest -m 8 -i 4 -C 4 -W --cc_test -s 30
>   stressapptest -m 8 -i 4 -C 4 -W --local_numa -s 30
>   stressapptest -m 8 -i 4 -C 4 -W -n 127.0.0.1 --listen -s 30
>   stressapptest -m 12 -i 6 -C 8 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -n 127.0.0.1 --listen -s 300
> 
> 
> As mentioned earlier --- I just could not make it oops doing the
> above! (or get any errors in the standalone memtest86+ procedure).
> 
> What do you think?  Should I just keep on stress-testing it somewhat
> indefinitely?  Also, please recall that I have two of the identical
> machines, and I suffer the same problems with both of them (and they
> both pass the above artificial stress-testing).
>
> > > > [33488.170415] general protection fault: 0000 [#1] SMP 
> > > > [33488.171351] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi hid_kensington x86_pkg_temp_thermal coretemp joydev kvm_intel hid_generic kvm crct10dif_pclmul snd_hda_intel crc32_pclmul crc32c_intel snd_hda_codec usbhid snd_hwdep hid ghash_clmulni_intel snd_pcm iTCO_wdt iTCO_vendor_support snd_page_alloc aesni_intel i915 snd_seq aes_x86_64 snd_seq_device snd_timer lrw evdev gf128mul glue_helper drm_kms_helper ablk_helper snd psmouse cryptd drm i2c_i801 pcspkr soundcore serio_raw lpc_ich mei_me mei mfd_core video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci xhci_hcd ehci_pci ehci_hcd e1000e libata igb scsi_mod i2c_algo_bit i2c_core dca ptp pps_core fan usbcore usb_common thermal thermal_sys
> > > > [33488.177102] CPU: 0 PID: 340 Comm: jbd2/sdd2-8 Not tainted 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
> > > > [33488.178100] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
> > > > [33488.179095] task: ffff88081b179080 ti: ffff88081b3ee000 task.ti: ffff88081b3ee000
> > > > [33488.180102] RIP: 0010:[<ffffffff811b6b22>]  [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> > > > [33488.181117] RSP: 0018:ffff88081b3efb78  EFLAGS: 00010282
> > > > [33488.182127] RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> > > > [33488.183142] RDX: 00000000000001ff RSI: 000000000148b2ac RDI: 0000000000000000
> > > > [33488.184152] RBP: ffff88081b2a4800 R08: 0000000000000000 R09: ffff8807eccca628
> > > > [33488.185168] R10: 00000000000032ad R11: 0000000000000001 R12: 0000000000000000
> > > > [33488.186182] R13: ffff88081bcf67c0 R14: 00000000532d22fd R15: 000000000148b2ac
> > > > [33488.187194] FS:  0000000000000000(0000) GS:ffff88083fa00000(0000) knlGS:0000000000000000
> > > > [33488.188212] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [33488.189228] CR2: ffffffffff600400 CR3: 000000000180c000 CR4: 00000000001407f0
> > > > [33488.190245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > [33488.191268] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > [33488.192279] Stack:
> > > > [33488.193273]  0000000000000000 ffffffffa01ec1ba ffff88081b179080 ffff88081b2a4800
> > > > [33488.194276]  00000000000032ac 0000000000000000 ffff88081b2a4800 ffff88081b2a4800
> > > > [33488.195265]  ffff8807f36b4688 00000000532d22fd 00000000ffffffff ffffffffa01737d0
> > > > [33488.196246] Call Trace:
> > > > [33488.197222]  [<ffffffffa01ec1ba>] ? ext4_bmap+0x5a/0x110 [ext4]
> > > > [33488.198210]  [<ffffffffa01737d0>] ? jbd2_journal_get_descriptor_buffer+0x30/0x80 [jbd2]
> > > > [33488.199203]  [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> > > > [33488.200196]  [<ffffffffa016aedd>] ? journal_submit_commit_record.isra.12+0x9d/0x200 [jbd2]
> > > > [33488.201193]  [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> > > > [33488.202198]  [<ffffffffa016c1f4>] ? jbd2_journal_commit_transaction+0x11b4/0x18c0 [jbd2]
> > > > [33488.203211]  [<ffffffffa01706cc>] ? kjournald2+0xac/0x240 [jbd2]
> > > > [33488.204219]  [<ffffffff81082d20>] ? add_wait_queue+0x60/0x60
> > > > [33488.205226]  [<ffffffffa0170620>] ? commit_timeout+0x10/0x10 [jbd2]
> > > > [33488.206230]  [<ffffffff81082333>] ? kthread+0xb3/0xc0
> > > > [33488.207230]  [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> > > > [33488.208231]  [<ffffffff814cb70c>] ? ret_from_fork+0x7c/0xb0
> > > > [33488.209224]  [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> > > > [33488.210205] Code: 12 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 85 98 00 00 00 ba ff 01 00 00 48 8b 80 58 03 00 00 48 85 c0 74 10 <0f> b7 80 ac 05 00 00 66 85 c0 0f 85 cd 01 00 00 85 da 0f 85 d2 
> > > > [33488.212277] RIP  [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> > > > [33488.213276]  RSP <ffff88081b3efb78>
> > > > [33488.370823] ---[ end trace cf90c18d45ff9570 ]---
> > > > 
> > > > 
> > > > Thoughts?
> > > > 
> > > > Ingo, Peter, Thomas, any further ideas, please?
> > > > 
> > > > 
> > > > > > Though at times the oops occur even when the system is largely idle,
> > > > > > they seem to be exacerbated by md5sum'ing all files on a large
> > > > > > partition as part of archive verification --- say 1 million files
> > > > > > corresponding to 1 TByte of storage.  If I perform this repeatedly,
> > > > > > the machines seem to lock up about once a week.  Strangely, other
> > > > > > typical high-load/high-stress scenarios don't seem to provoke the oops
> > > > > > nearly so much (see below).
> > > > > > 
> > > > > > Naturally, such md5sum usage is putting heavy load on the processor,
> > > > > > memory, and even power supply, and my initial inclination is generally
> > > > > > that I must have some faulty components.  Even after otherwise
> > > > > > ambiguous diagnostics (described below), I'm highly skeptical that
> > > > > > there's anything here inherent to the md5sum codebase, in particular.
> > > > > > However, I have started to wonder whether this might be a kernel
> > > > > > regression...
> > > > > > 
> > > > > > For reference, here's my setup:
> > > > > > 
> > > > > >   Mainboard:  Supermicro X10SLQ
> > > > > >   Processor:  (Single-Socket) Intel Haswell i7-4770S (65W max TDP)
> > > > > >   Memory:     32GB Kingston DDR3 RAM (4x KVR16N11/8)
> > > > > >   PSU:        SeaSonic SS-400FL2 400W PSU
> > > > > >   O/S:        Debian v7.4 Wheezy (amd64)
> > > > > >   Filesystem: Ext4 (with default settings upon creation) over LUKS
> > > > > >   Kernel:     Using both:
> > > > > >                 Linux 3.11.10 ('3.11-0.bpo.2-amd64' via wheezy-backports)
> > > > > >                 Linux 3.12.9 ('3.12-0.bpo.2-amd64' via wheezy-backports)
> > > > > > 
> > > > > > To summarize where I am now: I've been very extensively testing all of
> > > > > > the likely culprits among hardware components on both of my servers
> > > > > > --- running memtest86 upon boot for 3+ days, memtester in userspace
> > > > > > for 24 hours, repeated kernel compiles with various '-j' values, and
> > > > > > the 'stress' and 'stressapptest' load generators (see [2] for full
> > > > > > details) --- and I have never seen even a hiccup in server operation
> > > > > > under such "artificial" environments --- however, it consistently
> > > > > > occurs with heavy md5sum operation, and randomly at other times.
> > > > > > 
> > > > > > At least from my past experiences (with scientific HPC clusters), such
> > > > > > diagnostic results would normally seem to largely rule out most
> > > > > > problems with the processor, memory, mainboard subsystems.  The PSU is
> > > > > > often a little harder to rule out, but the 400W Seasonic PSUs are
> > > > > > rated at 2--3 times the wattage I should really need, even under peak
> > > > > > load (given each server's single-socket CPU is 65W at max TDP, there
> > > > > > are only a few HDs and one SSD, and no discrete graphics at all, of
> > > > > > course).
> > > > > > 
> > > > > > I'm further surprised to see the exact same kernel-crash behavior on
> > > > > > two separate, but identical, servers, which leads me to wonder if
> > > > > > there's possibly some regression between the hardware (given that it's
> > > > > > relatively new Haswell microcode / silicon) and the (kernel?)
> > > > > > software.
> > > > > > 
> > > > > > Any thoughts on what might be occurring here?  Or what I should focus
> > > > > > on?  Thanks in advance.
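
  The single-bit check on RAX can also be applied to every register in an
oops dump like the one quoted above. The sketch below assumes 48-bit virtual
addresses (bits 63..47 must all be equal for an address to be canonical) and
the "NAME: 16-hex-digits" register layout of this oops; the regular
expression and function names are illustrative only.

```python
import re

# Matches entries such as "RAX: f7ff880037267140" in a register dump.
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b")


def is_canonical(addr):
    """True if bits 63..47 are all equal (48-bit virtual addresses assumed)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


def suspicious_registers(oops_text):
    """Yield (name, value, bit) for each non-canonical register value that
    becomes canonical when exactly one of its upper bits is flipped."""
    for name, hexval in REG_RE.findall(oops_text):
        value = int(hexval, 16)
        if is_canonical(value):
            continue
        for bit in range(47, 64):
            if is_canonical(value ^ (1 << bit)):
                yield name, value, bit


line = "RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000"
for name, value, bit in suspicious_registers(line):
    # Prints: RAX = 0xf7ff880037267140 is one flip of bit 59 away from canonical
    print(f"{name} = {value:#x} is one flip of bit {bit} away from canonical")
```

A hit does not prove a hardware fault, it only flags values that fit the
single-bit-flip pattern and are worth a closer look.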
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR