[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110907113038.1fed2304@notabene.brown>
Date: Wed, 7 Sep 2011 11:30:38 +1000
From: NeilBrown <neilb@...e.de>
To: Anthony DeRobertis <aderobertis@...rics.net>
Cc: "Linux-kernel mailing list" <linux-kernel@...r.kernel.org>
Subject: Re: Hard lockup in 3.0.3 with Oracle & mdraid check
On Tue, 06 Sep 2011 10:56:50 -0400 Anthony DeRobertis
<aderobertis@...rics.net> wrote:
> [please CC me on replies; I'm not subscribed to linux-kernel]
>
> This is a dual-Xeon E5649, 96GiB RAM, Linux 3.0.3 (+ debian patches, + 1
> local, see bottom of message) running Oracle 11gR2. Oracle is running
> with the data on raw devices, those devices being LVM logical volumes.
> There are several volume groups, each one having on PV. The PV is a
> mdraid RAID10. The underlying disks are 22 SAS, 2 SATA. Most of Oracle's
> IO is going to the SAS disks (the SATA ones are used for archive log).
>
> This had been running fine for several days, until cron fired off a
> check of the raid partitions. Less than a minute after the Oracle ones
> started being checked, the machine died (md0 is /boot, 125 holds
> archive, 126 holds redo, 127 holds data + index)
>
> The machine has ECC memory, and I have seen some corrected errors
> reported (but no uncorrected errors).
>
> [290510.837380] md: data-check of RAID array md0
> [290510.841778] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [290510.847723] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> [290510.857568] md: using 128k window, over a total of 123840k.
> [290510.858748] md: delaying data-check of md125 until md0 has finished (they share one or more physical units)
> [290510.879077] md: delaying data-check of md126 until md0 has finished (they share one or more physical units)
> [290510.879085] md: delaying data-check of md125 until md0 has finished (they share one or more physical units)
> [290510.899702] md: delaying data-check of md127 until md0 has finished (they share one or more physical units)
> [290573.795960] md: md0: data-check done.
> [290573.937561] md: data-check of RAID array md126
> [290573.937564] md: data-check of RAID array md125
> [290573.937567] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [290573.937568] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> [290573.937577] md: using 128k window, over a total of 976635768k.
> [290573.937899] md: data-check of RAID array md127
> [290573.937912] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [290573.937914] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> [290573.937949] md: using 128k window, over a total of 572485632k.
> [290573.994395] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [290574.000317] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> [290574.010139] md: using 128k window, over a total of 143121408k.
> [290624.606210] ------------[ cut here ]------------
> [290624.610918] WARNING: at /home/anthony-ldap/linux/linux-2.6-3.0.0/debian/build/source_amd64_none/kernel/watchdog.c:240 watchdog_overflow_callback+0x96/0xa1()
> [290624.624974] Hardware name: X8DT6
> [290624.628282] Watchdog detected hard LOCKUP on cpu 0
> [290624.632978] Modules linked in: ext2 loop snd_pcm snd_timer snd usbhid hid uhci_hcd soundcore ioatdma ahci snd_page_alloc ehci_hcd tpm_tis libahci i7core_edac tpm psmouse ses ghes dca i2c_i801 tpm_bios hed evdev edac_core libata usbcore serio_raw pcspkr e1000e i2c_core enclosure processor thermal_sys button ext4 mbcache jbd2 crc16 dm_mod raid10 raid1 md_mod shpchp pci_hotplug sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class scsi_mod
> [290624.672916] Pid: 3944, comm: oracle Not tainted 3.0.0-1-amd64 #1
> [290624.679004] Call Trace:
> [290624.681535] <NMI> [<ffffffff810462a8>] ? warn_slowpath_common+0x78/0x8c
> [290624.688422] [<ffffffff8104635a>] ? warn_slowpath_fmt+0x45/0x4a
> [290624.694426] [<ffffffff81091f72>] ? watchdog_overflow_callback+0x96/0xa1
> [290624.701209] [<ffffffff810b30be>] ? __perf_event_overflow+0x101/0x198
> [290624.707734] [<ffffffff810150ec>] ? intel_pmu_enable_all+0x9d/0x144
> [290624.714082] [<ffffffff81018045>] ? intel_pmu_handle_irq+0x40e/0x481
> [290624.720521] [<ffffffff8133a2d4>] ? perf_event_nmi_handler+0x39/0x82
> [290624.726954] [<ffffffff8133bf09>] ? notifier_call_chain+0x2e/0x5b
> [290624.733123] [<ffffffff8133bf80>] ? notify_die+0x2d/0x32
> [290624.738518] [<ffffffff81339b11>] ? do_nmi+0x63/0x206
> [290624.743647] [<ffffffff813395d0>] ? nmi+0x20/0x30
> [290624.748430] [<ffffffff810429f0>] ? try_to_wake_up+0x73/0x18c
> [290624.754259] <<EOE>> <IRQ> [<ffffffff810354a4>] ? __wake_up_common+0x41/0x78
> [290624.761598] [<ffffffff8103a939>] ? __wake_up+0x35/0x46
> [290624.766909] [<ffffffffa009ad46>] ? raid_end_bio_io+0x30/0x76 [raid10]
> [290624.773519] [<ffffffffa009d4f7>] ? raid10_end_write_request+0xdc/0xbe5 [raid10]
> [290624.780995] [<ffffffff811b7c9f>] ? is_swiotlb_buffer+0x26/0x31
> [290624.786998] [<ffffffff81192cb9>] ? blk_update_request+0x1a6/0x35d
> [290624.793261] [<ffffffff81192e81>] ? blk_update_bidi_request+0x11/0x5b
> [290624.799782] [<ffffffff81192fb5>] ? blk_end_bidi_request+0x19/0x55
> [290624.806043] [<ffffffffa0008425>] ? scsi_io_completion+0x1d0/0x48e [scsi_mod]
> [290624.813257] [<ffffffffa0042d4c>] ? mpt2sas_base_free_smid+0x118/0x20e [mpt2sas]
> [290624.820731] [<ffffffff81197303>] ? blk_done_softirq+0x6b/0x78
> [290624.826647] [<ffffffff8104baef>] ? __do_softirq+0xc4/0x1a0
> [290624.832304] [<ffffffff81092bbe>] ? handle_irq_event_percpu+0x166/0x184
> [290624.839001] [<ffffffff8133f49c>] ? call_softirq+0x1c/0x30
> [290624.844571] [<ffffffff8100aa33>] ? do_softirq+0x3f/0x79
> [290624.849966] [<ffffffff8104b8bf>] ? irq_exit+0x44/0xb5
> [290624.855183] [<ffffffff8100a37e>] ? do_IRQ+0x94/0xaa
> [290624.860232] [<ffffffff81339053>] ? common_interrupt+0x13/0x13
> [290624.866140] <EOI> [<ffffffffa0007871>] ? scsi_request_fn+0x468/0x49d [scsi_mod]
> [290624.873723] [<ffffffffa000759a>] ? scsi_request_fn+0x191/0x49d [scsi_mod]
> [290624.880679] [<ffffffff81192aac>] ? blk_flush_plug_list+0x194/0x1d1
> [290624.887030] [<ffffffff813374b8>] ? schedule+0x243/0x61a
> [290624.892425] [<ffffffffa009b18f>] ? wait_barrier+0x8e/0xc7 [raid10]
> [290624.898774] [<ffffffff81042b09>] ? try_to_wake_up+0x18c/0x18c
> [290624.904691] [<ffffffffa009d09b>] ? make_request+0x17b/0x4fb [raid10]
> [290624.911214] [<ffffffff81196e15>] ? ll_new_hw_segment+0x1b/0x7e
> [290624.917221] [<ffffffffa0087f16>] ? md_make_request+0xc6/0x1c1 [md_mod]
> [290624.923913] [<ffffffff81193f06>] ? generic_make_request+0x2cb/0x341
> [290624.930345] [<ffffffffa00ac8c0>] ? dm_get_live_table+0x35/0x3d [dm_mod]
> [290624.937124] [<ffffffff81194056>] ? submit_bio+0xda/0xf8
> [290624.942516] [<ffffffff810be05c>] ? set_page_dirty_lock+0x21/0x29
> [290624.948691] [<ffffffff81125123>] ? dio_bio_submit+0x6c/0x8a
> [290624.954433] [<ffffffff811251af>] ? dio_send_cur_page+0x6e/0x93
> [290624.960427] [<ffffffff81125289>] ? submit_page_section+0xb5/0x135
> [290624.966683] [<ffffffff81125abe>] ? __blockdev_direct_IO+0x670/0x8ed
> [290624.973123] [<ffffffffa0008fe3>] ? scsi_dma_map+0x7a/0x93 [scsi_mod]
> [290624.979643] [<ffffffff81123d8f>] ? blkdev_direct_IO+0x4e/0x53
> [290624.985559] [<ffffffff81123237>] ? blkdev_get_block+0x5b/0x5b
> [290624.991478] [<ffffffff810b74c6>] ? generic_file_aio_read+0xed/0x5c3
> [290624.997914] [<ffffffff810ed40c>] ? virt_to_slab+0x9/0x3c
> [290625.003396] [<ffffffff810b73d9>] ? lock_page_killable+0x2c/0x2c
> [290625.009480] [<ffffffff8112df7c>] ? aio_rw_vect_retry+0x7d/0x180
> [290625.015567] [<ffffffff8112efe5>] ? aio_run_iocb+0x6b/0x132
> [290625.021215] [<ffffffff8112f606>] ? do_io_submit+0x419/0x4c8
> [290625.026951] [<ffffffff8133e292>] ? system_call_fastpath+0x16/0x1b
> [290625.033212] ---[ end trace 3a6a21008417f262 ]---
The implication is that something in this stack is looping endlessly.
Unfortunately it is not clear what - and there are several different
subsystems there that could be blamed.
If this happens again then comparing the new trace with the old could be very
informative - it would point the finger and the highers item in the stack
which is common to both.
Without that extra information - it is just guesswork to go forward.
NeilBrown
> The log keeps on going for a while, with more "rcu_sched_state detected
> stalls on CPUs/tasks: { 0}" messages. It kept spewing them, doing no
> useful work, until I reset the machine. Log (from a serial console) is
> 87M total. 12M compressed. I've placed it at:
> <https://dl.dropbox.com/s/dg8v67kaxolmrub/kuyou-raidcheck-crash.log.gz?dl=1>
>
>
> The extra patch just gets rid of REQ_FLUSH/FUA in mdraid, hence the
> 'eatmydata' in the kernel version. This makes the Oracle workload an
> order of magnitude faster. Here it is:
>
> Index: linux-2.6-3.0.0/drivers/md/md.c
> ===================================================================
> --- linux-2.6-3.0.0.orig/drivers/md/md.c 2011-08-31 15:23:46.474036586 -0400
> +++ linux-2.6-3.0.0/drivers/md/md.c 2011-08-31 15:24:01.294007938 -0400
> @@ -4361,7 +4361,7 @@
> disk->fops = &md_fops;
> disk->private_data = mddev;
> disk->queue = mddev->queue;
> - blk_queue_flush(mddev->queue, REQ_FLUSH | REQ_FUA);
> + blk_queue_flush(mddev->queue, 0); /* omm nom nom, data tasty! */
> /* Allow extended partitions. This makes the
> * 'mdp' device redundant, but we can't really
> * remove it now.
>
>
>
> [please CC me on replies; I'm not subscribed to linux-kernel]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists