Message-ID: <20061116234358.GJ11034@melbourne.sgi.com>
Date: Fri, 17 Nov 2006 10:43:58 +1100
From: David Chinner <dgc@....com>
To: linux-kernel@...ith.clara.net
Cc: linux-kernel@...r.kernel.org, xfs@....sgi.com
Subject: Re: GPF oops on 2.6.18-1.2200.fc5 and repeated DWARF2 unwinder XFS errors under 2.6.18-1.2239.fc5

On Wed, Nov 15, 2006 at 03:06:16PM +0000, linux-kernel@...ith.clara.net wrote:
>
> Hi,
>
> I just started up a new box yesterday with Fedora Core 5. It's running with
> 2 dual-core AMD Opteron 2220 SEs and 24GB of memory and an Adaptec SCSI
> card and I've had a number of errors which I can't seem to find solutions
> for. I'd had no end of problems with spinlock issues in the aacraid driver
> in the 2.6.17 series on another dual Opteron box, but on hitting
> 2.6.18-1.2200 these went away, so I started the new box off with
> 2.6.18-1.2200 as well. As I understand it, this is 2.6.18.1 as compiled
> by Red Hat/Fedora and includes various DWARF2 unwinder fixes.
>
> Well this caused a GPF and the following trace:
>
> -----------
>
> general protection fault: 0000 [1] SMP
> last sysfs file: /class/net/sit0/address
> CPU 1
> Modules linked in: nls_utf8 ipv6 ip_conntrack_ftp ip_conntrack_netbios_ns ipt_owner ipt_LOG xt_limit ipt_REJECT xt_tcpudp xt_state ip_conntrack nfnetlink iptable_filter ip_tables x_tables xfs dm_mod video sbs i2c_ec button battery asus_acpi ac lp parport_pc parport ide_cd cdrom sg ehci_hcd ohci_hcd i2c_nforce2 i2c_core forcedeth serio_raw k8_edac edac_mc shpchp pcspkr ext3 jbd sata_nv libata aacraid sd_mod scsi_mod
> Pid: 1093, comm: gawk Not tainted 2.6.18-1.2200.fc5 #1
> RIP: 0010:[<ffffffff8826b4c5>] [<ffffffff8826b4c5>]
> :xfs:xfs_bmap_search_extents+0x1c/0xcb
> RSP: 0018:ffff8105fd653b40 EFLAGS: 00010202
> RAX: ffffffff806785a0 RBX: ffff8105fd653d28 RCX: ffff8105fd653d70
> RDX: 0000000000000000 RSI: 00000000000033ce RDI: ffff8102fe801080
> RBP: ffff8105fd653b40 R08: ffff8105fd653d6c R09: ffff8105fd653d28
> R10: ffff8105fd653d70 R11: ffff8102f4655250 R12: ffff8105fd653d6c
> R13: ffff8105ff04d800 R14: 0007ffffffffcc32 R15: ffff8105fd653de8
> FS: 00002aaaab093e00(0000) GS:ffff8102ffc3b1c0(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00002aaaaae4a020 CR3: 0000000000201000 CR4: 00000000000006e0
> Process gawk (pid: 1093, threadinfo ffff8105fd652000, task
> ffff8105fd4f4810)
> Stack: ffff8102fe801080 0000000000000005 0000000000000000 ffff8105ff04d800
> ffffffff8826b972 ffff8105fd653d08 0000000000000007 0000000000000048
> 0000000000000000 000000000000029b 0000000000100000 ffff8105fd653c18
> Call Trace:
> [<ffffffff8826b972>] :xfs:xfs_bmapi+0x2d2/0x1b66
> [<ffffffff8829dfba>] :xfs:xfs_inactive_free_eofblocks+0xa3/0x1ec
> [<ffffffff882a13cc>] :xfs:xfs_release+0x97/0xc8
> [<ffffffff882a820e>] :xfs:xfs_file_release+0x1a/0x1e
> [<ffffffff8021239b>] __fput+0xbf/0x1aa
> [<ffffffff8021a4de>] remove_vma+0x4e/0x75
> [<ffffffff8023a035>] exit_mmap+0xcf/0xf3
> [<ffffffff8023c1c1>] mmput+0x41/0x96
> [<ffffffff802150e2>] do_exit+0x28c/0x8c3
> [<ffffffff80247d0e>] cpuset_exit+0x0/0x6c
> [<00002aaaab089888>]
>
>
> Code: 18 4c 8b 4c 24 40 65 8b 0c 25 2c 00 00 00 48 63 c9 48 8b 0c
> RIP [<ffffffff8826b4c5>] :xfs:xfs_bmap_search_extents+0x1c/0xcb
> RSP <ffff8105fd653b40>
> <1>Fixing recursive fault but reboot is needed!
>
> -----------
>
> At the time the box was sitting there doing nothing but running openssh.
> (This gawk process seems to be from anacron kicking in 'makewhatis').
> The machine didn't die but didn't seem happy. In searching I discovered a
> number of people with the same message "general protection fault: 0000 [1]
> SMP" on lots of different processes so I assumed that it wasn't related
> to the XFS drivers directly, but to a problem somewhere else which is
> being triggered by the dual-core Opterons (could heat be a factor as it's
> just sitting on a desk in the office not in a machine room?).
>
> Anyway since this had happened I decided to upgrade to the next Fedora
> kernel 2.6.18-1.2239.fc5 which appears to be 2.6.18.2 + some Red Hat/Fedora
> patches (mostly for Xen, which I'm not running). This sat there for a few
> hours and hadn't thrown an error so I decided to upload some data to it
> overnight ready for the morning. As soon as I did I started getting
> traces for:
>
>
> -----------
> Filesystem "sda5": XFS internal error xfs_btree_check_sblock at line 334 of
> file fs/xfs/xfs_btree.c. Caller 0xffffffff8825e203
>
> Call Trace:
> [<ffffffff802691d9>] show_trace+0x34/0x47
> [<ffffffff802691fe>] dump_stack+0x12/0x17
> [<ffffffff88272bb4>] :xfs:xfs_btree_check_sblock+0xbc/0xcc
> [<ffffffff8825e203>] :xfs:xfs_alloc_lookup+0x14f/0x39a
> [<ffffffff8825bed3>] :xfs:xfs_alloc_ag_vextent+0x74/0xf61
> [<ffffffff8825d116>] :xfs:xfs_alloc_fix_freelist+0x356/0x410
> [<ffffffff8825d54a>] :xfs:xfs_alloc_vextent+0x2ae/0x400
> [<ffffffff8826b578>] :xfs:xfs_bmapi+0xed6/0x1b66
> [<ffffffff8828ba33>] :xfs:xfs_iomap_write_allocate+0x257/0x3fc
> [<ffffffff8828aa3a>] :xfs:xfs_iomap+0x31a/0x521
> [<ffffffff882a38f0>] :xfs:xfs_map_blocks+0x2f/0x5f
> [<ffffffff882a3c46>] :xfs:xfs_page_state_convert+0x2b7/0xb63
> [<ffffffff882a4724>] :xfs:xfs_vm_writepage+0xa7/0xde
> [<ffffffff8021c78f>] mpage_writepages+0x1d0/0x395
> [<ffffffff80259e0f>] do_writepages+0x23/0x32
> [<ffffffff8024e2b8>] __filemap_fdatawrite_range+0x54/0x5e
> [<ffffffff882a779d>] :xfs:fs_flush_pages+0x4b/0x64
> [<ffffffff882a71ec>] :xfs:xfs_file_close+0x2a/0x2e
> [<ffffffff80223b7f>] filp_close+0x36/0x64
> [<ffffffff8021d873>] sys_close+0x8f/0xaa
> [<ffffffff8025c181>] tracesys+0xd1/0xdc
> DWARF2 unwinder stuck at tracesys+0xd1/0xdc
> Leftover inexact backtrace:
> -----------

You've got a corrupt freelist btree block. How were you uploading
files to the machine?

Can you cc bug reports involving XFS to the xfs@....sgi.com list
in future? (added to this reply)

> I first booted into 2.6.18-1.2239.fc5 in single user mode and forced a
> check of the disk with xfs_repair and I'm using xfs-progs-2.8.11 as
> I discovered on my other system that the 2.6.17 XFS kernel driver bugs
> were breaking the FS in a way that the xfs-progs-2.7.x code didn't fix.
>
> These XFS bugs seem to be the same problems that were cropping up in the
> 2.6.17 series which were resolved in 2.6.18.1 (2.6.18-1.2200.fc5).
>
> Any suggestions are greatly appreciated. Also please let me know if more
> details are required.

The 2.6.17 problems can leave on-disk corruption that is not tripped
over until some time later on - even after a kernel upgrade.
Running the latest repair over all your XFS filesystems that were in
use on 2.6.17.x (x <= 6) really needs to be done regardless of
whether you've tripped over corruption or not.

However, this could be a result of the problems you've been having
with the aacraid driver, and not an XFS problem at all....
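
A minimal sketch of how one might shortlist filesystems for the repair
pass suggested above, assuming Python is available on the box. The helper
names (is_affected_kernel, xfs_mounts) are made up for illustration. The
version test only looks at the upstream 2.6.17.x base of a release string,
so vendor kernels are judged by their base version, which is an assumption.
It only prints suggested xfs_repair -n (no-modify) commands and runs
nothing itself:

```python
import re

# Range named above: 2.6.17.x with x <= 6 (plain 2.6.17 is treated as x = 0).
MAX_AFFECTED_STABLE = 6


def is_affected_kernel(release: str) -> bool:
    """True if the release string has an upstream base of 2.6.17 to 2.6.17.6.

    Assumption: vendor suffixes (e.g. "-1.2187_FC5") are ignored, so a
    distribution kernel is judged only by its upstream base version.
    """
    match = re.match(r"2\.6\.17(?:\.(\d+))?(?!\d)", release)
    if match is None:
        return False
    stable = int(match.group(1)) if match.group(1) else 0
    return stable <= MAX_AFFECTED_STABLE


def xfs_mounts(mounts_text: str) -> list:
    """Return (device, mountpoint) pairs for XFS entries in /proc/mounts text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mountpoint, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] == "xfs":
            found.append((fields[0], fields[1]))
    return found


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as handle:
            mounts = xfs_mounts(handle.read())
    except OSError:
        mounts = []
    for device, mountpoint in mounts:
        # xfs_repair needs the filesystem unmounted; -n only reports problems.
        print(f"umount {mountpoint} && xfs_repair -n {device}")
```

The printed commands are suggestions only. A root filesystem on XFS cannot
be unmounted while the system is running from it, so it would have to be
checked from rescue media or similar.
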
Cheers,
Dave.

> Should I just simply go back to ext3? I'd prefer not to because of the
> fsck'ing time on a 1TB array, but if it means that the kernel doesn't throw
> a hissy fit then I'll be more than happy to do that.
>
> Regards,
> Colin.
>
> thor# uname -a
> Linux thor 2.6.18-1.2239.fc5 #1 SMP Fri Nov 10 12:51:06
> EST 2006 x86_64 x86_64 x86_64 GNU/Linux
>
> thor# cat /proc/cmdline
> ro root=LABEL=/
>
> Adaptec aacraid driver (1.1-5[2409]-mh2)
>
>
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 65
> model name : Dual-Core AMD Opteron(tm) Processor 2220 SE
> stepping : 2
> cpu MHz : 2800.000
> cache size : 1024 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 2
> fpu : yes
> fpu_exception : yes
> cpuid level : 1
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
> bogomips : 5639.77
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 40 bits physical, 48 bits virtual
> power management: ts fid vid ttp tm stc
>
>
>
> --
> "Developers are like artists; they produce their best work if they
> have the freedom to do so" - Werner Vogels, CTO Amazon.com
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Dave Chinner
Principal Engineer
SGI Australian Software Group