linux-kernel - Re: Repeated XFS corruption -Corruption of in-memory data detected

Message-ID: <20070731015300.GM31489@sgi.com>
Date:	Tue, 31 Jul 2007 11:53:00 +1000
From:	David Chinner <dgc@....com>
To:	Ryan Bair <ryandbair@...il.com>
Cc:	linux-kernel@...r.kernel.org, xfs@....sgi.com
Subject: Re: Repeated XFS corruption -Corruption of in-memory data detected

[cc xfs@....sgi.com]

On Mon, Jul 30, 2007 at 12:10:52PM -0400, Ryan Bair wrote:
> Kernel: 2.6.18-4-amd64 (Debian 2.6.18.dfsg.1-12etch2) Debian Etch
> System: Dell PowerEdge 1850
> Processor: 3.2 GHz Intel Xeon w/ microcode v1.14a, Hyperthreading disabled.
> RAM: 2x1GB ECC DDR-400
> RAID Controller: Dell PERC5/E using megaraid driver
> 
> I got another unexpected error on my XFS partition today. I was able
> to reboot the system normally and the journal recovered on the
> following mount. Shortly thereafter, the error occurred again. After
> this the filesystem was no longer able to be mounted as the error
> would occur immediately.
> 
> The volume is on a 9.5TB LVM2 volume on a Dell MD1000 loaded with 15
> 750GB drives in a RAID5 set. Writeback is disabled. Memtest86+ was run
> on this system for 48 hours without fault. The system is otherwise
> stable.

<sigh>

You're the second person today to report a software RAID5+XFS corruption on
the 2.6.18-4 Debian kernel. Almost the same signature as well - that is a
corrupted free space btree.

> XFS was able to repair the damage, but previously the drive returned
> to its corrupted state within a few hours of heavy I/O.

The other report was a shutdown before corruption got to disk,
so maybe they are different problems.

Can you post the repair output so we can see what the damage was?
Also, can you post your md/dm config so I can see if I can recreate
a similar config?

Also, seeing as the previous report was caught before corruption
got to disk, I suspected memory corruption of some kind. Can
you enable slab, vm and filesystem debugging for your kernel and
run with that?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group