Message-ID: <20090929161250.GX12922@hexapodia.org>
Date:	Tue, 29 Sep 2009 09:12:50 -0700
From:	Andy Isaacson <adi@...apodia.org>
To:	Theodore Tso <tytso@....edu>, linux-kernel@...r.kernel.org,
	linux-ext4@...r.kernel.org
Subject: Re: hard lockup, followed by ext4_lookup: deleted inode referenced: 524788

On Mon, Sep 28, 2009 at 11:13:08PM -0400, Theodore Tso wrote:
> What this indicates to me is that an inode table block was written to
> the wrong location on disk.  In fact, given large numbers of inode
> numbers involved, it looks like large numbers of inode table blocks
> were written to the wrong location on disk.

Aha, sounds like an excellent theory.
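
For anyone following along, here is a rough sketch of how an inode number
maps to a position in the inode tables under the standard ext2/3/4 layout.
The geometry values in the usage line (8192 inodes per group, 256-byte
inodes, 4096-byte blocks) are assumptions for illustration only; the real
values, and the starting block of each group's inode table, have to be read
from the dumpe2fs output. The function name and return type are hypothetical.

```python
from typing import NamedTuple


class InodeLocation(NamedTuple):
    group: int            # block group that owns the inode
    index_in_group: int   # zero-based slot within that group's inode table
    block_in_table: int   # block offset from the start of the inode table
    byte_offset: int      # byte offset of the inode within that block


def locate_inode(ino: int, inodes_per_group: int,
                 inode_size: int, block_size: int) -> InodeLocation:
    """Map an inode number to its position relative to its group's inode table."""
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    # Inode numbers are 1-based, so shift to a zero-based index first.
    group, index_in_group = divmod(ino - 1, inodes_per_group)
    inodes_per_block = block_size // inode_size
    block_in_table, slot = divmod(index_in_group, inodes_per_block)
    return InodeLocation(group, index_in_group, block_in_table, slot * inode_size)


# Assumed geometry, NOT taken from the affected filesystem.
print(locate_inode(524788, inodes_per_group=8192, inode_size=256, block_size=4096))
```

Adding block_in_table to the group's "Inode table at" start block from
dumpe2fs gives the on-disk block that should hold the inode, which can then
be compared against what is actually stored there.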

> I'm surprised by how many inode table blocks apparently had gotten
> mis-directed.  Almost certainly there must have been some kind of
> hardware failure that must have triggered this.  I'm not sure what
> caused it, but it does seem like your filesystem has been toasted
> fairly badly.

As I said, the machine hung hard while doing a bunch of writes to a USB
thumbdrive and a kernel compile on sda1.  It could be hardware, but I've
been using this laptop as my primary test box for several months and
it's been fairly reliable (as reliable as git-of-the-day is, pretty
much).

I'll run memtest86 and check SMART.

Note that it is running DMAR (the Intel VT-d IOMMU implementation), so it
could be that a DMA got messed up.  Since the logs didn't make it, I
don't know if DMAR reported any DMA protection faults at the time of
failure.  The DMAR on this box has had some issues in the past which
seem to be fixed, but ...

> At this point my advice to you would be to try to recover as much data
> from the disk as you can, and to *not* try to run fsck or mount the

Oh, all the data is well backed-up; this is a seriously bleeding-edge
box.

I've taken a complete image of /dev/sda1 and will be reinstalling it.
The image is from after the kernel remounted / RO.
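
In case it is useful for the repro, a minimal sketch of copying an image in
chunks while computing a checksum, so later copies can be verified against
the original.  This is not the command I used to take the image; the
function name, the paths, and the choice of SHA-256 are all assumptions for
illustration.

```python
import hashlib


def copy_image(src_path: str, dst_path: str, chunk_size: int = 1 << 20):
    """Copy src_path to dst_path in chunks; return (bytes_copied, sha256_hex)."""
    digest = hashlib.sha256()
    total = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means end of the source
                break
            dst.write(chunk)
            digest.update(chunk)
            total += len(chunk)
    return total, digest.hexdigest()


# Hypothetical usage; both paths are placeholders:
# size, checksum = copy_image("/dev/sda1", "/mnt/backup/sda1.img")
```

Working on a copy of the image rather than the original keeps the evidence
intact if a debugging step writes to the filesystem.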

> disk using dd to a backup hard drive first.  If you're really curious
> we could try to look at the dumpe2fs output and see if we can find the
> pattern of what might have caused so many misdirected writes, but
> there's no guarantee that we would be able to find the definitive root
> cause, and from a recovery perspective, it's probably faster and less
> risky to reinstall your system disk from scratch.

I would like to get as close to root cause as possible.  I have a
filesystem image copied away and I'll be attempting to repro the
failure; this is a test system for a large deployment, so I don't want
any issues lurking. :)

Let me know what debug commands you'd like to run.  dumpe2fs output is
at http://web.hexapodia.org/~adi/tmp/dumpe2fs.out

-andy
