linux-ext4 - Re: hard lockup, followed by ext4_lookup: deleted inode referenced: 524788


Message-ID: <20090929161250.GX12922@hexapodia.org>
Date:	Tue, 29 Sep 2009 09:12:50 -0700
From:	Andy Isaacson <adi@...apodia.org>
To:	Theodore Tso <tytso@....edu>, linux-kernel@...r.kernel.org,
	linux-ext4@...r.kernel.org
Subject: Re: hard lockup, followed by ext4_lookup: deleted inode referenced: 524788

On Mon, Sep 28, 2009 at 11:13:08PM -0400, Theodore Tso wrote:
> What this indicates to me is that an inode table block was written to
> the wrong location on disk.  In fact, given large numbers of inode
> numbers involved, it looks like large numbers of inode table blocks
> were written to the wrong location on disk.

Aha, sounds like an excellent theory.

> I'm surprised by how many inode table blocks apparently had gotten
> mis-directed.  Almost certainly there must have been some kind of
> hardware failure that must have triggered this.  I'm not sure what
> caused it, but it does seem like your filesystem has been toasted
> fairly badly.

As I said, the machine hung hard while doing a bunch of writes to a USB
thumbdrive and a kernel compile on sda1.  It could be hardware, but I've
been using this laptop as my primary test box for several months and
it's been fairly reliable (as reliable as git-of-the-day is, pretty
much).

I'll run memtest86 and check SMART.

Note that it is running DMAR (the Intel VT-d IOMMU implementation), so
it could be that a DMA got messed up.  Since the logs didn't make it, I
don't know if DMAR reported any DMA protection faults at the time of
failure.  The DMAR on this box has had some issues in the past which
seem to be fixed, but ...

> At this point my advice to you would be to try to recover as much data
> from the disk as you can, and to *not* try to run fsck or mount the

Oh, all the data is well backed-up; this is a seriously bleeding-edge
box.

I've taken a complete image of /dev/sda1 and will be reinstalling it.
The image is from after the kernel remounted / RO.
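
For anyone wanting to take a similar image and be able to confirm later that
the copy has not changed, here is a minimal Python sketch. It is not the
command that was used for the image above; the function name `copy_image` and
the example paths are hypothetical. It copies a device or file in fixed-size
chunks and returns a SHA-256 digest of the bytes it copied.

```python
import hashlib


def copy_image(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path chunk by chunk.

    Returns the SHA-256 hex digest of the bytes that were copied, so the
    image can be re-checked later. The name and signature are illustrative.
    """
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                # End of the source device or file.
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage (paths are assumptions, reading a device needs root):
# print(copy_image("/dev/sda1", "/mnt/backup/sda1.img"))
```

Reading from a filesystem that is unmounted or mounted read-only, as here,
avoids imaging a filesystem that is changing underneath the copy.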

> disk using dd to a backup hard drive first.  If you're really curious
> we could try to look at the dumpe2fs output and see if we can find the
> pattern of what might have caused so many misdirected writes, but
> there's no guarantee that we would be able to find the definitive root
> cause, and from a recovery perspective, it's probably faster and less
> risk to reinstall your system disk from scratch.

I would like to get as close to root cause as possible.  I have a
filesystem image copied away and I'll be attempting to repro the
failure; this is a test system for a large deployment, so I don't want
any issues lurking. :)

Let me know what debug commands you'd like to run.  dumpe2fs output is
at http://web.hexapodia.org/~adi/tmp/dumpe2fs.out
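
To relate a reported inode number (such as 524788 from the subject line) to
the inode table blocks discussed above, the standard ext2/3/4 layout
arithmetic can be sketched in Python. The geometry values used in the example
call (8192 inodes per group, 256-byte inodes, 4096-byte blocks) are
assumptions for illustration only; the real values have to be read from the
dumpe2fs output, and the helper name `locate_inode` is hypothetical.

```python
def locate_inode(ino, inodes_per_group, inode_size, block_size):
    """Return (group, block_in_table, offset_in_block) for an inode number.

    Inode numbers start at 1. block_in_table is relative to the start of
    that group's inode table; the absolute block is the group's
    "Inode table at" start block (from dumpe2fs) plus this value.
    """
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    index = ino - 1
    group = index // inodes_per_group
    slot = index % inodes_per_group      # position within the group's table
    byte_offset = slot * inode_size      # byte offset within the table
    return group, byte_offset // block_size, byte_offset % block_size


# Assumed geometry, not taken from the actual filesystem:
print(locate_inode(524788, 8192, 256, 4096))  # prints (64, 31, 768)
```

Grouping the inode numbers from the error messages by the block they map to
is one way to see whether the damage lines up with whole inode table blocks.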

-andy