linux-kernel - Re: ext4: media error but where?

Message-ID: <20140704172104.GA4877@xo-6d-61-c0.localdomain>
Date:	Fri, 4 Jul 2014 19:21:04 +0200
From:	Pavel Machek <pavel@....cz>
To:	Theodore Ts'o <tytso@....edu>,
	kernel list <linux-kernel@...r.kernel.org>,
	adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org
Subject: Re: ext4: media error but where?

Hi!

> > pavel@duo:~$ uname -a
> > Linux duo 3.15.0-rc8+ #365 SMP Mon Jun 9 09:18:29 CEST 2014 i686
> > GNU/Linux
> > 
> > EXT4-fs (sda3): error count: 11
> > EXT4-fs (sda3): initial error at 1401714179: ext4_mb_generate_buddy:756
> > EXT4-fs (sda3): last error at 1401714179: ext4_reserve_inode_write:4877
> > 
> > That sounds like media error to me?
> 
> If you search your system logs since the last fsck, you should find 11
> instances of the "EXT4-fs error" message, which means that some
> file system inconsistencies were detected.  The first error was detected at:
> 
> % date -d @1401714179
> Mon Jun  2 09:02:59 EDT 2014

Interesting. I always assumed 140... was a block number.
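
For reference, the number after "error at" in those messages is a Unix
timestamp (seconds since 1970-01-01 UTC), which is why date -d @... decodes
it. A minimal sketch of the same decoding in Python follows; the regular
expression and the function name decode_error_line are illustrative
assumptions based only on the two log lines quoted above, not a kernel or
e2fsprogs interface.

```python
import re
from datetime import datetime, timezone

# Pattern modelled on the two messages quoted in this thread (an assumption;
# other kernel versions may format the line differently).
ERROR_AT = re.compile(
    r"EXT4-fs \((?P<dev>[^)]+)\): (?P<which>initial|last) error at "
    r"(?P<ts>\d+): (?P<func>\w+):(?P<line>\d+)"
)

def decode_error_line(text):
    """Return the fields of an 'error at' message, or None if it does not match."""
    m = ERROR_AT.search(text)
    if m is None:
        return None
    return {
        "device": m.group("dev"),
        "which": m.group("which"),
        # The value is seconds since the epoch, not a block number.
        "when": datetime.fromtimestamp(int(m.group("ts")), tz=timezone.utc),
        "function": m.group("func"),
        "line": int(m.group("line")),
    }

if __name__ == "__main__":
    sample = "EXT4-fs (sda3): initial error at 1401714179: ext4_mb_generate_buddy:756"
    info = decode_error_line(sample)
    print(info["device"], info["when"].isoformat(), info["function"], info["line"])
```

The result is reported in UTC, so it is four hours ahead of the EDT value
quoted above.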

> ... which means that you haven't rebooted in a month, or your boot
> scripts aren't automatically running fsck, or your clock is
> incorrect.

I suspect something is wrong with the reporting. I got this in the kernel log
_while running fsck_. fsck was clean (take a look at the original email). I got
a weird report with fsck -c: it told me the filesystem was modified, but I
don't think I got any bad blocks there.

I believe my scripts are running fsck automatically, and yes, I rebooted a lot
in the last month. It _may_ be possible that last month this x60 had a
different hard drive, and I copied it bit-by-bit.

> It does seem to happen more often after an unclean shutdown, and there
> does seem to be a very high correlation with eMMC devices.  It's
> possible there is a jbd2 bug that got introduced recently, where ext4
> is modifying some field outside of a journal transaction.  But I
> haven't been able to reproduce this yet in controlled circumstances.
> 
> What I need from people reporting problems: 
> 
> * What is the HDD/SSD/eMMC device involved

SATA HDD; I will get you the exact data.

> * What kernel version were you running

For the last month? Various, 3.10 to 3.16-rc, mostly 3.15+.

> * What distribution are you running (more so I know what the init
>   scripts might or might not have been doing vis-a-vis running fsck
>   after a crash)

Debian 6.
 
> * Was there an unclean shutdown / power drop / hard reset involved?
>   If so, did the HDD/SSD/eMMC lose power, or was the reset button hit
>   on the machine?

A crash in the last month? Probably yes.

> * What sort of workload / application / test program running before
>   the crash, if any?

Just the usual desktop use / kernel development.

> and so they don't need to report anymore info.  I need as many data
> points as possible at this point.

You'll get them.

Is it possible that my fsck is so old it does not clear this "filesystem
had an error in the past" flag? I ask because I strongly suspect that if I
boot with init=/bin/bash and run fsck, it will tell me "all clean", and the
messages will repeat in the middle of the fsck run.
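
One way to check this would be to look at the error fields recorded in the
superblock before and after an fsck run. The sketch below is a hedged
illustration, not part of e2fsprogs: the function names are made up, and the
field offsets (magic at 0x38, s_error_count at 0x194, s_first_error_time at
0x198, s_first_error_func at 0x1A8, s_first_error_line at 0x1C8,
s_last_error_time at 0x1CC) are taken from the documented ext4 on-disk
superblock layout and should be checked against your kernel's headers.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # primary superblock starts 1024 bytes into the device
SUPERBLOCK_SIZE = 1024
EXT4_MAGIC = 0xEF53

def parse_error_fields(raw):
    """Extract the recorded-error fields from a raw superblock buffer."""
    if len(raw) < SUPERBLOCK_SIZE:
        raise ValueError("superblock buffer too short")
    (magic,) = struct.unpack_from("<H", raw, 0x38)
    if magic != EXT4_MAGIC:
        raise ValueError("not an ext2/3/4 superblock")
    error_count, first_time = struct.unpack_from("<II", raw, 0x194)
    (first_line,) = struct.unpack_from("<I", raw, 0x1C8)
    (last_time,) = struct.unpack_from("<I", raw, 0x1CC)
    # Function name is a NUL-padded 32-byte field.
    first_func = bytes(raw[0x1A8:0x1A8 + 32]).split(b"\0", 1)[0].decode("ascii", "replace")
    return {
        "error_count": error_count,
        "first_error_time": first_time,
        "first_error_func": first_func,
        "first_error_line": first_line,
        "last_error_time": last_time,
    }

def read_error_fields(path):
    """Read the primary superblock from a device or image (hypothetical helper)."""
    with open(path, "rb") as dev:
        dev.seek(SUPERBLOCK_OFFSET)
        return parse_error_fields(dev.read(SUPERBLOCK_SIZE))
```

If error_count is still non-zero right after a clean fsck, that would be
consistent with fsck not resetting these fields; reading a block device such
as /dev/sda3 this way normally requires root.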

Best regards,
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html