Date:	Tue, 4 Dec 2012 10:09:28 -0500
From:	Theodore Ts'o <tytso@....edu>
To:	Li Zefan <lizefan@...wei.com>
Cc:	Eric Sandeen <sandeen@...hat.com>,
	Yafang Shao <laoar.shao@...il.com>,
	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
	wuqixuan@...wei.com, wuqixuan@...il.com
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Tue, Dec 04, 2012 at 09:54:05PM +0800, Li Zefan wrote:
> 
> I've collected some logs in different machines, and the error was always
> triggered in ext3_readdir:
> 
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0

This looks like the last part of the inode was zapped.  It might be
worth adding a kernel patch which dumps out the entire directory block
as a hex dump when this triggers --- and then comparing it to what you
get if you dump the directory back out after the machine reboots.  That
might give you a hint as to whether something is corrupting the directory
block in memory (especially if you set the remount read-only option).
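
For comparing the two dumps offline, something along these lines might
help.  This is only a user-space sketch in Python, not the kernel patch
itself: it assumes a 4096-byte block and the classic ext2/ext3
directory entry layout (32-bit inode, 16-bit rec_len, 8-bit name_len,
8-bit file type, then the name), and the function names are made up
for illustration.  The checks loosely follow what ext3_readdir
complains about.

```python
import struct

BLOCK_SIZE = 4096    # assumption: 4 KiB blocks, consistent with offsets near 4096
MIN_REC_LEN = 12     # 8-byte entry header plus a 1-byte name, rounded up to 4


def find_bad_entry(block):
    """Walk the entries in one directory block.

    Returns (offset, reason) for the first entry that fails a sanity
    check, or None if every entry in the block looks consistent.
    """
    offset = 0
    while offset < len(block):
        if offset + 8 > len(block):
            return offset, "truncated entry header"
        inode, rec_len, name_len, _ftype = struct.unpack_from("<IHBB", block, offset)
        if rec_len < MIN_REC_LEN:
            return offset, "rec_len is smaller than minimal"
        if rec_len % 4:
            return offset, "rec_len % 4 != 0"
        if rec_len < ((name_len + 8 + 3) & ~3):
            return offset, "rec_len is too small for name_len"
        if offset + rec_len > len(block):
            return offset, "directory entry across blocks"
        offset += rec_len
    return None


def hex_dump(data, start=0, width=16):
    """Format bytes as 'offset  xx xx ...' lines, one per `width` bytes."""
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        lines.append("%08x  %s" % (start + i, " ".join("%02x" % b for b in chunk)))
    return "\n".join(lines)


if __name__ == "__main__":
    # Synthetic block: one valid entry covering bytes 0..3859, then zeroes,
    # which imitates the "tail of the block is zero" pattern in the logs.
    block = bytearray(BLOCK_SIZE)
    struct.pack_into("<IHBB", block, 0, 2, 3860, 1, 2)
    block[8:9] = b"."
    bad = find_bad_entry(bytes(block))
    print(bad)
    if bad:
        print(hex_dump(bytes(block[bad[0]:bad[0] + 32]), start=bad[0]))
```

Running the same check over the dump taken at error time and the one
read back from disk after reboot would show whether the zeroed tail
exists on disk or only in the in-memory copy.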

> The last two errors happened on the same machine, and the same inode! One
> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.

If it's always the same inode, you might want to correlate based on
the pathname.  Is there any commonality across multiple machines in
terms of the directory name, and what application(s) might be touching
that directory?
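
As a starting point for that correlation, a small script could pull
the directory inode numbers out of the collected logs and group them
per machine.  This is a rough sketch: the host names are hypothetical,
and the regular expression only assumes the message format quoted
above.  Inode numbers are per-filesystem, so each one still has to be
mapped to a pathname on its own machine (debugfs's "ncheck" command can
do that) before comparing directories across machines.

```python
import re
from collections import defaultdict

# Matches the ext3_readdir error lines in the format quoted above.
PATTERN = re.compile(
    r"EXT3-fs error \(device (?P<dev>[^)]+)\): ext3_readdir: bad entry in "
    r"directory #(?P<dir>\d+): (?P<reason>.*?) - offset=(?P<offset>\d+)"
)


def group_by_directory(log_lines_by_host):
    """Map (host, device, directory inode) -> list of failing offsets."""
    hits = defaultdict(list)
    for host, lines in log_lines_by_host.items():
        for line in lines:
            m = PATTERN.search(line)
            if m:
                key = (host, m.group("dev"), int(m.group("dir")))
                hits[key].append(int(m.group("offset")))
    return dict(hits)


if __name__ == "__main__":
    logs = {
        "hostA": [  # hypothetical host name
            "EXT3-fs error (device sda7): ext3_readdir: bad entry in directory "
            "#52740880: rec_len is smaller than minimal - offset=4024, inode=0, "
            "rec_len=0, name_len=0",
            "EXT3-fs error (device sda7): ext3_readdir: bad entry in directory "
            "#52740880: rec_len is smaller than minimal - offset=4084, inode=0, "
            "rec_len=0, name_len=0",
        ],
    }
    for key, offsets in group_by_directory(logs).items():
        print(key, offsets)
```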

> Yesterday they upgrade apps on ~30 machines, and soon after that 5 machines
> had filesystem corrupted. However they won't stop upgrading other machines!
> 
> On the other hand, we can hardly reproduce this bug in the lab.

This is why wise cloud companies have a (figurative) big red button to
stop upgrade rollouts (which are always done slowly and gradually),
and processes which make it relatively easy for engineers to be able
to push the "big red button".  I seem to recall the operations
engineer at Facebook giving a talk where he mentioned this.  :-)

Good luck!  Sorry, the pattern of corruption really doesn't sound
familiar to me...

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
