linux-ext4 - Re: ext4: journal has aborted

Message-ID: <20140701084206.GG9743@birch.djwong.org>
Date:	Tue, 1 Jul 2014 01:42:06 -0700
From:	"Darrick J. Wong" <darrick.wong@...cle.com>
To:	Matteo Croce <technoboy85@...il.com>,
	David Jander <david@...tonic.nl>
Cc:	linux-ext4@...r.kernel.org, "Theodore Ts'o" <tytso@....edu>
Subject: Re: ext4: journal has aborted

On Tue, Jul 01, 2014 at 08:26:19AM +0200, David Jander wrote:
> 
> Hi,
> 
> On Mon, 30 Jun 2014 23:30:10 +0200
> Matteo Croce <technoboy85@...il.com> wrote:
> 
> > I was web surfing and using gimp when:
> > 
> > EXT4-fs error (device sda2): ext4_mb_generate_buddy:756: group 199,
> > 9414 clusters in bitmap, 9500 in gd; block bitmap corrupt.
> 
> I was about to post a related question to this list. I am also seeing this
> kind of error when using ext4 on the latest mainline (I began testing with
> 3.15, where I first saw this, and it is still there in 3.16-rc3).
> It happens almost instantly when power-cycling the system (unclean shutdown).
> The next time the system boots, I get these errors.
> 
> AFAICT, you are using a pretty recent kernel. Which version exactly?
> 
> > Aborting journal on device sda2-8.
> > EXT4-fs (sda2): Remounting filesystem read-only

Matteo, could you please post the full dmesg log somewhere?  I'm interested in
what happens before all this happens, because...

> > ------------[ cut here ]------------
> > WARNING: CPU: 6 PID: 4134 at fs/ext4/ext4_jbd2.c:259
> > __ext4_handle_dirty_metadata+0x18e/0x1d0()
> > Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek
> > snd_hda_codec_generic ecb uvcvideo videobuf2_vmalloc videobuf2_memops
> > videobuf2_core videodev ath3k btusb rts5139(C) ctr ccm iTCO_wdt bnep
> > rfcomm bluetooth nls_iso8859_1 vfat fat arc4 intel_rapl
> > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> > snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
> > aesni_intel aes_x86_64 snd_seq_midi snd_seq_midi_event ath9k led_class
> > glue_helper ath9k_common lrw gf128mul ath9k_hw ablk_helper cryptd ath
> > mac80211 snd_rawmidi snd_seq cfg80211 radeon microcode rfkill
> > snd_timer snd_seq_device sr_mod psmouse r8169 snd cdrom i915 lpc_ich
> > soundcore ttm mii mfd_core drm_kms_helper drm intel_gtt agpgart
> > ehci_pci mei_me xhci_hcd tpm_infineon ehci_hcd video mei wmi tpm
> > backlight
> > CPU: 6 PID: 4134 Comm: gimp-2.8 Tainted: G         C    3.15.0 #6
> >  0000000000000009 ffffffff813acbdd 0000000000000000 ffffffff8103de3d
> >  ffff8802365231a0 00000000ffffffe2 0000000000000000 ffff8800b90816c0
> >  ffffffff814205a0 ffffffff8118879e 0000000000000005 ffff8802365231a0
> > Call Trace:
> >  [<ffffffff813acbdd>] ? dump_stack+0x41/0x51
> >  [<ffffffff8103de3d>] ? warn_slowpath_common+0x6d/0x90
> >  [<ffffffff8118879e>] ? __ext4_handle_dirty_metadata+0x18e/0x1d0
> >  [<ffffffff8116e130>] ? ext4_dirty_inode+0x20/0x50
> >  [<ffffffff811903e9>] ? ext4_free_blocks+0x539/0xa40
> >  [<ffffffff8118468b>] ? ext4_ext_remove_space+0x83b/0xe60
> >  [<ffffffff81186a58>] ? ext4_ext_truncate+0x98/0xc0
> >  [<ffffffff8116c985>] ? ext4_truncate+0x2b5/0x300
> >  [<ffffffff8116d3d8>] ? ext4_evict_inode+0x3d8/0x410
> >  [<ffffffff81114a46>] ? evict+0xa6/0x160
> >  [<ffffffff81109346>] ? do_unlinkat+0x186/0x2a0
> >  [<ffffffff8110e51e>] ? SyS_getdents+0xde/0x100
> >  [<ffffffff8110e1d0>] ? fillonedir+0xd0/0xd0
> >  [<ffffffff813b2626>] ? system_call_fastpath+0x1a/0x1f
> > ---[ end trace 795411398e41fbcb ]---
> > EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
> > line 241, credits 91/91, errcode -30
> > EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
> > line 241, credits 91/91, errcode -30<2>EXT4-fs error (device sda2) in
> > ext4_free_blocks:4867: Journal has aborted
> > EXT4-fs error (device sda2): ext4_ext_rm_leaf:2731: inode #8257653:
> > block 6520936: comm gimp-2.8: journal_dirty_metadata failed: handle
> > type 5 started at line 241, credits 91/91, errcode -30
> > EXT4-fs error (device sda2) in ext4_ext_remove_space:3018: Journal has
> > aborted
> > EXT4-fs error (device sda2) in ext4_ext_truncate:4666: Journal has
> > aborted
> > EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> > has aborted
> > EXT4-fs error (device sda2) in ext4_truncate:3788: Journal has aborted
> > EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> > has aborted
> > EXT4-fs error (device sda2) in ext4_orphan_del:2684: Journal has aborted
> > EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> > has aborted
> 
> I did not get these errors. I suspect this may be a consequence of FS
> corruption due to a bug in ext4.
> 
> Here's why I suspect a bug:
> 
> I am running latest git head (3.16-rc3+ as of yesterday) on an ARM system with
> eMMC flash. The eMMC is formatted in SLC mode ("enhanced" mode according to
> eMMC 4.41) and "reliable-writes" are enabled, so power-cycling should not
> cause FS corruption in presence of a journal.
> 
> I can format the eMMC device either as EXT3 or EXT4 for the test. After
> formatting and writing the rootfs to the partition I can boot successfully in
> either situation. Once booted from eMMC, I start bonnie++ (to just stress the
> FS for a while), and after a minute or so the board is power-cycled while
> bonnie++ is still running.
> 
> Next time I boot the situation is this:
> 
> With EXT3: All seems fine, journal is replayed, no errors. I can repeat this as
> many times as I want, FS stays consistent.
> 
> With EXT4: After just one power cycle I start getting this:
> 
> [    7.603871] EXT4-fs error (device mmcblk0p2): ext4_mb_generate_buddy:757: group 1, 8542 clusters in bitmap, 8550 in gd; block bitmap corrupt.
> [    7.616743] JBD2: Spotted dirty metadata buffer (dev = mmcblk0p2, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

I've been seeing this same set of symptoms with 3.15.0 on various SSDs (Samsung
840 Pro, Crucial M4).  It seems that something (upstart?) is holding open some
file or other during poweroff, which means that the root fs can't be unmounted
or even remounted ro.  I also noticed that the next time the system comes up,
the kernel tells me that it has to process the inode orphan list as part of
recovery.

Shortly after the orphan list gets processed, I get that "block bitmap corrupt"
message and the FS goes ro.  A subsequent fsck run reveals that the block bitmap
is indeed incorrect in that block group, and when I dump the blocks that are
incorrect in the bitmap (debugfs bd), I see what could be some kind of upstart
log file.  Either way, I suspect some bug in orphan processing.
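
For anyone comparing logs, here is a rough Python sketch that pulls the group
number and the two counts out of the ext4_mb_generate_buddy messages quoted in
this thread.  It is not part of any ext4 tool: the function name
parse_buddy_errors and the regular expression are assumptions made up for
illustration, and it expects each message on a single line, as in raw dmesg
output.  As far as I understand the message, both numbers are free-cluster
counts, so the delta is the number of clusters the group descriptor considers
free but the bitmap marks as in use.

```python
import re

# Hypothetical pattern matching the ext4_mb_generate_buddy message format
# seen in this thread; an optional dmesg timestamp prefix is simply ignored.
BUDDY_RE = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"ext4_mb_generate_buddy:\d+: group (?P<group>\d+),\s+"
    r"(?P<bitmap>\d+) clusters in bitmap, (?P<gd>\d+) in gd"
)


def parse_buddy_errors(log_text):
    """Return one dict per bitmap/group-descriptor mismatch in log_text."""
    results = []
    for match in BUDDY_RE.finditer(log_text):
        bitmap = int(match.group("bitmap"))
        gd = int(match.group("gd"))
        results.append({
            "device": match.group("device"),
            "group": int(match.group("group")),
            "bitmap": bitmap,
            "gd": gd,
            # Positive delta: descriptor claims more free clusters than
            # the bitmap shows.
            "delta": gd - bitmap,
        })
    return results


if __name__ == "__main__":
    sample = (
        "[    7.603871] EXT4-fs error (device mmcblk0p2): "
        "ext4_mb_generate_buddy:757: group 1, 8542 clusters in bitmap, "
        "8550 in gd; block bitmap corrupt.\n"
    )
    for entry in parse_buddy_errors(sample):
        print(entry)
```

For David's mmcblk0p2 message the delta is 8550 - 8542 = 8 clusters; for
Matteo's sda2 message it is 9500 - 9414 = 86.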

<shrug> I don't know if this is specific to SSDs or spinning rust.  Right now
I've simply rigged the initramfs to e2fsck -p the root fs before mounting it,
which seems(?) to have patched around it for now.
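
One thing to watch with that kind of workaround is that the e2fsck exit status
is a bit mask, not a simple pass/fail value.  A minimal Python sketch of
decoding it follows, using the bit values documented in the e2fsck(8) man page.
The helper names decode_e2fsck_status and safe_to_mount are hypothetical, and
the policy in safe_to_mount (only 0 and 1 count as mountable) is an assumption
for illustration, not a description of the actual initramfs hook, which would
normally be shell.

```python
# Exit status bits as documented in e2fsck(8).
E2FSCK_FLAGS = {
    1: "filesystem errors corrected",
    2: "filesystem errors corrected, system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(status):
    """Translate an e2fsck exit status into a list of descriptions."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_FLAGS.items()) if status & bit]


def safe_to_mount(status):
    """Assumed policy: mount only if nothing beyond bit 1 is set."""
    return (status & ~1) == 0


if __name__ == "__main__":
    for status in (0, 1, 4, 5):
        print(status, decode_e2fsck_status(status), safe_to_mount(status))
```

With this policy, a status of 1 (errors found and fixed by -p) still allows the
mount, while 4 or higher means the preen pass could not repair the filesystem
and a manual fsck is needed.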

> If I continue the test, it doesn't take long before serious corruption
> starts occurring.

You're getting actual FS data corruption too?  Or more of those messages?

--D
> 
> Again, with EXT3 I am unable to detect any problems.
> 
> Best regards,
> 
> -- 
> David Jander
> Protonic Holland.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html