linux-ext4 - ext4_mb_generate_buddy: 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt

Date:	Thu, 31 Jul 2014 12:51:38 -0700
From:	Andy Isaacson <adi@...apodia.org>
To:	Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: ext4_mb_generate_buddy: 18745 clusters in bitmap, 18746 in gd; block
 bitmap corrupt

Kernel 3.15.5 amd64, ext4 rootfs on LVM on LUKS on a Samsung SSD 840 EVO
in a Thinkpad T440s.

The system has been quite stable for ~9 months, always running a very
recent stable tree.

The kernel panicked this morning, probably due to an external drive
triggering UAS errors in 3.15 (alas, the syslog didn't make it to disk).
The system remained powered on for >30 seconds after the panic before I
finally shut it down by holding down the power button, so there should
not have been any writes in flight to the SSD.

After reboot, rootfs was deeply unhappy:

[    7.248400] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
[    7.248404] EXT4-fs (dm-1): write access will be enabled during recovery
[    7.303580] EXT4-fs (dm-1): orphan cleanup on readonly fs
[    7.326277] EXT4-fs (dm-1): 10 orphan inodes deleted
[    7.326280] EXT4-fs (dm-1): recovery complete
[    7.380065] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
...
[    8.829221] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
...
[   39.354383] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 835, 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt.
[   39.354389] Aborting journal on device dm-1-8.
[   39.354478] EXT4-fs (dm-1): Remounting filesystem read-only
[   39.354485] ------------[ cut here ]------------
[   39.354517] WARNING: CPU: 0 PID: 2312 at fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]()
[   39.354519] Modules linked in: snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic nls_utf8 nls_cp437 vfat fat ext2 joydev uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev arc4 media ecb btusb bluetooth 6lowpan_iphc x86_pkg_temp_thermal intel_rapl kvm_intel iwlmvm kvm mac80211 pcspkr psmouse evdev serio_raw iwlwifi snd_hda_intel snd_hda_controller cfg80211 i2c_i801 snd_hda_codec snd_hwdep snd_pcm snd_seq i915 snd_seq_device thinkpad_acpi snd_timer nvram tpm_tis rfkill battery tpm ac drm_kms_helper drm snd video acpi_cpufreq intel_gtt shpchp i2c_algo_bit intel_smartconnect i2c_core soundcore button processor loop fuse autofs4 ext4 crc16 jbd2 mbcache hid_generic usbhid hid dm_crypt dm_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_common rtsx_pci_sdmmc mmc_core ahci e1000e ptp pps_core aesni_intel libahci aes_x86_64 glue_helper libata lrw gf128mul ablk_helper cryptd scsi_mod ehci_pci ehci_hcd xhci_hcd rtsx_pci mfd_core usbcore thermal usb_common thermal_sys
[   39.354598] CPU: 0 PID: 2312 Comm: systemd-tmpfile Not tainted 3.15.5 #19
[   39.354600] Hardware name: LENOVO 20AQCTO1WW/20AQCTO1WW, BIOS GJET61WW (2.11 ) 10/02/2013
[   39.354602]  0000000000000000 ffff880213c67b78 ffffffff81378c2a 0000000000000000
[   39.354605]  ffff880213c67bb0 ffffffff8103dc62 ffffffffa03a3d33 ffff8800d607eea0
[   39.354608]  00000000ffffffe2 0000000000000000 ffff8800d60a3030 ffff880213c67bc0
[   39.354611] Call Trace:
[   39.354617]  [<ffffffff81378c2a>] dump_stack+0x45/0x56
[   39.354621]  [<ffffffff8103dc62>] warn_slowpath_common+0x7f/0x98
[   39.354643]  [<ffffffffa03a3d33>] ? __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
[   39.354648]  [<ffffffff8103dd2e>] warn_slowpath_null+0x1a/0x1c
[   39.354666]  [<ffffffffa03a3d33>] __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
[   39.354686]  [<ffffffffa03aa380>] ext4_free_blocks+0x713/0x809 [ext4]
[   39.354704]  [<ffffffffa03a0639>] ext4_ext_remove_space+0x698/0xbdc [ext4]
[   39.354723]  [<ffffffffa03af7b1>] ? __es_remove_extent+0x46/0x27d [ext4]
[   39.354741]  [<ffffffffa03a246f>] ext4_ext_truncate+0x89/0xad [ext4]
[   39.354756]  [<ffffffffa0383024>] ext4_truncate+0x199/0x281 [ext4]
[   39.354770]  [<ffffffffa038379b>] ext4_evict_inode+0x1a7/0x2d0 [ext4]
[   39.354775]  [<ffffffff8113f390>] evict+0xa8/0x14c
[   39.354778]  [<ffffffff8113fa75>] iput+0x12d/0x136
[   39.354783]  [<ffffffff81136d5b>] do_unlinkat+0x14e/0x1f4
[   39.354788]  [<ffffffff8112bfe9>] ? ____fput+0xe/0x10
[   39.354794]  [<ffffffff8105659d>] ? task_work_run+0x87/0x98
[   39.354798]  [<ffffffff81137b98>] SyS_unlinkat+0x29/0x2b
[   39.354802]  [<ffffffff81137b98>] ? SyS_unlinkat+0x29/0x2b
[   39.354807]  [<ffffffff8137d0d2>] system_call_fastpath+0x16/0x1b
[   39.354810] ---[ end trace 80365b8da4738adc ]---
[   39.354814] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30
[   39.354817] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30<2>[   39.354821] EXT4-fs error (device dm-1) in ext4_free_blocks:4867: Journal has aborted
[   39.354906] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[   39.354976] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[   39.355042] EXT4-fs error (device dm-1) in ext4_ext_remove_space:3018: Journal has aborted
[   39.355109] EXT4-fs error (device dm-1) in ext4_ext_truncate:4666: Journal has aborted
[   39.355179] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[   39.355248] EXT4-fs error (device dm-1) in ext4_truncate:3790: Journal has aborted
[   39.355314] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[   39.355382] EXT4-fs error (device dm-1) in ext4_orphan_del:2684: Journal has aborted
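
For readers unfamiliar with the first error above: the message compares the
number of free clusters counted from the group's on-disk block bitmap with
the free count stored in the group descriptor ("gd"). The following is only
a minimal sketch of that comparison, not the kernel's code. The function
names and the make_bitmap helper are hypothetical, and the bit order
(least significant bit first within each byte) is an assumption.

```python
def count_free_clusters(bitmap: bytes, clusters_in_group: int) -> int:
    """Count zero bits (free clusters) among the first clusters_in_group bits."""
    free = 0
    for i in range(clusters_in_group):
        # Assumption: bit i lives in byte i // 8, least significant bit first.
        if not (bitmap[i // 8] >> (i % 8)) & 1:
            free += 1
    return free


def check_group(bitmap: bytes, gd_free: int, clusters_in_group: int):
    """Return a message if the bitmap and descriptor disagree, else None."""
    counted = count_free_clusters(bitmap, clusters_in_group)
    if counted != gd_free:
        return (f"{counted} clusters in bitmap, {gd_free} in gd; "
                "block bitmap corrupt")
    return None


def make_bitmap(used: int, total: int) -> bytes:
    """Hypothetical helper: a bitmap with the first `used` bits set."""
    bits = bytearray(total // 8)
    for i in range(used):
        bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)


# Toy data shaped like group 835: 32768 clusters, 18745 of them free,
# while the descriptor claims 18746.
toy = make_bitmap(32768 - 18745, 32768)
print(check_group(toy, 18746, 32768))
```

A one-cluster disagreement like this is enough to trigger the error, and
with errors=remount-ro it also causes the read-only remount seen above.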


I rebooted again and the rootfs came up dirty, of course, but the journal
seems sadder than expected:

[   12.465200] EXT4-fs (dm-1): warning: mounting fs with errors, running e2fsck is recommended
[   12.465403] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
[   12.504024] systemd-journald[230]: Received request to flush runtime journal from PID 1
[   12.506433] EXT4-fs error (device dm-1): ext4_free_inode:323: comm systemd-tmpfile: bit already cleared for inode 3801146
[   12.506527] Aborting journal on device dm-1-8.
[   12.506950] EXT4-fs (dm-1): Remounting filesystem read-only
[   12.506957] EXT4-fs error (device dm-1) in ext4_evict_inode:310: IO failure
[   12.506991] EXT4-fs error (device dm-1): mb_free_blocks:1441: group 464, block 15212940:freeing already freed block (bit 8588); block bitmap corrupt.
[   12.507004] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 464, 24180 clusters in bitmap, 24181 in gd; block bitmap corrupt.
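
As a sanity check, the group and bit numbers in these messages can be
reproduced from the block and inode numbers. This is a sketch under stated
assumptions: 32768 blocks per group (the usual value for 4 KiB blocks) and a
first data block of 0 are not reported anywhere in this message, and the
inodes-per-group value is derived from the block and inode totals in the
e2fsck summary later in this message.

```python
BLOCKS_PER_GROUP = 32768      # assumption: default for a 4 KiB block size
FIRST_DATA_BLOCK = 0          # assumption: typical when block size > 1 KiB

TOTAL_BLOCKS = 119841792      # from the e2fsck summary line
TOTAL_INODES = 29966336       # from the e2fsck summary line

# Number of groups, rounding up, then inodes per group.
GROUPS = -(-TOTAL_BLOCKS // BLOCKS_PER_GROUP)
INODES_PER_GROUP = TOTAL_INODES // GROUPS


def block_to_group(block: int):
    """Return (group, bit offset within the group's block bitmap)."""
    rel = block - FIRST_DATA_BLOCK
    return rel // BLOCKS_PER_GROUP, rel % BLOCKS_PER_GROUP


def inode_to_group(ino: int):
    """Return (group, bit offset within the group's inode bitmap)."""
    return (ino - 1) // INODES_PER_GROUP, (ino - 1) % INODES_PER_GROUP


print(GROUPS, INODES_PER_GROUP)      # 3658 8192
print(block_to_group(15212940))      # (464, 8588)
print(inode_to_group(3801146))       # (464, 57)
```

Under these assumptions block 15212940 lands in group 464 at bit 8588,
matching the mb_free_blocks message, and inode 3801146 falls in the same
group.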


fsck claims to have fixed it but on reboot it blows up the same way:

e2fsck 1.42.11 (09-Jul-2014)
/dev/mapper/t440s-root: recovering journal
/dev/mapper/t440s-root contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Unconnected directory inode 3801092 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801093 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801106 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801107 (/lost+found/#3801106/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801111 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801116 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801118 (/tmp/???)
Connect to /lost+found<y>? yes
Pass 4: Checking reference counts
Inode 3801089 ref count is 61, should be 42.  Fix<y>? yes
Inode 3801092 ref count is 3, should be 2.  Fix<y>? yes
Inode 3801093 ref count is 3, should be 2.  Fix<y>? yes
Unattached inode 3801099
Connect to /lost+found<y>? yes
Inode 3801099 ref count is 2, should be 1.  Fix<y>? yes
Unattached inode 3801103
Connect to /lost+found<y>? yes
Inode 3801103 ref count is 2, should be 1.  Fix<y>? yes
Inode 3801106 ref count is 3, should be 2.  Fix<y>? yes
Inode 3801107 ref count is 3, should be 2.  Fix<y>? yes
Inode 3801111 ref count is 3, should be 2.  Fix<y>? yes
Unattached inode 3801112
Connect to /lost+found<y>? yes
Inode 3801112 ref count is 2, should be 1.  Fix<y>? yes
Inode 3801116 ref count is 3, should be 2.  Fix<y>? yes
Inode 3801118 ref count is 3, should be 2.  Fix<y>? yes

Pass 5: Checking group summary information
Block bitmap differences:  -(15212585--15212586) -(15212756--15212757) -15212761 -15212765 -15212883 -15212886 -(15212888--15212891) -15212905 -15212907 -15212911 -(15212923--15212924) -15212938 -15212940 -15213385 +15237175 +(27371328--27371391) +(27427126--27427191) +(27427648--27427711) +82127850
Fix<y>? yes
Free blocks count wrong for group #464 (24160, counted=24180).
Fix<y>? yes
Free blocks count wrong for group #465 (25520, counted=25827).
Fix<y>? yes
Free blocks count wrong for group #835 (18809, counted=18745).
Fix<y>? yes
Free blocks count wrong for group #837 (23154, counted=23024).
Fix<y>? yes
Free blocks count wrong for group #2506 (28536, counted=28535).
Fix<y>? yes
Free blocks count wrong for group #2842 (2415, counted=2478).
Fix<y>? yes
Free blocks count wrong for group #2844 (27816, counted=28135).
Fix<y>? yes
Free blocks count wrong (108044209, counted=108044918).
Fix<y>? yes
Inode bitmap differences:  -3801122 -3801126 -(3801128--3801129) -3801134 -3801137 -(3801139--3801142) -3801146 -(3801149--3801150) -(3801152--3801154) -3801158 -3801160 -3801168 -(3801176--3801179) -(3801182--3801183) -3801186 -3801189 -3801193 -(3801199--3801200) -(3801203--3801205) -(3801208--3801211) -(3801213--3801214) -3801216 -3801220 -(3801223--3801224) -3801226 -(3801228--3801232) -(3801238--3801239) -3801738 -3801753 -3801755 -(3801758--3801759) -(3801762--3801763) -3801769 -3801792 -(3801805--3801806) -3801809 -(3801813--3801817) -3801822 -(3801826--3801828) -(3801832--3801834) -(3801836--3801837) -(3801842--3801843) -3801848 -3801853 -3801857 -(3801863--3801864) -3801871 -(3801873--3801876) -3801879 -3801881 -3801883 -3801885 -(3801888--3801889) -(3801891--3801892) -(3801896--3801897) -3801899 -(3801901--3801902) -(3801905--3801906) -(3801909--3801910) -3801912 -3801914 -(3801920--3801921) -(3801923--3801924) -3801926 -3802690 -3805907
Fix<y>? yes
Free inodes count wrong for group #464 (6581, counted=6696).
Fix<y>? yes
Directories count wrong for group #464 (366, counted=346).
Fix<y>? yes
Free inodes count wrong (29348331, counted=29348445).
Fix<y>? yes

/dev/mapper/t440s-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/t440s-root: ***** REBOOT LINUX *****
/dev/mapper/t440s-root: 617891/29966336 files (0.7% non-contiguous), 11796874/119841792 blocks
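
The Pass 5 block bitmap differences can be tallied per block group to see
which groups they touch. This is a rough sketch: it assumes 32768 blocks per
group (not reported in this message) and assumes that "-" means a block is
marked in use in the on-disk bitmap but not actually in use, with "+" the
reverse, which is my understanding of e2fsck's notation. The tally function
and its regular expression are illustrative only.

```python
import re
from collections import Counter

# Copied from the "Block bitmap differences" line above.
DIFFS = ("-(15212585--15212586) -(15212756--15212757) -15212761 -15212765 "
         "-15212883 -15212886 -(15212888--15212891) -15212905 -15212907 "
         "-15212911 -(15212923--15212924) -15212938 -15212940 -15213385 "
         "+15237175 +(27371328--27371391) +(27427126--27427191) "
         "+(27427648--27427711) +82127850")

# Matches "+N", "-N", "+(A--B)" and "-(A--B)".
TOKEN = re.compile(r"([+-])\(?(\d+)(?:--(\d+))?\)?")


def tally(diffs: str, blocks_per_group: int = 32768) -> dict:
    """Net change in used blocks per group implied by the differences."""
    net = Counter()
    for sign, start, end in TOKEN.findall(diffs):
        first, last = int(start), int(end or start)
        for block in range(first, last + 1):
            net[block // blocks_per_group] += 1 if sign == "+" else -1
    return dict(net)


print(tally(DIFFS))
# {464: -20, 465: 1, 835: 64, 837: 130, 2506: 1}
```

These net changes line up with the free-block corrections for groups 464
(+20 free), 835 (-64), 837 (-130) and 2506 (-1). The corrections for groups
465, 2842 and 2844 are not explained by the bitmap differences alone.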


After fsck reports clean, reboot still shows failures:


[    7.378361] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
[    7.378365] EXT4-fs (dm-1): write access will be enabled during recovery
[    7.384663] EXT4-fs (dm-1): recovery complete
[    7.386479] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)

[    7.710694] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro

[    9.820974] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 465, 29923 clusters in bitmap, 29922 in gd; block bitmap corrupt.
[    9.820975] Aborting journal on device dm-1-8.
[    9.821614] EXT4-fs (dm-1): Remounting filesystem read-only


Similar problems recur on every reboot.

SMART stats on the SSD do not indicate any signs of failing hardware:

Device Model:     Samsung SSD 840 EVO 500GB
Serial Number:    S1DHNSAD929048M
LU WWN Device Id: 5 002538 8a00452f8
Firmware Version: EXT0BB0Q
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jul 31 12:36:59 2014 PDT
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1693
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       165
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       2
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   069   053   000    Old_age   Always       -       31
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       7
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       2102932957
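
To put the Total_LBAs_Written figure in context, it can be converted to
bytes. This is a back-of-the-envelope sketch which assumes attribute 241
counts 512-byte LBAs, matching the logical sector size reported above; the
drive's documentation would be needed to confirm that unit.

```python
SECTOR_BYTES = 512                      # assumption: unit of attribute 241
TOTAL_LBAS_WRITTEN = 2102932957         # raw value of attribute 241
USER_CAPACITY_BYTES = 500_107_862_016   # from the "User Capacity" line

bytes_written = TOTAL_LBAS_WRITTEN * SECTOR_BYTES
drive_writes = bytes_written / USER_CAPACITY_BYTES

print(f"{bytes_written / 1e12:.2f} TB written, "
      f"{drive_writes:.2f} full-drive writes")
# 1.08 TB written, 2.15 full-drive writes
```

Under that assumption the drive has seen roughly two full-capacity writes.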

-andy