Message-ID: <CADDb1s3HYDvb51Ngrwk82gkpbUWg1bRo7kaUmbGRmb0g_9JKgw@mail.gmail.com>
Date: Wed, 16 Apr 2014 10:30:10 +0530
From: Amit Sahrawat <amit.sahrawat83@...il.com>
To: "Theodore Ts'o" <tytso@....edu>,
Amit Sahrawat <amit.sahrawat83@...il.com>,
Jan Kara <jack@...e.cz>, darrick.wong@...cle.com,
linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>,
Namjae Jeon <linkinjeon@...il.com>
Subject: Re: Ext4: deadlock occurs when running fsstress and ENOSPC errors are seen.
Thanks, Ted, for the detailed reply.
We could not retain the original HDD on which we hit the issue. In
order to replicate the problem, we did the following:
1) Make a 250MB partition
2) Format the partition with a 4K block size
3) Get the block group information using 'dumpe2fs /dev/sdb1':
Group 0: (Blocks 0-32767)
Checksum 0x89ae, unused inodes 29109
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-15
Block bitmap at 16 (+16), Inode bitmap at 32 (+32)
Inode table at 48-957 (+48)
26826 free blocks, 29109 free inodes, 2 directories, 29109 unused inodes
Free blocks: 5942-32767
Free inodes: 12-29120
Group 1: (Blocks 32768-58226) [INODE_UNINIT]
Checksum 0xb43a, unused inodes 29120
Backup superblock at 32768, Group descriptors at 32769-32769
Reserved GDT blocks at 32770-32783
Block bitmap at 17 (bg #0 + 17), Inode bitmap at 33 (bg #0 + 33)
Inode table at 958-1867 (bg #0 + 958)
25443 free blocks, 29120 free inodes, 0 directories, 29120 unused inodes
Free blocks: 32784-58226
Free inodes: 29121-58240
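(Note: group 1's free-block range 32784-58226 works out to
58226 - 32784 + 1 = 25443 blocks, which is exactly the '25443 in gd'
figure that shows up in the errors below.)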
4) Corrupt block group 1 by overwriting its block bitmap (block 17 in
the dumpe2fs output above) with a file full of 1s, using 'dd':
dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
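To double-check that the overwrite landed on group 1's block bitmap
before mounting, the block can be read back (a verification step added
here for clarity; same device and offsets as above):
dd if=/dev/sdb1 bs=4096 skip=17 count=1 | hexdump -C | head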
After this, mount the partition, create a few files of random sizes,
and then run fsstress. Shortly before the hang, we get logs like:
#fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
seed = 582332
EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.
EXT4-fs (sdb1): delayed block allocation failed for inode 26 at
logical offset 181 with max blocks 2 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=51
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=9
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 101 at
logical offset 198 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=43
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=1
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 94 at
logical offset 450 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=36
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=12
EXT4-fs (sdb1): i_reserved_meta_blocks=2
…
EXT4-fs (sdb1): error count: 3
EXT4-fs (sdb1): initial error at 545: ext4_mb_generate_buddy:743
EXT4-fs (sdb1): last error at 42: ext4_mb_generate_buddy:743
…
Yes, we are running this on an ARM target, and we did not set the time
before running these operations. So the timestamps actually correspond to:
date -d @545
Thu Jan 1 05:39:05 IST 1970
date -d @42
Thu Jan 1 05:30:42 IST 1970
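For completeness, these error fields live in the superblock, so with a
recent enough e2fsprogs (an assumption on our side) they also show up
in the 'dumpe2fs -h' output you asked about:
dumpe2fs -h /dev/sdb1 | grep -i error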
Yes, we are running kernel version 3.8 and have applied the patches
from Darrick to fix the initial looping in ext4_da_writepages.
As you suggested, we also made a change to mark the FS read-only in
case of corruption. The changes are:
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c0fbd96..04f3a66 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1641,8 +1641,10 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
 			 mpd->b_size >> mpd->inode->i_blkbits, err);
 		ext4_msg(sb, KERN_CRIT,
 			"This should not happen!! Data will be lost\n");
-		if (err == -ENOSPC)
+		if (err == -ENOSPC) {
 			ext4_print_free_blocks(mpd->inode);
+			mpd->inode->i_sb->s_flags |= MS_RDONLY;
+		}
 	}
 	/* invalidate all the pages */
 	ext4_da_block_invalidatepages(mpd);
@@ -2303,6 +2305,36 @@ out:
 	return ret;
 }
 
+static void ext4_invalidate_mapping(struct address_space *mapping)
+{
+	struct pagevec pvec;
+	unsigned int i;
+	pgoff_t index = 0;
+	struct inode *inode = mapping->host;
+
+	pagevec_init(&pvec, 0);
+	while (pagevec_lookup_tag(&pvec, mapping, &index, PAGECACHE_TAG_DIRTY,
+				  PAGEVEC_SIZE)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+
+			lock_page(page);
+			spin_lock_irq(&mapping->tree_lock);
+			if (test_bit(PG_dirty, &page->flags)) {
+				printk(KERN_ERR "INO[%lu], i(%d)\n",
+				       inode->i_ino, i);
+				radix_tree_tag_clear(&mapping->page_tree,
+						     page_index(page),
+						     PAGECACHE_TAG_DIRTY);
+				spin_unlock_irq(&mapping->tree_lock);
+				clear_page_dirty_for_io(page);
+				goto unlock_page;
+			}
+			spin_unlock_irq(&mapping->tree_lock);
+unlock_page:
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+	}
+}
+
 static int ext4_da_writepages(struct address_space *mapping,
 			      struct writeback_control *wbc)
@@ -2419,6 +2451,14 @@ retry:
 			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
 				 "%ld pages, ino %lu; err %d", __func__,
 				 wbc->nr_to_write, inode->i_ino, ret);
+
+			if (ret == -EROFS) {
+				ext4_msg(inode->i_sb, KERN_CRIT,
+					 "This is probably a result of corruption: "
+					 "FS was marked RO in between writepages\n");
+				/* Invalidate mapping if RO encountered during writepages */
+				ext4_invalidate_mapping(mapping);
+				invalidate_inode_pages2(mapping);
+			}
 			blk_finish_plug(&plug);
 			goto out_writepages;
 		}
After making these changes, as soon as corruption is detected the FS is
marked read-only and all pending dirty writes are discarded, so the
hang is no longer seen.
But we have doubts about the possible side effects of this code. Could
you please give your inputs on this? It would be of immense help.
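In case it helps frame the question, here is a minimal sketch of an
alternative we are weighing instead of setting MS_RDONLY by hand
(untested on our side; it assumes ext4_error() records the error in
the superblock and applies the errors= mount policy, i.e. continue,
remount-ro, or panic):

/* Untested sketch: report the inconsistency through the standard
 * ext4 error path so that the errors= mount policy decides whether
 * to continue, remount read-only, or panic. Placement would be the
 * same ENOSPC branch patched above. */
if (err == -ENOSPC) {
	ext4_print_free_blocks(mpd->inode);
	ext4_error(mpd->inode->i_sb,
		   "delayed allocation failed; free space accounting inconsistent");
}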
Thanks & Regards,
Amit Sahrawat
On Tue, Apr 15, 2014 at 6:17 PM, Theodore Ts'o <tytso@....edu> wrote:
> On Tue, Apr 15, 2014 at 01:24:26PM +0530, Amit Sahrawat wrote:
>> Initially, in the normal write path when the disk was almost full,
>> we got a hang on 'sync' because the flusher (which was busy in
>> writepages) was not responding. Before the hung task, we also found
>> logs like:
>>
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
>> clusters in bitmap, 58339 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
>> clusters in bitmap, 3 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
>> clusters in bitmap, 1 in gd
>
> These errors indicate that several block groups have corrupt block
> bitmaps: block groups #1493, #1000, and #1425. The fact that there
> are 0 free blocks/clusters in the bitmap indicates that the bitmap
> was all zeros, which could be the potential cause of the
> corruption.
>
> The other thing which is funny is the number of free blocks/clusters
> being greater than 32768 in block group #1493. Assuming a 4k page
> size, that shouldn't be possible. Can you send the output of
> "dumpe2fs -h /dev/sdXX" so we can take a look at the file system parameters?
>
> How long before the hung task did you see these messages? I normally
> recommend that the file system be set up to either panic the system
> or force the file system to be remounted read-only when EXT4-fs error
> messages are reported, since that means that the file system is
> corrupted, and further operation can cause more data to be lost.
>
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>>
>> EXT4-fs (sda1): error count: 58
>> EXT4-fs (sda1): initial error at 607: ext4_mb_generate_buddy:742
>> EXT4-fs (sda1): last error at 58: ext4_mb_generate_buddy:742
>
> The "607" and "58" in the "at 607" and "at 58" are normally supposed
> to be a unix time_t value. That is, it's normally a number like:
> 1397564866, and it can be decoded via:
>
> % date -d @1397564866
> Tue Apr 15 08:27:46 EDT 2014
>
> The fact that these numbers are numerically so small means that the
> time wasn't set correctly on your system. Was this a test system
> running under kvm without a proper real-time clock?
>
>> When we analysed the problem, we found it occurred in the writepages
>> path in ext4, because of the difference between the free-block count
>> reported by the cluster bitmap and the free-block count reported by
>> the group descriptor.
>
> Yes, indicating that the file system was corrupt.
>
>> During ext4_fill_super, ext4 calculates the number of free blocks by
>> reading all the group descriptors in ext4_count_free_clusters and
>> stores the result in the percpu counter s_freeclusters_counter:
>>
>> ext4_count_free_clusters:
>> 	desc_count = 0;
>> 	for (i = 0; i < ngroups; i++) {
>> 		gdp = ext4_get_group_desc(sb, i, NULL);
>> 		if (!gdp)
>> 			continue;
>> 		desc_count += ext4_free_group_clusters(sb, gdp);
>> 	}
>> 	return desc_count;
>>
>> During the write_begin call, ext4 checks this s_freeclusters_counter
>> to decide whether free blocks are present. When the free-block count
>> reported by the group descriptors is greater than the actual count
>> reported by the bitmaps, a call to write_begin can still succeed even
>> though the bitmaps show 0 free blocks.
>
> Yes. We used to have code that would optionally read every single
> bitmap, and verify that the descriptor counts match the values in the
> bitmap. However, that was expensive, and wasn't a full check of all
> possible file system inconsistencies that could lead to data loss. So
> we ultimately removed this code. If the file system is potentially
> corrupt, it is the system administrator's responsibility to force an
> fsck run to make sure the file system data structures are consistent.
>
>> While searching for related problems in this path, we found the
>> patch set from Darrick that revolves around this problem:
>> ext4: error out if verifying the block bitmap fails
>> ext4: fix type declaration of ext4_validate_block_bitmap
>> ext4: mark block group as corrupt on block bitmap error
>> ext4: mark block group as corrupt on inode bitmap error
>> ext4: mark group corrupt on group descriptor checksum
>> ext4: don't count free clusters from a corrupt block group
>>
>> After applying the patch set and verifying on a similar setup, we
>> ran 'fsstress'. But now it hangs at different points.
>>
>> In the current logs we got:
>> EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
>> 20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
>> JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>
> OK, what version of the kernel are you using? The patches that you
> reference above have been in the upstream kernel since 3.12, so I'm
> assuming you're not using the latest upstream kernel, but rather an
> older kernel with some patches applied. Hmmm, skipping ahead:
>
>> Kernel Version: 3.8
>> Test command:
>> fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
>
> There is clearly either some kernel bug or hardware problem which is
> causing the file system corruption. Given that you are using a much
> older kernel, it's quite likely that there is some bug that has been
> fixed in a later version of the kernel (although we can't really rule
> out a hardware problem without knowing much more about your setup).
>
> Unfortunately, there has been a *large* number of changes since
> version 3.8, and I can't remember all of the changes and bug fixes
> that we might have made in the past year or more (v3.8 dates from
> March 2013).
>
> Something that might be helpful is for you to use xfstests. That's a
> much more thorough set of tests which we've been using, so if you must
> use an antique version of the kernel, it will probably serve you much
> better. It includes fsstress, and much more besides. More importantly,
> commit logs sometimes identify the xfstests failure that a given fix
> addresses, so that might help you find the bug fix that you need to
> backport.
>
> For your convenience, there is a simple test framework that makes it
> relatively easy to build and run xfstests under KVM. You can find it
> here:
>
> git://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git
>
> See the documentation found at:
>
> https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/README
>
> for more details.
>
> I hope this helps,
>
> - Ted