linux-ext4 - Re: [Suggested Ex4 Patch] JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 1766784). There's a risk of filesystem corruption in case of system crash.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CACM3AawAJMKUYo1SA9PcacvdznJbm4pv9OBG27MEeC_0X_e+pA@mail.gmail.com>
Date:   Tue, 16 May 2017 16:30:04 +0100
From:   kiki good <jqiaoulk@...il.com>
To:     linux-ext4@...r.kernel.org
Subject: Re: [Suggested Ex4 Patch] JBD2: Spotted dirty metadata buffer (dev =
 sda1, blocknr = 1766784). There's a risk of filesystem corruption in case of
 system crash.

Hi:

It might be fixed at this commit :
https://github.com/torvalds/linux/commit/3b136499e906460919f0d21a49db1aaccf0ae963

On Thu, May 11, 2017 at 11:41 PM, kiki good <jqiaoulk@...il.com> wrote:
> Hi:
>
>
> I recently see the error message below when doing the following mount
> option in hard drive, after a while, Kernel reports BUG_ON.
>
> The standard kernel version is 3.10.27. This issue can been seen in
> many other customer devices with this kernel version.
>
>
> There are several things worthwhile to mention beforehand:
>
> 1. I don't believe its a hardware related issue since it could be
> duplicated in our many devices.
>
> 2. I don't think it is a real Filesystem corruption since these box
> never reboot or remount during the test.
>
> 3. Devices are under heavy memory pressure when this issue occur,
>
> 4. It is not straightforward and might involve tremendous work to
> switch to the latest kernel version.
>
>
> Mount Option:
>
> /dev/sda1 on /mnt/ type ext4
> (rw,nosuid,relatime,nodelalloc,journal_checksum,nobarrier,data=journal)
>
>
>
> "
>
> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 1766784).
> There's a risk of filesystem corruption in case of system crash.
>
> ...........................................................
>
> kernel BUG at fs/jbd2/commit.c:1059!
> [<801b0e10>] (jbd2_journal_commit_transaction+0x1304/0x177c) from
> [<801b569c>] (kjournald2+0xa8/0x248)
> [<801b569c>] (kjournald2+0xa8/0x248) from [<800414c8>] (kthread+0xa0/0xac)
> [<800414c8>] (kthread+0xa0/0xac) from [<8000dc98>] (ret_from_fork+0x14/0x3c)
>
> "
>
> The code is listed below at Line 1059:
>
>
> jbd2_journal_commit_transaction
>
> {
>
>          .....................................................
>
> J_ASSERT_BH(bh, !buffer_dirty(bh));
> /*
> * The buffer on BJ_Forget list and not jbddirty means
> * it has been freed by this transaction and hence it
> * could not have been reallocated until this
>
> }
>
>
> So, I firstly get the call stack to understand why JBD2 reports
> Spotted dirty metadata buffer. The dump stack is listed below:
>
>
> 4,173710,227778211328,-;[<80015cf0>] (unwind_backtrace+0x0/0x118) from
> [<80011f98>] (show_stack+0x10/0x14)
> 4,173711,227778220057,-;[<80011f98>] (show_stack+0x10/0x14) from
> [<80013b5c>] (ipi_backtrace+0x70/0xa8)
> 4,173712,227778228492,-;[<80013b5c>] (ipi_backtrace+0x70/0xa8) from
> [<80013cf0>] (arch_trigger_all_cpu_backtrace+0x40/0xe0)
> 4,173713,227778238689,-;[<80013cf0>]
> (arch_trigger_all_cpu_backtrace+0x40/0xe0) from [<801ac99c>]
> (warn_dirty_buffer+0x2c/0x38)
> 4,173714,227778249227,-;[<801ac99c>] (warn_dirty_buffer+0x2c/0x38)
> from [<801adcc4>] (__jbd2_journal_file_buffer+0xa8/0x1d0)
> 4,173715,227778259500,-;[<801adcc4>]
> (__jbd2_journal_file_buffer+0xa8/0x1d0) from [<801ae8cc>]
> (jbd2_journal_dirty_metadata+0x1ec/0x258)
> 4,173716,227778270828,-;[<801ae8cc>]
> (jbd2_journal_dirty_metadata+0x1ec/0x258) from [<8018f680>]
> (__ext4_handle_dirty_metadata+0xe0/0x190)
> 4,173717,227778282332,-;[<8018f680>]
> (__ext4_handle_dirty_metadata+0xe0/0x190) from [<8016949c>]
> (write_end_fn+0x50/0x7c)
> 4,173718,227778292340,-;[<8016949c>] (write_end_fn+0x50/0x7c) from
> [<8016a860>] (ext4_walk_page_buffers+0x88/0xa0)
> 4,173719,227778301741,-;[<8016a860>]
> (ext4_walk_page_buffers+0x88/0xa0) from [<8016ee8c>]
> (ext4_journalled_write_end+0x160/0x308)
> 4,173720,227778312462,-;[<8016ee8c>]
> (ext4_journalled_write_end+0x160/0x308) from [<800afd44>]
> (generic_file_buffered_write+0x154/0x224)
> 4,173721,227778323775,-;[<800afd44>]
> (generic_file_buffered_write+0x154/0x224) from [<800b146c>]
> (__generic_file_aio_write+0x350/0x3a0)
> 4,173722,227778334999,-;[<800b146c>]
> (__generic_file_aio_write+0x350/0x3a0) from [<800b1510>]
> (generic_file_aio_write+0x54/0xb4)
> 4,173723,227778345615,-;[<800b1510>]
> (generic_file_aio_write+0x54/0xb4) from [<80165478>]
> (ext4_file_write+0x348/0x430)
> 4,173724,227778355453,-;[<80165478>] (ext4_file_write+0x348/0x430)
> from [<800ef3ec>] (do_sync_readv_writev+0x74/0x98)
> 4,173725,227778365115,-;[<800ef3ec>] (do_sync_readv_writev+0x74/0x98)
> from [<800f01e8>] (do_readv_writev+0xd4/0x200)
> 4,173726,227778374690,-;[<800f01e8>] (do_readv_writev+0xd4/0x200) from
> [<800f03dc>] (vfs_writev+0x58/0x70)
> 4,173727,227778383399,-;[<800f03dc>] (vfs_writev+0x58/0x70) from
> [<800f0494>] (SyS_writev+0x38/0x68)
> 4,173728,227778391586,-;[<800f0494>] (SyS_writev+0x38/0x68) from
> [<8000dbc0>] (ret_fast_syscall+0x0/0x30)
>
> Based on the comment of the JBD2 code in __jbd2_journal_file_buffer, I
> quote "For metadata buffers, we track dirty bit in buffer_jbddirty
> instead of buffer_dirty. We should not see a dirty bit set here
> because we clear it in do_get_write_access but e.g.
> tune2fs can modify the sb and set the dirty bit at any time so we try
> to gracefully handle that."
> it means jbd2 code shouldn't set this buffer_dirty anyway.
>
> and Another comment in do_get_write_access() in transaction.c, it says
> more, i quote " it is journaled, and we don't expect dirty buffers
> in that state (the buffers should be marked JBD_Dirty instead.)  So
> either the IO is being done under our own control and this is a bug,
> or it's a third party IO such as dump(8) (which may leave the buffer
> scheduled for read --- ie. locked but not dirty) or tune2fs (which may
> actually havethe buffer dirtied, ugh.)
> "
>
> In our case, the dirty bit of buffer_head is changed between
> ext4_write_begin() and ext4_journalled_write_end() as bleow:
>
> "
> ext4_write_begin() --> BH dirty bit is 0
>
> ext4_journalled_write_end() --> BH dirty bit is changed to 1 !!! and
> error is reported
> "
>
> Then I do an experiment change as below in ext4_journalled_write_end()
> to see whether the original issue could be duplicated.
> After almost a week test, the issue doesn't appear. Therefore, it
> seems there are some race conditions that change the value
> of BH dirty between ext4_write_begin() and
> ext4_journalled_write_end(). (Note: do_journal_get_write_access has
> already been
> called at ext4_write_begin )
>
> +               ret = ext4_walk_page_buffers(handle, page_buffers(page), from,
> +                                            to,
> &partial,do_journal_get_write_access);
> +
>                 ret = ext4_walk_page_buffers(handle, page_buffers(page), from,
>                                              to, &partial, write_end_fn);
>
>
>
> After reviewing the code, it looks although
> do_journal_get_write_access() imposes lock on buffer header, it
> doesn't lock the whole routine between ext4_write_begin() and
> ext4_journalled_write_end(). so i think the buffer head should be
> locked at the time until ext4_journalled_write_end() is finished.
> Initially, I imagine that I can do the following change but then I
> realise a existing buffer lock has already been placed in
> jbd2_journal_get_write_access, so making a big buffer lock will
> quickly become a performance issue and
> also involves a lot of changes.
>
> + lock all the buffer headers belonging to this page
>
> ext4_write_begin()
> .....................................................
>  ext4_journalled_write_end()
>
> + unlock all the buffer headers belonging to this page
>
> Therefore, I propose the following change for simplicity and performance:
>
> --- /linux-3.10.27_old/fs/ext4/inode.c 2014-01-15 23:29:14.000000000 +0000
> +++/linux-3.10.27_new/fs/ext4/inode.c 2017-05-11 22:47:58.986146486 +0100
> @@ -1084,10 +1084,18 @@
>   int ret;
>   if (!buffer_mapped(bh) || buffer_freed(bh))
>   return 0;
> +
> + lock_buffer();
> + if (buffer_dirty(bh)) {
> +     clear_buffer_dirty(bh);
> + }
> +
>   set_buffer_uptodate(bh);
>   ret = ext4_handle_dirty_metadata(handle, NULL, bh);
>   clear_buffer_meta(bh);
>   clear_buffer_prio(bh);
> +
> + unlock_buffer();
>   return ret;
>  }
>
> Thanks
> Jmqiao