linux-ext4 - Re: Should we discard jbddirty bit if BH

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7bb361261001292241t7dda80d8sc5f467677ad34954@mail.gmail.com>
Date:	Sat, 30 Jan 2010 14:41:07 +0800
From:	丁定华 <dingdinghua85@...il.com>
To:	Jan Kara <jack@...e.cz>
Cc:	linux-ext4@...r.kernel.org
Subject: Re: Should we discard jbddirty bit if BH_Freed is set?

Hi,
   The attached is a patch to fix this problem,
please have a look.

2010/1/29 丁定华 <dingdinghua85@...il.com>:
> Hi,
>    I think what we need to do is to distinguish the buffers which can
> be discarded after
> this transaction commits and which we must wait until the next transaction
> commits, and as you wrote, set b_next_transaction to running
> transaction and file them
> into t_forget list is a good way.
>
> 2010/1/28 Jan Kara <jack@...e.cz>:
>>  Hi,
>>
>> On Thu 28-01-10 09:23:43, 丁定华 wrote:
>>>     As you wrote, if T2!=T1, then T2 is committing transaction while T1
>>>     is running transaction, and if T1 complete commit, we don't care
>>>     about the content of buffers.  But there is a  prerequisite -->"T1
>>>     complete commit", if T1 start commit and another transaction T3
>>>     becomes the new running transaction, T3 may need to reuse T2 log
>>>     space and force checkpoint, and since we have clean the BH_dirty bit
>>>     of buffers after T2 commits, so T2 may be freed before T1 complete
>>>     commit, and unfortunately, T1 doesn't complete commit, so after
>>>     replay, updates of T2 get lost, fs becomes inconsistent.
>>  Hmm, you are right. That could probably happen. It only hits ext3/4 in
>> data=journaled mode but the bug is there. But it's hard to fix it in a
>> reasonable way - if we would just leave the dirty bit set, we will have
>> aliasing problems - buffer B is attached to some page which used to be from
>> file F, so unmap_underlying_metadata will not find it because it looks only
>> into page cache the block device, not to the pagecaches of individual
>> files. So if we reallocate the block of B for some other file G before the
>> buffer B is checkpointed, we have two dirty buffers representing the same
>> block and thus data corruption can happen.
>>  Maybe we could handle them by setting b_next_transaction to the
>> transaction that deleted the buffer (in jbd2_journal_unmap_buffer) and
>> setting buffer freed like we do now. Commit code would handle freed
>> buffers like:
>> If b_next_transaction is set, file buffer to forget list of the next
>> transaction. If b_next_transaction isn't set, clear all dirty bits.
>> What do you think?
>>
>>                                                                Honza
>>
>>> 2010/1/27 Jan Kara <jack@...e.cz>:
>>> >  Hi,
>>> >
>>> > On Wed 27-01-10 10:32:18, 丁定华 wrote:
>>> >>         I'm a little confused about BH_Freed bit. The only place it is set
>>> >> is journal_unmap_buffer, which is called by jbd2_journal_invalidatepage when
>>> >> we want to truncate a file. Since jbd2_journal_invalidatepage is called
>>> >> outside of transaction, We can't make sure whether the "add to orphan"
>>> >> operation belongs to committing transaction or not,  so we can't touch the
>>> >> buffer belongs to committing transaction, instead BH_Freed bit is set to
>>> >> indicate that this buffer can be discarded in running transaction. But i
>>> >> think we shouldn't clear BH_JBDdirty in jbd2_journal_commit_transaction, as
>>> >> following codes does:
>>> >>                 /* A buffer which has been freed while still being
>>> >>                  * journaled by a previous transaction may end up still
>>> >>                  * being dirty here, but we want to avoid writing back
>>> >>                  * that buffer in the future now that the last use has
>>> >>                  * been committed.  That's not only a performance gain,
>>> >>                  * it also stops aliasing problems if the buffer is left
>>> >>                  * behind for writeback and gets reallocated for another
>>> >>                  * use in a different page. */
>>> >>                 if (buffer_freed(bh)) {
>>> >>                         clear_buffer_freed(bh);
>>> >>                         clear_buffer_jbddirty(bh);
>>> >>                 }
>>> >> Note that, *We can't make sure "current running transaction" can complete
>>> >> commit work.* If we clear BH_JBDdirty bit here, this buffer may be freed
>>> >> here,  the log space of older transaction may be freed before the "current
>>> >> running transaction" complete commit work, and if this happends, filesystem
>>> >> will be inconsistent.
>>> >  Let me sketch the situation here:
>>> > The file F gets truncated. The inode is added to orphan list in some
>>> > transaction T1, only then jbd2_journal_invalidatepage can be called.
>>> > As you wrote above, it can happen that jbd2_journal_invalidatepage on
>>> > buffer B runs when some transaction T2 containing B is being committed and
>>> > in that case we set BH_Freed.  If T2 != T1 - i.e., T2 is being committed
>>> > and T1 is the running transaction, note that we clear the dirty bit only
>>> > when T2 is fully committed and we are processing forget list. So buffer has
>>> > been properly written to T2 and we just won't write it in the transaction
>>> > T1. And that is fine because as soon as transaction T1 finishes commit, we
>>> > don't care about what happens with buffers of F because the fact that F is
>>> > truncated is recorded and in case of crash we finish truncate during
>>> > journal replay. And if we crash before T1 finishes commit, we don't care
>>> > about contents of T1 either. If T2 == T1, the above reasoning applies as
>>> > well and the situation is even simpler.
>>> >
>>> >                                                                Honza
>>> > --
>>> > Jan Kara <jack@...e.cz>
>>> > SUSE Labs, CR
>>> >
>>>
>>>
>>>
>>> --
>>> 丁定华
>> --
>> Jan Kara <jack@...e.cz>
>> SUSE Labs, CR
>>
>
>
>
> --
> 丁定华
>



-- 
丁定华

View attachment "0001-Jbd2-delay-discarding-buffers-in-journal_unmap_buff.patch" of type "text/x-patch" (2678 bytes)