lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANubcdV32L71ARCznZgKdrt0BmSyOYwW50L17zP=TG4PO2MH4Q@mail.gmail.com>
Date: Wed, 7 Aug 2024 16:10:50 +0800
From: Stephen Zhang <starzhangzsd@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: tytso@....edu, jack@...e.com, linux-ext4@...r.kernel.org, 
	linux-kernel@...r.kernel.org, zhangshida@...inos.cn, 
	Baolin Liu <liubaolin@...inos.cn>
Subject: Re: [RFC PATCH] jbd2: fix a potential assertion failure due to
 improperly dirtied buffer

Jan Kara <jack@...e.cz> 于2024年8月6日周二 21:40写道:
>
> Hello!
>

Hi, Jan:)

> On Sat 20-07-24 14:23:56, zhangshida wrote:
> > From: Shida Zhang <zhangshida@...inos.cn>
> >
> > On an old kernel version(4.19, ext3, journal=data, pagesize=64k),
> > an assertion failure will occasionally be triggered by the line below:
>
> OK, just out of curiosity, why are you using data=journal mode? It doesn't
> really get that much testing and the performance is quite bad...
>

It is used by one of our customers. It's more like a historical issue:
About 12 years ago, they used data=journal mode for the benefit of user
data consistency brought by the mode.
Time goes by, they attempted to change, say, maybe change it to ext4
at least, but found it is no more stable than it was under ext3...
And yeah, they decided to just leave the thing as it was and keep the system
under that state until now...

> > jbd2_journal_commit_transaction
> > {
> > ...
> > J_ASSERT_BH(bh, !buffer_dirty(bh));
> > /*
> > * The buffer on BJ_Forget list and not jbddirty means
> > ...
> > }
> >
> > AFAIC, that's how the problem works:
> > --------
> > journal_unmap_buffer
> > jbd2_journal_invalidatepage
> > __ext4_journalled_invalidatepage
> > ext4_journalled_invalidatepage
> > do_invalidatepage
> > truncate_inode_pages_range
> > truncate_inode_pages
> > truncate_pagecache
> > ext4_setattr
> > --------
> >
> > First try to truncate and invalidate the page.
> > Sometimes the buffer and the page won't be freed immediately.
> > the buffer will be sent to the BJ_Forget list of the currently
> > committing transaction. Maybe the transaction knows when and how
> > to free the buffer better.
> > The buffer's states now: !jbd_dirty !mapped !dirty
> >
> > Then jbd2_journal_commit_transaction()will try to iterate over the
> > BJ_Forget list:
> > --------
> > jbd2_journal_commit_transaction()
> >       while (commit_transaction->t_forget) {
> >       ...
> >       }
> > --------
> >
> > At this exact moment, another write comes:
> > --------
> > mark_buffer_dirty
> > __block_write_begin_int
> > __block_write_begin
> > ext4_write_begin
> > --------
> > it sees a unmapped new buffer, and marks it as dirty.
>
> This should not happen. When ext4_setattr() truncates the file, we do not
> allow reallocating these blocks for other purposes until the transaction

ext4_setattr() will try to free it by adding it to the BJ_Forget list
for further processing.
Put it more clearly,
when ext4_setattr() truncates the file, the buffer is not fully freed
yet. It's half-freed.
Furthermore,
Because the buffer is half-freed, the reallocating thing won't need to happen.
Now,
under that scenario, can we redirty the half-freed buffer on the BJ_Forget list?
The answer may be 'yes'.

redirty it by the following code:
ext4_block_write_begin
    if (!buffer_mapped(bh)) { // check 1
         _ext4_get_block(inode, block, bh, 1);
        (buffer_new(bh)) { // check 2
             if (folio_test_uptodate(folio)) { // check 3
                 mark_buffer_dirty(bh);


But can it pass the checks?

Is the buffer mapped? no, journal_unmap_buffer() will clear the mapped state.
Pass the check 1.

Is the buffer new? maybe, _ext4_get_block will mark it as new when the
underlying block is unwritten.
Pass the check 2.

Is the folio uptodate? yes.
Pass the check 3.

Yep, the buffer finally gets dirtied and jbd2_journal_commit_transaction() sees
a dirty but not jbd_dirty buffer on the BJ_Forget list.

For another proof, there is indeed a small window where the buffer could be
seen dirty.
Have a look at the code and comment in do_journal_get_write_access:
----------------
int do_journal_get_write_access(handle_t *handle, struct inode *inode,
struct buffer_head *bh)
{
...
/*
* __block_write_begin() could have dirtied some buffers. Clean
* the dirty bit as jbd2_journal_get_write_access() could complain
* otherwise about fs integrity issues. Setting of the dirty bit
* by __block_write_begin() isn't a real problem here as we clear
* the bit before releasing a page lock and thus writeback cannot
* ever write the buffer.
*/
if (dirty)
clear_buffer_dirty(bh); // clear the dirty immdiately in case some bad
things happen
...
}
---------------


> commits. Did you find this using some tracing or just code analysis?
>

There are both vmcore and code analysis.
I will attach the vmcore-dmesg.txt.
Let me know If you are interested in the details since the vmcore and
vmlinux is too
big(1.5g) to upload via the mail...


Thanks,
Stephen.






> There have been quite some fixes to data=journal mode since 4.19 so it may
> quite well be that your problem is actually already fixed in current
> kernels...
>
>                                                                 Honza
>
> > Finally, jbd2_journal_commit_transaction()will trip over the assertion
> > during the BJ_Forget list iteration.
> >
> > After an code analysis, it seems that nothing can prevent the
> > __block_write_begin() from dirtying the buffer at this very moment.
> >
> > The same condition may also be applied to the lattest kernel version.
> >
> > To fix it:
> > Add some checks to see if it is the case dirtied by __block_write_begin().
> > if yes, it's okay and just let it go. The one who dirtied it and the
> > jbd2_journal_commit_transaction() will know how to cooperate together and
> > deal with it in a proper manner.
> > if no, try to complain as we normally will do.
> >
> > Reported-by: Baolin Liu <liubaolin@...inos.cn>
> > Signed-off-by: Shida Zhang <zhangshida@...inos.cn>
> > ---
> >  fs/jbd2/commit.c | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 75ea4e9a5cab..2c13d0af92d8 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -1023,7 +1023,20 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >                       if (is_journal_aborted(journal))
> >                               clear_buffer_jbddirty(bh);
> >               } else {
> > -                     J_ASSERT_BH(bh, !buffer_dirty(bh));
> > +                     bool try_to_complain = 1;
> > +                     struct folio *folio = NULL;
> > +
> > +                     folio = bh->b_folio;
> > +                     /*
> > +                      * Try not to complain about the dirty buffer marked dirty
> > +                      * by __block_write_begin().
> > +                      */
> > +                     if (buffer_dirty(bh) && folio && folio->mapping
> > +                         && folio_test_locked(folio))
> > +                             try_to_complain = 0;
> > +
> > +                     if (try_to_complain)
> > +                             J_ASSERT_BH(bh, !buffer_dirty(bh));
> >                       /*
> >                        * The buffer on BJ_Forget list and not jbddirty means
> >                        * it has been freed by this transaction and hence it
> > --
> > 2.33.0
> >
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR

View attachment "vmcore-dmesg.txt" of type "text/plain" (1038257 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ