linux-ext4 - Re: [PATCH] jbd_commit_transaction() races with journal_try_to_drop

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 05 May 2008 10:53:38 -0700
From:	Mingming Cao <cmm@...ibm.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Badari Pulavarty <pbadari@...ibm.com>, akpm@...ux-foundation.org,
	linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] jbd_commit_transaction() races with
	journal_try_to_drop_buffers() causing DIO failures

On Mon, 2008-05-05 at 19:06 +0200, Jan Kara wrote:
> On Thu 01-05-08 08:16:21, Badari Pulavarty wrote:
> > Hi Andrew & Jan,
> > 
> > I was able to reproduce the customer problem involving DIO
> > (invalidate_inode_pages2) problem by writing simple testcase
> > to keep writing to a file using buffered writes and DIO writes
> > forever in a loop. I see DIO writes fail with -EIO.
> > 
> > After a long debug, found 2 cases how this could happen.
> > These are race conditions with journal_try_to_free_buffers()
> > and journal_commit_transaction().
> > 
> > 1) journal_submit_data_buffers() tries to get bh_state lock. If
> > try lock fails, it drops the j_list_lock and sleeps for
> > bh_state lock, while holding a reference on the buffer.
> > In the meanwhile, journal_try_to_free_buffers() can clean up the
> > journal head could call try_to_free_buffers(). try_to_free_buffers()
> > would fail due to the reference held by journal_submit_data_buffers()
> > - which in turn causes failues for DIO (invalidate_inode_pages2()).
> > 
> > 2) When the buffer is on t_locked_list waiting for IO to finish,
> > we hold a reference and give up the cpu, if we can't get
> > bh_state lock. This causes try_to_free_buffers() to fail.
> > 
> > Fix is to drop the reference on the buffer if we can't get
> > bh_state lock, give up the cpu and re-try the whole operation -
> > instead of waiting for the vh_state lock.
> > 
> > Does this look like a resonable fix ?
>   As Mingming pointed out there are few other places where we could hold
> the bh reference. 

Actually there is one more place that journal_try_to_free_buffers
(calling from DIO path) could race with journal_submit_data_buffers(),
the DIO and buffered IO test still returns EIO with the fix which should
fixed the first 3 race cases.

I could not figure out how this could happen with Badari's fix to
inverted_lock().

This time the debug shows that the one who release put_bh() after the
journal_try_to_free_buffers() failed is from this code path:

journal_submit_data_buffers() {

...

 } else if (!locked && buffer_locked(bh)) {
                        __journal_file_buffer(jh, commit_transaction,
                                                BJ_Locked);
                        jbd_unlock_bh_state(bh);
                        put_bh(bh);

}

But when we get here we should already making sure the bh is on the
t_syncdata_list with badari's fix...

> Note also that we accumulate references to buffers in the
> wbuf[] list and we need that for submit_bh() which consumes one bh
> reference. 

It seems to me it's safe in this case. When
journal_try_to_free_buffers() called from DIO path,
filemap_fdatawrite_and_wait() already making sure that the IO submitted
by kjournald is already finished by waiting for buffer unlocked.

> Generally, it seems to me as a too fragile and impractical
> rule "nobody can hold bh reference when not holding page lock" which is
> basically what it comes down to if you really want to be sure that
> journal_try_to_free_buffers() succeeds. And also note that in principle
> there are other places which hold references to buffers without holding the
> page lock - for example writepage() in ordered mode (although this one is
> in practice hardly triggerable). So how we could fix at least the races
> with commit code is to implement launder_page() callback for ext3/4 which
> would wait for the previous transaction commit in case the page has buffers
> that are part of that commit (I don't want this logic in
> journal_try_to_free_buffers() as that is called also on memory-reclaim
> path, but journal_launder_page() is fine with me). This would be correct
> but could considerably slow down O_DIRECT writes in cases they're mixed
> with buffered writes so I'm not sure if this is acceptable.
>   OTOH with the ordered mode rewrite patch, the problem with commit code
> also goes away since there we don't need extra references to data buffers
> (we use just filemap_fdatawrite).
> 
> > 1) journal_submit_data_buffers() tries to get bh_state lock. If
> > try lock fails, it drops the j_list_lock and sleeps for
> > bh_state lock, while holding a reference on the buffer head.
> > In the meanwhile, journal_try_to_free_buffers() can clean up the
> > journal head could call try_to_free_buffers(). try_to_free_buffers()
> > would fail due to the reference held by journal_submit_data_buffers()
> > - which inturn causes failues for DIO (invalidate_inode_pages2()).
> > 
> > 2) When the buffer is on t_locked_list waiting for IO to finish,
> > we hold a reference and give up the cpu, if we can't get 
> > bh_state lock. This causes try_to_free_buffers() to fail.
> > 
> > Fix is to drop the reference on the buffer, give up the cpu
> > and re-try the whole operation.
> > 
> > Signed-off-by: Badari Pulavarty <pbadari@...ibm.com>
> > Reviewed-by: Mingming Cao <mcao@...ibm.com>
> 
> 							Honza

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html