lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <6.0.0.20.2.20080807115853.03f95b78@172.19.0.2>
Date:	Thu, 07 Aug 2008 12:15:14 +0900
From:	Hisashi Hifumi <hifumi.hisashi@....ntt.co.jp>
To:	Mingming Cao <cmm@...ibm.com>, Jan Kara <jack@...e.cz>
Cc:	Chris Mason <chris.mason@...cle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH] jbd jbd2: fix dio write returning
  EIOwhentry_to_release_page fails


At 07:57 08/08/07, Mingming Cao wrote:
>
>蝨ィ 2008-08-06荳臥噪 15:53 +0200�招�an Kara蜀咎%�肖.
>> On Wed 06-08-08 
>09:25:13, Chris Mason wrote:
>> > On Tue, 2008-08-05 at 14:17 -0700, Mingming Cao wrote:
>> > > 蝨ィ 2008-08-05莠檎噪 12:17 -0400�招�hris Mason蜀咎%�肖.
>> > > > On 
>Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:
>> > > > > >> > 
>> > > > > >> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c 
>> > > > > >linux-2.6.27-rc1/fs/jbd/transaction.c
>> > > > > >> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c	2008-07-29 
>> > > > > >19:28:47.000000000 +0900
>> > > > > >> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c	2008-07-29 
>20:40:12.000000000 +0900
>> > > > > >> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
>> > > > > >> >  	*/
>> > > > > >> >  	if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & 
>__GFP_FS)) {
>> > > > > >> >  		journal_wait_for_transaction_sync_data(journal);
>> > > > > >> > +
>> > > > > >> > +		bh = head;
>> > > > > >> > +		do {
>> > > > > >> > +			while (atomic_read(&bh->b_count))
>> > > > > >> > +				schedule();
>> > > > > >> > +		} while ((bh = bh->b_this_page) != head);
>> > > > > >> >  		ret = try_to_free_buffers(page);
>> > > > > >> >  	}
>> > > > > >> 
>> > > > > >> The loop is problematic.  If the scheduler decides to keep 
>running this
>> > > > > >> task then we have a busy loop.  If this task has realtime 
>policy then
>> > > > > >> it might even lock up the kernel.
>> > > > > >> 
>> > > > > >
>> > > > > >ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
>> > > > > >not be the best idea there either.
>> > > > > >
>> > > > > >This code gets called from releasepage, which is used other 
>places than
>> > > > > >the O_DIRECT invalidation paths, I'd be worried about performance
>> > > > > >problems here.
>> > > > > >
>> > > > > 
>> > > > > try_to_release_page has gfp_mask parameter. So when try_to_releasepage
>> > > > > is called from performance sensitive part, gfp_mask should not be set.
>> > > > > b_count check loop is inside of (gfp_mask & __GFP_WAIT) && 
>(gfp_mask & __GFP_FS) check.
>> > > > 
>> > > > Looks like try_to_free_pages will go into releasepage with wait & fs
>> > > > both set.  This kind of change would make me very nervous.
>> > > > 
>> > > 
>> > > Hi Chris,
>> > > 
>> > > The gfp_mask try_to_free_pages() takes from it's caller will past it
>> > > down to try_to_release_page().  Based on the meaning of __GFP_WAIT and
>> > > GFP_FS, if the upper level caller set these two flags,  I assume the
>> > > upper level caller expect delay and wait for fs to finish?
>> > > 
>> > > 
>> > > But I agree that using a loop in journal_try_to_free_buffers() to wait
>> > > for the busy bh release the counter is expensive...
>> > 
>> > I rediscovered your old thread about trying to do this in a launder_page
>> > call ;)
>>   Yes, we thought about using launder_page() before :).
>> 
>> > Does it make more sense to fix do_launder_page to call into the FS on
>> > every page, and let the FS check for PageDirty on its own?  That way
>> > invalidate_inode_pages2_range basically gets its own private call into
>> > the FS that says wait around until this page is really free.
>>   That would certainly work as well. But IMHO waiting for ->writepage()
>> call to finish isn't really a big deal even in try_to_release_page() if
>> __GFP_FS (and __GFP_WAIT) is set. The only problem is that there is no
>> effective way to do so and so Hisashi used that "wait for b_count to drop"
>> which looks really scary and I don't like it as well.
>> 
>
>I was  looking at the comment in invalidate_complete_page2(), which is
>now only called from DIO path, it saids
>
>/*
> * This is like invalidate_complete_page(), except it ignores the page's
> * refcount.  We do this because invalidate_inode_pages2() needs
>stronger
> * invalidation guarantees, and cannot afford to leave pages behind
>because
> * shrink_page_list() has a temp ref on them, or because they're
>transiently
> * sitting in the lru_cache_add() pagevecs.
> */
>
>
>I am wondering why we need stronger invalidate hurantees for DIO->
>invalidate_inode_pages_range(),which force the page being removed from
>page cache? In case of bh is busy due to ext3 writeout,
>journal_try_to_free_buffers() could return different error number(EBUSY)
>to try_to_releasepage() (instead of EIO).  In that case,  could we just
>leave the page in the cache, clean pageuptodate() (to force later buffer
>read to read from disk) and then invalidate_complete_page2() return
>successfully? Any issue with this way?

My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
bh busy, and dio write falls back to buffered write. This is easy to fix.


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ