[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20080701032250.GA16596@skywalker>
Date: Tue, 1 Jul 2008 08:52:50 +0530
From: "Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
To: Mingming Cao <cmm@...ibm.com>
Cc: tytso@....edu, sandeen@...hat.com, jack@...e.cz,
linux-ext4@...r.kernel.org
Subject: Re: [PATCH updated] ext4: Handle page without buffers in
ext4_*_writepage()
On Mon, Jun 30, 2008 at 10:58:08AM -0700, Mingming Cao wrote:
>
> On Mon, 2008-06-30 at 15:36 +0530, Aneesh Kumar K.V wrote:
> > From: Aneesh Kumar <aneesh.kumar@...ux.vnet.ibm.com>
> >
> > It can happen that buffers are removed from the page before it gets
> > marked dirty and then is passed to writepage(). In writepage()
> > we just initialize the buffers and check whether they are mapped and non
> > delay. If they are mapped and non delay we write the page. Otherwise we
> > mark then dirty. With this change we don't do block allocation at all in
> > ext4_*_write_page.
>
> Probably should update the mpage_da_map_blocks() in mpage.c to indicate
> that the failure of block allocation in da_writepages() are no longer
> deferred to ext4_da_writepage(), as with this change we just simply
> re-dirty the page later in ext4_da_writepage()
mpage_da_map_blocks() being the VFS helper function I guess we should
not update that. It is correct even now that when we fail to get blocks in
mpage_da_map_blocks we expect the writepage of the file system to handle
that. With ext4 we redirty the page. Other file system which want to use
the mpage_da_map_blocks()/delayed allocation helper can choose to
implement error handling/block allocation in writepage()
>
> >
> > writepage() get called under many condition and with a locking order
> > of journal_start -> lock_page we shouldnot try to allocate blocks in
> > writepage() which get called after taking page lock. writepage can get
> > called via shrink_page_list even with a journal handle which was created
> > for doing inode update. For example when doing ext4_da_write_begin we
> > create a journal handle with credit 1 expecting a i_disksize update for
> > the inode. But ext4_da_write_begin can cause shrink_page_list via
> > _grab_page_cache. So having a valid handle via ext4_journal_current_handle
> > is not a guarantee that we can use the handle for block allocation in
> > writepage. We should not be using credits which are taken for other updates.
> > That would result in we running out of credits when we update inodes.
> >
> >
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@...ux.vnet.ibm.com>
> > ---
> > fs/ext4/inode.c | 169 ++++++++++++++++++++++++++++++++++++++++---------------
> > 1 files changed, 124 insertions(+), 45 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index cd5c165..18af94a 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -1591,11 +1591,15 @@ static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
> > handle_t *handle = NULL;
> >
> > handle = ext4_journal_current_handle();
> > - BUG_ON(handle == NULL);
> > - BUG_ON(create == 0);
> > -
> > - ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
> > + if (!handle) {
> > + ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
> > + bh_result, 0, 0, 0);
> > + BUG_ON(!ret);
> > + } else {
> > + ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
> > bh_result, create, 0, EXT4_DELALLOC_RSVED);
> > + }
> > +
> > if (ret > 0) {
> > bh_result->b_size = (ret << inode->i_blkbits);
> >
>
> With this set of changes (that ext4_da_get_block_write() no longer
> called from ext4_da_writepage()), would it still possible to calling
> ext4_da_get_block_write() without a journal handle?
ext4_da_get_block_write cannot be called without a journal handle. But I
have to double check the call path. I guess if you consider
ext4_da_get_block_write as a helper API it make sense to add code paths
for !handle also. It states that if you call this helper without a
handle better make sure all the buffer_heads are mapped.
>
> > @@ -1633,15 +1637,37 @@ static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
> >
> > static int ext4_bh_unmapped_or_delay(handle_t *handle, struct buffer_head *bh)
> > {
> > - return !buffer_mapped(bh) || buffer_delay(bh);
> > + /*
> > + * unmapped buffer is possible for holes.
> > + * delay buffer is possible with delayed allocation
> > + */
> > + return ((!buffer_mapped(bh) || buffer_delay(bh)) && buffer_dirty(bh));
> > +}
> > +
> > +static int ext4_normal_get_block_write(struct inode *inode, sector_t iblock,
> > + struct buffer_head *bh_result, int create)
> > +{
> > + int ret = 0;
> > + unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > +
> > + /*
> > + * we don't want to do block allocation in writepage
> > + * so call get_block_wrap with create = 0
> > + */
> > + ret = ext4_get_blocks_wrap(NULL, inode, iblock, max_blocks,
> > + bh_result, 0, 0, 0);
> > + if (ret > 0) {
> > + bh_result->b_size = (ret << inode->i_blkbits);
> > + ret = 0;
> > + }
> > + return ret;
> > }
> >
>
> Here we pass create = 0 to get_block all the time in ext4_**_writepage()
> all the time. Just wondering what about writeout those unwritten
> extents(preallocated extent) to disk? We need to pass create=1 to
> convert it to initialized extent and do proper split if needed. Does
> this convertion/split get done via ext4_da_get_block_write() from
> ext4_da_writepages()?
Yes. That get done via writepages() (pdflush) for delayed allocation and via
write_begin/write_end/page_mkwrite for non delayed allocation.
>
> > /*
> > - * get called vi ext4_da_writepages after taking page lock
> > - * We may end up doing block allocation here in case
> > - * mpage_da_map_blocks failed to allocate blocks.
> > - *
> > - * We also get called via journal_submit_inode_data_buffers
> > + * get called vi ext4_da_writepages after taking page lock (have journal handle)
> > + * get called via journal_submit_inode_data_buffers (no journal handle)
> > + * get called via shrink_page_list via pdflush (no journal handle)
> > + * or grab_page_cache when doing write_begin (have journal handle)
> > */
> > static int ext4_da_writepage(struct page *page,
> > struct writeback_control *wbc)
> > @@ -1649,37 +1675,61 @@ static int ext4_da_writepage(struct page *page,
> > int ret = 0;
> > loff_t size;
> > unsigned long len;
> > - handle_t *handle = NULL;
> > struct buffer_head *page_bufs;
> > struct inode *inode = page->mapping->host;
> >
> > - handle = ext4_journal_current_handle();
> > - if (!handle) {
> > - /*
> > - * This can happen when we aren't called via
> > - * ext4_da_writepages() but directly (shrink_page_list).
> > - * We cannot easily start a transaction here so we just skip
> > - * writing the page in case we would have to do so.
> > - * We reach here also via journal_submit_inode_data_buffers
> > - */
> > - size = i_size_read(inode);
> > + size = i_size_read(inode);
> > + if (page->index == size >> PAGE_CACHE_SHIFT)
> > + len = size & ~PAGE_CACHE_MASK;
> > + else
> > + len = PAGE_CACHE_SIZE;
> >
> > + if (page_has_buffers(page)) {
> > page_bufs = page_buffers(page);
> > - if (page->index == size >> PAGE_CACHE_SHIFT)
> > - len = size & ~PAGE_CACHE_MASK;
> > - else
> > - len = PAGE_CACHE_SIZE;
> > -
> > - if (walk_page_buffers(NULL, page_bufs, 0,
> > - len, NULL, ext4_bh_unmapped_or_delay)) {
> > + if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
> > + ext4_bh_unmapped_or_delay)) {
> > /*
> > - * We can't do block allocation under
> > - * page lock without a handle . So redirty
> > - * the page and return
> > + * We don't want to do block allocation
> > + * So redirty the page and return
> > * We may reach here when we do a journal commit
> > * via journal_submit_inode_data_buffers.
> > * If we don't have mapping block we just ignore
> > - * them
> > + * them. We can also reach here via shrink_page_list
> > + */
> > + redirty_page_for_writepage(wbc, page);
> > + unlock_page(page);
> > + return 0;
> > + }
> > + } else {
> > + /*
> > + * The test for page_has_buffers() is subtle:
> > + * We know the page is dirty but it lost buffers. That means
> > + * that at some moment in time after write_begin()/write_end()
> > + * has been called all buffers have been clean and thus they
> > + * must have been written at least once. So they are all
> > + * mapped and we can happily proceed with mapping them
> > + * and writing the page.
> > + *
> > + * Try to initialize the buffer_heads and check whether
> > + * all are mapped and non delay. We don't want to
> > + * do block allocation here.
> > + */
> > + ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
> > + ext4_normal_get_block_write);
> > + if (!ret) {
> > + page_bufs = page_buffers(page);
> > + /* check whether all are mapped and non delay */
> > + if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
> > + ext4_bh_unmapped_or_delay)) {
> > + redirty_page_for_writepage(wbc, page);
> > + unlock_page(page);
> > + return 0;
> > + }
> > + } else {
> > + /*
> > + * We can't do block allocation here
> > + * so just redity the page and unlock
> > + * and return
> > */
> > redirty_page_for_writepage(wbc, page);
> > unlock_page(page);
> > @@ -1688,9 +1738,11 @@ static int ext4_da_writepage(struct page *page,
> > }
> >
> > if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
> > - ret = nobh_writepage(page, ext4_da_get_block_write, wbc);
> > + ret = nobh_writepage(page, ext4_normal_get_block_write, wbc);
> > else
> > - ret = block_write_full_page(page, ext4_da_get_block_write, wbc);
> > + ret = block_write_full_page(page,
> > + ext4_normal_get_block_write,
> > + wbc);
> >
> > return ret;
> > }
> > @@ -2031,12 +2083,14 @@ static int __ext4_normal_writepage(struct page *page,
> > struct inode *inode = page->mapping->host;
> >
> > if (test_opt(inode->i_sb, NOBH))
> > - return nobh_writepage(page, ext4_get_block, wbc);
> > + return nobh_writepage(page,
> > + ext4_normal_get_block_write, wbc);
> > else
> > - return block_write_full_page(page, ext4_get_block, wbc);
> > + return block_write_full_page(page,
> > + ext4_normal_get_block_write,
> > + wbc);
> > }
> >
>
> I am wondering how does non-delayed ordered mode handling initialization
> the unwritten extent here, with about change that does not pass create =
> 1 to underlying ext4_get_block_wrap()?
writepages() will call get_block_wrap with create = 1
>
> Also a little puzzled why we need to fix here for non-delayed allocation
> writepage(): blocks are all allocated at prepare_write() time, with
> page_mkwrite this is also true for mmaped IO. So we shouldn't get into
> the case that we need to do block allocation in ext4_normal_writepage()
> called from non-delayed path?
Just to make in clear in code that we don't do block allocation at all in
writepage. If you consider ext4_normal_writepage as a helper API it kind
of makes it clear that irrespective of journal_handle we are not going
to make any block allocation in ext4_normal_Writepage.
> >
> > -
> > static int ext4_normal_writepage(struct page *page,
> > struct writeback_control *wbc)
> > {
> > @@ -2045,13 +2099,24 @@ static int ext4_normal_writepage(struct page *page,
> > loff_t len;
> >
> > J_ASSERT(PageLocked(page));
> > - J_ASSERT(page_has_buffers(page));
> > if (page->index == size >> PAGE_CACHE_SHIFT)
> > len = size & ~PAGE_CACHE_MASK;
> > else
> > len = PAGE_CACHE_SIZE;
> > - BUG_ON(walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > - ext4_bh_unmapped_or_delay));
> > +
> > + if (page_has_buffers(page)) {
> > + /* if page has buffers it should all be mapped
> > + * and allocated. If there are not buffers attached
> > + * to the page we know the page is dirty but it lost
> > + * buffers. That means that at some moment in time
> > + * after write_begin() / write_end() has been called
> > + * all buffers have been clean and thus they must have been
> > + * written at least once. So they are all mapped and we can
> > + * happily proceed with mapping them and writing the page.
> > + */
> > + BUG_ON(walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > + ext4_bh_unmapped_or_delay));
> > + }
> >
> > if (!ext4_journal_current_handle())
> > return __ext4_normal_writepage(page, wbc);
> > @@ -2071,7 +2136,8 @@ static int __ext4_journalled_writepage(struct page *page,
> > int ret = 0;
> > int err;
> >
> > - ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE, ext4_get_block);
> > + ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
> > + ext4_normal_get_block_write);
> > if (ret != 0)
> > goto out_unlock;
> >
> > @@ -2118,13 +2184,24 @@ static int ext4_journalled_writepage(struct page *page,
> > loff_t len;
> >
> > J_ASSERT(PageLocked(page));
> > - J_ASSERT(page_has_buffers(page));
> > if (page->index == size >> PAGE_CACHE_SHIFT)
> > len = size & ~PAGE_CACHE_MASK;
> > else
> > len = PAGE_CACHE_SIZE;
> > - BUG_ON(walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > - ext4_bh_unmapped_or_delay));
> > +
> > + if (page_has_buffers(page)) {
> > + /* if page has buffers it should all be mapped
> > + * and allocated. If there are not buffers attached
> > + * to the page we know the page is dirty but it lost
> > + * buffers. That means that at some moment in time
> > + * after write_begin() / write_end() has been called
> > + * all buffers have been clean and thus they must have been
> > + * written at least once. So they are all mapped and we can
> > + * happily proceed with mapping them and writing the page.
> > + */
> > + BUG_ON(walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > + ext4_bh_unmapped_or_delay));
> > + }
> >
> > if (ext4_journal_current_handle())
> > goto no_write;
> > @@ -2142,7 +2219,9 @@ static int ext4_journalled_writepage(struct page *page,
> > * really know unless we go poke around in the buffer_heads.
> > * But block_write_full_page will do the right thing.
> > */
> > - return block_write_full_page(page, ext4_get_block, wbc);
> > + return block_write_full_page(page,
> > + ext4_normal_get_block_write,
> > + wbc);
> > }
> > no_write:
> > redirty_page_for_writepage(wbc, page);
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists