Message-ID: <503F4AB1.4050501@rs.jp.nec.com>
Date: Thu, 30 Aug 2012 20:12:49 +0900
From: Akira Fujita <a-fujita@...jp.nec.com>
To: Jan Kara <jack@...e.cz>
CC: Dmitry Monakhov <dmonakhov@...nvz.org>, linux-ext4@...r.kernel.org,
tytso@....edu
Subject: Re: [PATCH 1/3] ext4: nonda_switch prevent deadlock
Hi,
(2012/08/29 22:28), Jan Kara wrote:
> On Tue 28-08-12 20:21:41, Dmitry Monakhov wrote:
>> Currently ext4_da_write_begin() may deadlock if called with an open journal
>> transaction. Real-life example:
>> ->move_extent_per_page()
>> ->ext4_journal_start()-> hold journal transaction
>> ->write_begin()
>> ->ext4_da_write_begin()
>> ->ext4_nonda_switch()
>> ->writeback_inodes_sb_if_idle() --> will wait for journal_stop()
>>
>> This bug may be easily fixed by reordering the code,
>> but in some cases it should be possible to call write_begin()
>> while holding a journal transaction. In this case the caller may avoid
>> recursion by passing the AOP_FLAG_NOFS flag.
> Well, I find calling ext4_write_begin() with a transaction started a bug.
> Possibly ext4_write_begin() can be tweaked to handle that but things would
> be simpler if we didn't have to.
>
> Looking into move_extent_per_page(), calling ->write_begin() doesn't seem
> to be quite right there anyway. For example it results in filling holes
> under that page, which is not desirable. I'm not even sure why we call
> ->write_begin() there at all. The data in the page is unchanged. So it
> should be enough to just remap buffers and mark the page dirty (but I might
> be missing some subtlety here). Fujita-san, can you possibly explain?
Originally, write_begin/write_end were called in move_extent_per_page() to
get the page and to mark the buffer heads (bh) that were exchanged by
mext_replace_branches() as dirty.
But if there is a better way to do this, it makes sense to fix it.
Regards,
Akira Fujita
> Honza
>
>> ---
>> fs/ext4/inode.c | 28 +++++++++++++++++-----------
>> 1 files changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 6324f74..d12d30e 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -889,6 +889,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>> struct page *page;
>> pgoff_t index;
>> unsigned from, to;
>> + int nofs = flags & AOP_FLAG_NOFS;
>> +
>> + /* We cannot recurse into the filesystem if the transaction is already
>> + * started */
>> + BUG_ON(!nofs && journal_current_handle());
>>
>> trace_ext4_write_begin(inode, pos, len, flags);
>> /*
>> @@ -906,9 +911,6 @@ retry:
>> ret = PTR_ERR(handle);
>> goto out;
>> }
>> -
>> - /* We cannot recurse into the filesystem as the transaction is already
>> - * started */
>> flags |= AOP_FLAG_NOFS;
>>
>> page = grab_cache_page_write_begin(mapping, index, flags);
>> @@ -957,7 +959,8 @@ retry:
>> }
>> }
>>
>> - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
>> + if (!nofs && ret == -ENOSPC &&
>> + ext4_should_retry_alloc(inode->i_sb, &retries))
>> goto retry;
>> out:
>> return ret;
>> @@ -2447,7 +2450,7 @@ out_writepages:
>> }
>>
>> #define FALL_BACK_TO_NONDELALLOC 1
>> -static int ext4_nonda_switch(struct super_block *sb)
>> +static int ext4_nonda_switch(struct super_block *sb, int writeback_allowed)
>> {
>> s64 free_blocks, dirty_blocks;
>> struct ext4_sb_info *sbi = EXT4_SB(sb);
>> @@ -2475,7 +2478,7 @@ static int ext4_nonda_switch(struct super_block *sb)
>> * Even if we don't switch but are nearing capacity,
>> * start pushing delalloc when 1/2 of free blocks are dirty.
>> */
>> - if (free_blocks < 2 * dirty_blocks)
>> + if (writeback_allowed && free_blocks < 2 * dirty_blocks)
>> writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
>>
>> return 0;
>> @@ -2490,10 +2493,14 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
>> pgoff_t index;
>> struct inode *inode = mapping->host;
>> handle_t *handle;
>> + int nofs = flags & AOP_FLAG_NOFS;
>>
>> index = pos >> PAGE_CACHE_SHIFT;
>> + /* We cannot recurse into the filesystem if the transaction is already
>> + * started */
>> + BUG_ON(!nofs && journal_current_handle());
>>
>> - if (ext4_nonda_switch(inode->i_sb)) {
>> + if (ext4_nonda_switch(inode->i_sb, !nofs)) {
>> *fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
>> return ext4_write_begin(file, mapping, pos,
>> len, flags, pagep, fsdata);
>> @@ -2512,8 +2519,6 @@ retry:
>> ret = PTR_ERR(handle);
>> goto out;
>> }
>> - /* We cannot recurse into the filesystem as the transaction is already
>> - * started */
>> flags |= AOP_FLAG_NOFS;
>>
>> page = grab_cache_page_write_begin(mapping, index, flags);
>> @@ -2538,7 +2543,8 @@ retry:
>> ext4_truncate_failed_write(inode);
>> }
>>
>> - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
>> + if (!nofs && ret == -ENOSPC &&
>> + ext4_should_retry_alloc(inode->i_sb, &retries))
>> goto retry;
>> out:
>> return ret;
>> @@ -4791,7 +4797,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>> /* Delalloc case is easy... */
>> if (test_opt(inode->i_sb, DELALLOC) &&
>> !ext4_should_journal_data(inode) &&
>> - !ext4_nonda_switch(inode->i_sb)) {
>> + !ext4_nonda_switch(inode->i_sb, 1)) {
>> do {
>> ret = __block_page_mkwrite(vma, vmf,
>> ext4_da_get_block_prep);
>> --
>> 1.7.7.6
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
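
For reference, the gate that the quoted patch adds around the writeback call
can be modeled in user space roughly as below. This is only a sketch: every
toy_* name and TOY_FLAG_NOFS are hypothetical stand-ins, not kernel symbols,
and only the final check shown in the quoted hunk of ext4_nonda_switch() is
modeled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for AOP_FLAG_NOFS; the value is arbitrary. */
#define TOY_FLAG_NOFS 0x1u

/* Toy superblock: only the counters the quoted hunk looks at. */
struct toy_sb {
	long free_blocks;
	long dirty_blocks;
	int writeback_requests;	/* counts simulated writeback kicks */
};

/*
 * Stand-in for writeback_inodes_sb_if_idle(). Per the call chain quoted
 * above, the real call ends up waiting for journal_stop(); here we only
 * count the request.
 */
void toy_start_writeback(struct toy_sb *sb)
{
	sb->writeback_requests++;
}

/* Models only the last check of ext4_nonda_switch() as changed by the patch. */
int toy_nonda_switch(struct toy_sb *sb, bool writeback_allowed)
{
	if (writeback_allowed && sb->free_blocks < 2 * sb->dirty_blocks)
		toy_start_writeback(sb);
	return 0;
}

/* Models how ext4_da_write_begin() derives the argument from its flags. */
int toy_da_write_begin(struct toy_sb *sb, unsigned int flags)
{
	bool nofs = (flags & TOY_FLAG_NOFS) != 0;

	return toy_nonda_switch(sb, !nofs);
}
```

With free_blocks = 10 and dirty_blocks = 8, a call with TOY_FLAG_NOFS set
leaves writeback_requests untouched, while a call without it increments the
counter once.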
--
Akira Fujita <a-fujita@...jp.nec.com>
The First Fundamental Software Development Group,
Platform Division, NEC Software Tohoku, Ltd.