[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4DC1C67B.3030801@canonical.com>
Date: Thu, 05 May 2011 00:34:51 +0300
From: Surbhi Palande <surbhi.palande@...onical.com>
To: Jan Kara <jack@...e.cz>
CC: Toshiyuki Okajima <toshi.okajima@...fujitsu.com>,
Ted Ts'o <tytso@....edu>,
Masayoshi MIZUMA <m.mizuma@...fujitsu.com>,
Andreas Dilger <adilger.kernel@...ger.ca>,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
sandeen@...hat.com
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due
to a deadlock
On 05/04/2011 10:19 PM, Jan Kara wrote:
> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
>>>>> (2011/04/16 2:13), Jan Kara wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>>>>>> probably
>>>>>>>> get modified to block while minor-faulting the page on frozen fs
>>>>>>>> because
>>>>>>>> when blocks are already allocated we may skip starting a transaction
>>>>>>>> and so
>>>>>>>> we could possibly modify the filesystem.
>>>>>>> OK. I think ->page_mkwrite() should also block writing the
>>>>>>> minor-faulting pages.
>>>>>>>
>>>>>>> (minor-pagefault)
>>>>>>> -> do_wp_page()
>>>>>>> -> page_mkwrite(= ext4_mkwrite())
>>>>>>> => BLOCK!
>>>>>>>
>>>>>>> (major-pagefault)
>>>>>>> -> do_liner_fault()
>>>>>>> -> page_mkwrite(= ext4_mkwrite())
>>>>>>> => BLOCK!
>>>>>>>
>>>>>>>>
>>>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>>>>>> file (mmap).
>>>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>>>>>> while
>>>>>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>>>>>> operation
>>>>>>>>>>> while fsfreezing.
>>>>>>>>>> Technically speaking, we block all the transaction starts which
>>>>>>>>>> means we
>>>>>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>>>>>> not mean
>>>>>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>>>>>> properly
>>>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>>>>>> can't allow
>>>>>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>>>>>> path can
>>>>>>>>> write to disk while fsfreezing because this deadlock problem
>>>>>>>>> happens after
>>>>>>>>> fsfreeze operation is done...
>>>>>>>> I'm sorry I don't understand now - are you speaking about the case
>>>>>>>> above
>>>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>>>> else?
>>>>>>> Sorry, I didn't understand around the page fault path.
>>>>>>> So, I had read the kernel source code around it, then I maybe
>>>>>>> understand...
>>>>>>>
>>>>>>> I worry whether we can update the file data in mmap case while
>>>>>>> fsfreezing.
>>>>>>> Of course, I understand that we can write to in-memory cache, and it
>>>>>>> is not a
>>>>>>> problem. However, if we can write to disk while fsfreezing, it is a
>>>>>>> problem.
>>>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>>>> while fsfreezing)
>>>>>>>
>>>>>>> [1] One of the page which has been mmapped is not bound. And
>>>>>>> the page is not allocated yet. (major fault?)
>>>>>>>
>>>>>>> (1) user dirtys a page
>>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>>> (3) __do_falut is called.
>>>>>>> (4) ext4_page_mkwrite is called
>>>>>>> (5) ext4_write_begin is called
>>>>>>> (6) ext4_journal_start_sb => We can STOP!
>>>>>>>
>>>>>>> [2] One of the page which has been mmapped is not bound. But
>>>>>>> the page is already allocated, and the buffer_heads of the page
>>>>>>> are not mapped (BH_Mapped). (minor fault?)
>>>>>>>
>>>>>>> (1) user dirtys a page
>>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>>> (3) do_wp_page is called.
>>>>>>> (4) ext4_page_mkwrite is called
>>>>>>> (5) ext4_write_begin is called
>>>>>>> (6) ext4_journal_start_sb => We can STOP!
>>>>
>>>> What happens in the case as follows:
>>>>
>>>> Task 1: Mmapped writes
>>>> t1)ext4_page_mkwrite()
>>>> t2) ext4_write_begin() (FS is thawed so we proceed)
>>>> t3) ext4_write_end() (journal is stopped now)
>>>> -----Pre-empted-----
>>>>
>>>>
>>>> Task 2: Freeze Task
>>>> t4) freezes the super block...
>>>> ...(continues)....
>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>> completed execution.
>>>>
>>>> Task 1: Mmapped writes
>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>> tn+2) __do_fault() gets control, code gets executed.
>>>> tn+3) _do_fault() marks the page dirty if the intent is to write to
>>>> a file based page which faulted.
>>>>
>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>> You are right ext4_page_mkrite() as currently implemented has problems.
>>> You have to return the page locked (and check for frozen fs with page lock
>>> held) to avoid races.
>>>
>>> If you check for frozen fs with page lock held, you are guaranteed that
>>> freezing code must wait for the page to get unlocked before proceeding. And
>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>> makes freezing code write the page and writeprotect it again. So everything
>>> will be safe.
>> For the locked page to be a part of the freeze initiated sync,
>> should its owner inode not be dirtied? The page fault handler
>> dirties the page, but who ensures that the inode is dirtied at this
>> point?
Well, I mean it as follows:
Doesn't the writeback code (invoked via sync_filesystem(sb)) write all
the dirty pages of all the _dirty_ inodes of a superblock?
So in the window from the point where ext4_page_mkwrite returns to
__do_fault() _till_ you mark the inode dirty (in __mark_inode_dirty()),
you can have a race with freeze i.e if freeze happens meanwhile, then
the sync initiated by freeze will not consider this locked page as the
owner inode is _clean_ (or not dirtied yet) at that point?
Key: tx: time at unit x
P1: mmapped writes
t1) __do_page_fault()
t2) ext4_page_mkwrite()
// owner inode of the page is in _clean_ state - not yet dirtied
--- pre-empted---
P2: Freeze_super
tn) freeze_super gets control
freezes the F.S, skips the owner inode as it is in the clean state.
syncs all the other dirty inodes. page cache is now clean.
P1: mmapped writes (resume)
tn+x)__do_page_fault() gets control back:
tn+x+1) set_page_dirty()
tn+x+2) __set_page_dirty_buffers()
tn+x+3) __set_page_dirty()
tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)
So don't we end up dirtying the page cache when the F.S is frozen?
Again, apologies if I understood the writeback code or something else wrong!
Warm Regards,
Surbhi.
> Follow the path from set_page_dirty() -> __set_page_dirty_buffers()
> -> __set_page_dirty() -> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> More code reading would save you (and me) some typing ;).
P/S: Sorry about that!
>
> Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists