Message-ID: <763022183d4647ef99a333b1bab75e7e@SGPMBX1004.APAC.bosch.com>
Date: Fri, 8 Jan 2016 02:18:40 +0000
From: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@...bosch.com>
To: Jan Kara <jack@...e.cz>
CC: "linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
"Li, Michael" <huayil@....qualcomm.com>
Subject: RE: ext4 out of order when use cfq scheduler
> -----Original Message-----
> From: Jan Kara [mailto:jack@...e.cz]
> Sent: Thursday, January 07, 2016 8:19 PM
> To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@...bosch.com>
> Cc: Jan Kara <jack@...e.cz>; linux-ext4@...r.kernel.org; Li, Michael
> <huayil@....qualcomm.com>
> Subject: Re: ext4 out of order when use cfq scheduler
>
> On Thu 07-01-16 12:47:36, Jan Kara wrote:
> > On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > > > -----Original Message-----
> > > > From: Jan Kara [mailto:jack@...e.cz]
> > > > Sent: Thursday, January 07, 2016 6:24 PM
> > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@...bosch.com>
> > > > Cc: Jan Kara <jack@...e.cz>; linux-ext4@...r.kernel.org
> > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > >
> > > > On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > -----Original Message-----
> > > > > > From: Jan Kara [mailto:jack@...e.cz]
> > > > > > Sent: Wednesday, January 06, 2016 6:06 PM
> > > > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@...bosch.com>
> > > > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > > > >
> > > > > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > > > So you are running in 'ws' mode of your tool, am I right?
> > > > > > > > Just looking into the sources you've sent me I've noticed
> > > > > > > > that although you set O_SYNC in openflg when mode ==
> > > > > > > > MODE_WS, you do not use openflg at all. So file won't be
> > > > > > > > synced at all. That would well explain why you see that
> > > > > > > > not all file contents are written. So did you just send me
> > > > > > > > a different version of the source or is your test program
> > > > > > > > really buggy?
> > > > > > > >
> > > > > > >
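For illustration, the bug described above (open flags computed but never passed to open(2)) and its fix might look roughly like the sketch below. The names open_for_test() and write_and_sync(), the sync_writes parameter and the 0644 mode are assumptions made up for this example; they are not taken from the actual test tool.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the real test tool: create a file
 * for writing, optionally with O_SYNC so that each write(2) returns only
 * after the data has reached stable storage.
 */
int open_for_test(const char *path, int sync_writes)
{
    int openflg = O_WRONLY | O_CREAT | O_TRUNC;

    if (sync_writes)
        openflg |= O_SYNC;

    /* The flags must actually be handed to open(2) to have any effect. */
    return open(path, openflg, 0644);
}

/*
 * Alternative without O_SYNC: write normally, then call fsync(2) before
 * treating the contents as durable. Returns 0 on success, -1 on error.
 */
int write_and_sync(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return fsync(fd);
}
```

Either variant gives a durability point for a single file; neither one by itself says anything about the order in which different files reach the disk.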
> > > > > > > Yes, it is a bug in the test code, so the test tool actually
> > > > > > > creates files without the O_SYNC flag. But even in this
> > > > > > > case, is the out-of-order writing acceptable? Or is it normal?
> > > > > >
> > > > > > Without fsync(2) or O_SYNC, it is perfectly possible that some
> > > > > > files are written and others are not since nobody guarantees
> > > > > > order of writeback of inodes. OTOH you shouldn't ever see
> > > > > > uninitialized data in the inode (but so far it isn't clear to
> > > > > > me whether you really see uninitialized data or whether we
> > > > > > really wrote zeros to those blocks - ext4 can sometimes decide
> > > > > > to do so). Your traces and disk contents show that the
> > > > > > problematic inode has an extent of length 128 blocks starting
> > > > > > at block 0x12c00 and then an extent of length 1 block starting
> > > > > > at block 0x1268e. What is the block size of the filesystem?
> > > > > > Because the inode size is only 0x40010.
> > > > > >
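The block-size question can be checked with a little arithmetic: 0x40010 is 262160 bytes, and the two extents add up to 128 + 1 = 129 blocks. The sketch below is only an illustration under the assumption that the extents cover the file exactly, with no extra preallocated blocks; the function names and the list of candidate block sizes (1 KiB, 2 KiB, 4 KiB) are made up for this example.

```c
#include <stdint.h>

/* Number of filesystem blocks needed to hold i_size bytes, rounded up. */
uint64_t blocks_needed(uint64_t i_size, uint32_t block_size)
{
    return (i_size + block_size - 1) / block_size;
}

/*
 * Return the candidate block size for which a file of i_size bytes
 * occupies exactly extent_blocks blocks, or 0 if no candidate fits.
 * The candidate list is an assumption of this sketch.
 */
uint32_t matching_block_size(uint64_t i_size, uint64_t extent_blocks)
{
    static const uint32_t candidates[] = { 1024, 2048, 4096 };
    unsigned int i;

    for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
        if (blocks_needed(i_size, candidates[i]) == extent_blocks)
            return candidates[i];
    return 0;
}
```

Under these assumptions, matching_block_size(0x40010, 129) yields 2048: a 4 KiB block size would need only 65 blocks and a 1 KiB block size would need 257.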
> > > > > > Some suggestions to try:
> > > > > > 1) Print also length of a write request in addition to the
> > > > > > starting block so that we can see how much actually got
> > > > > > written
> > > > >
> > > > > Please see below failure analysis.
> > > > >
> > > > > > 2) Initialize the device to 0xff so that we can distinguish
> > > > > > uninitialized blocks from zeroed-out blocks.
> > > > >
> > > > > Yes, I initialized the device to 0xff this time.
> > > > >
> > > > > > 3) Report exactly for which 512-byte blocks checksum matches
> > > > > > and for which it is wrong.
> > > > > The wrong contents are old file contents which were created in
> > > > > the previous test round. This is caused by the "wrong" ordering
> > > > > between the inode data (in the journal) and the file contents,
> > > > > so the file contents are not updated.
> > > >
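Suggestions 2) and 3) above amount to classifying each 512-byte block read back from the device. A minimal sketch of such a check follows; the enum and function names are hypothetical and not part of any existing tool.

```c
#include <stddef.h>

enum block_kind {
    BLOCK_NEVER_WRITTEN, /* still holds the 0xff fill pattern */
    BLOCK_ZEROED,        /* every byte is 0x00 */
    BLOCK_DATA           /* anything else, including stale old file data */
};

/* Classify one block of len bytes (for example 512) read from the device. */
enum block_kind classify_block(const unsigned char *buf, size_t len)
{
    int all_ff = 1;
    int all_zero = 1;
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] != 0xff)
            all_ff = 0;
        if (buf[i] != 0x00)
            all_zero = 0;
    }
    if (all_ff)
        return BLOCK_NEVER_WRITTEN;
    if (all_zero)
        return BLOCK_ZEROED;
    return BLOCK_DATA;
}
```

Note that a block containing leftover contents from an earlier test round is reported as BLOCK_DATA, so a checksum comparison is still needed to tell stale data from the expected data.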
> > > > So this confuses me somewhat. You previously said that you always
> > > > remove files after each test round and then new ones are created.
> > > > Is it still the case? So the old file contents you speak about
> > > > above is just some random contents that happened to be in disk blocks
> > > > we freshly allocated to the file, am I right?
> > >
> > > Yes. You are right.
> > > The "old file contents" are the contents of disk blocks that were
> > > written in the last test round; those blocks are then allocated to a
> > > new file in the new test round.
> > >
> > >
> > > >
> > > > OK, so I was looking into the code and indeed, reality is correct
> > > > and my mental model was wrong! ;) I thought that inode gets added
> > > > to the list of inodes for which we need to wait for data IO
> > > > completion during transaction commit during block allocation. And
> > > > I was wrong. It used to happen in
> > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > > calls to
> > > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > > removed. And that was wrong because although we submit data writes
> > > > before dropping handle for allocating transaction and updating
> > > > i_size, nobody guarantees that data IO is not delayed in the block
> > > > layer until transaction commit.
> > > > Which seems to happen in your case. I'll send a fix. Thanks for
> > > > your report and persistence!
> > > >
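The ordering problem explained above can be pictured with a toy model. This is only a sketch for illustration, not kernel code: all toy_* names and the fixed-size list are invented here, and "completing" the data I/O inside the commit function merely stands in for the real commit waiting on that I/O.

```c
#include <stdbool.h>

#define TOY_MAX_INODES 8

struct toy_inode {
    bool data_on_disk;   /* has the data write completed? */
};

struct toy_txn {
    /* inodes whose data must reach the disk before the commit */
    struct toy_inode *ordered[TOY_MAX_INODES];
    int nr_ordered;
    bool committed;
};

/* Toy stand-in for filing an inode on the running transaction. */
int toy_file_inode(struct toy_txn *t, struct toy_inode *ino)
{
    if (t->nr_ordered >= TOY_MAX_INODES)
        return -1;
    t->ordered[t->nr_ordered++] = ino;
    return 0;
}

/* Commit: data of all filed inodes is made stable first, then metadata. */
void toy_commit(struct toy_txn *t)
{
    int i;

    for (i = 0; i < t->nr_ordered; i++)
        t->ordered[i]->data_on_disk = true;
    t->committed = true;
}

/*
 * If the machine crashes right after the commit, committed metadata
 * (size, block mapping) may point to blocks whose new data never
 * arrived, which is the case reported in this thread.
 */
bool toy_stale_after_crash(const struct toy_txn *t,
                           const struct toy_inode *ino)
{
    return t->committed && !ino->data_on_disk;
}
```

In this model an inode that was filed on the transaction can never be stale after the commit, while an inode that was not filed can be, if its data write is still sitting in the block layer.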
> > >
> > > Thanks a lot for your feedback :-)
> > > Because I am not familiar with the details of the ext4 internal code,
> > > I will try to understand the explanation you gave above and have a
> > > look at the related functions.
> > > Could you send the fix in this mail?
> > > And does kernel 3.14 also have this issue?
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
>
> Oh, I have realized the patch is on top of current ext4 development tree and it
> won't compile for current vanilla kernel because of EXT4_GET_BLOCKS_ZERO
> check. Just remove that line when you get compilation failure.
>
> > + if (map->m_flags & EXT4_MAP_NEW &&
> > + !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
> > + !(flags & EXT4_GET_BLOCKS_ZERO) &&
>
> Just remove the above line and things should work for older kernels as well.
>
> > + ext4_should_order_data(inode)) {
> > + ret = ext4_jbd2_file_inode(handle, inode);
> > + if (ret)
> > + return ret;
> > + }
> > }
> > return retval;
> > }
>
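To make the condition in the quoted hunk easier to read, here it is restated as a standalone predicate. This is a sketch only: the SKETCH_* bit values are placeholders (the real flags are defined in fs/ext4/ext4.h), the function name is invented, and the have_zero_flag parameter models whether the kernel in question has EXT4_GET_BLOCKS_ZERO at all.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration only. */
#define SKETCH_MAP_NEW          (1u << 0)
#define SKETCH_MAP_UNWRITTEN    (1u << 1)
#define SKETCH_GET_BLOCKS_ZERO  (1u << 2)

/*
 * Should the inode be filed on the transaction after mapping blocks?
 * Mirrors the structure of the condition in the quoted patch.
 */
bool sketch_should_file_inode(unsigned int m_flags, unsigned int gb_flags,
                              bool ordered_mode, bool have_zero_flag)
{
    /* Only newly allocated blocks are of interest. */
    if (!(m_flags & SKETCH_MAP_NEW))
        return false;
    /* Blocks mapped as unwritten are skipped by the patch. */
    if (m_flags & SKETCH_MAP_UNWRITTEN)
        return false;
    /* Check present only in the development tree, per the mail above. */
    if (have_zero_flag && (gb_flags & SKETCH_GET_BLOCKS_ZERO))
        return false;
    /* Stands in for ext4_should_order_data(inode). */
    return ordered_mode;
}
```

Calling it with have_zero_flag set to false corresponds to removing the EXT4_GET_BLOCKS_ZERO line for older kernels, as suggested above.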
I would like to confirm one thing with you, because the patch tool did not find the following context in my kernel:
"out_sem:
ret = check_block_validity(inode, map);"
After checking the code, I added the modification at the end of the function ext4_map_blocks().
Below is the diff. Please help to double-check it.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10b71e4..d29a1d2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -753,6 +753,10 @@ has_zeroout:
int ret = check_block_validity(inode, map);
if (ret != 0)
return ret;
+ if(ext4_should_order_data(inode)) {
+ ret = ext4_jbd2_file_inode(handle, inode);
+ if (ret)
+ return ret;
}
return retval;
}
@@ -1113,15 +1117,6 @@ static int ext4_write_end(struct file *file,
int i_size_changed = 0;
trace_ext4_write_end(inode, pos, len, copied);
- if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
- ret = ext4_jbd2_file_inode(handle, inode);
- if (ret) {
- unlock_page(page);
- page_cache_release(page);
- goto errout;
- }
- }
-
if (ext4_has_inline_data(inode)) {
ret = ext4_write_inline_data_end(inode, pos, len,