linux-ext4 - RE: ext4 out of order when use cfq scheduler

Message-ID: <f0c925079bb4450380c019a7455a2537@SGPMBX1004.APAC.bosch.com>
Date:	Thu, 7 Jan 2016 11:02:29 +0000
From:	"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@...bosch.com>
To:	Jan Kara <jack@...e.cz>
CC:	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	"Li, Michael" <huayil@....qualcomm.com>
Subject: RE: ext4 out of order when use cfq scheduler



> -----Original Message-----
> From: Jan Kara [mailto:jack@...e.cz]
> Sent: Thursday, January 07, 2016 6:24 PM
> To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@...bosch.com>
> Cc: Jan Kara <jack@...e.cz>; linux-ext4@...r.kernel.org
> Subject: Re: ext4 out of order when use cfq scheduler
> 
> On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > > -----Original Message-----
> > > From: Jan Kara [mailto:jack@...e.cz]
> > > Sent: Wednesday, January 06, 2016 6:06 PM
> > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@...bosch.com>
> > > Subject: Re: ext4 out of order when use cfq scheduler
> > >
> > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > So you are running in 'ws' mode of your tool, am I right? Just
> > > > > looking into the sources you've sent me I've noticed that
> > > > > although you set O_SYNC in openflg when mode == MODE_WS, you do
> > > > > not use openflg at all. So file won't be synced at all. That
> > > > > would well explain why you see that not all file contents is
> > > > > written. So did you just send me a different version of the
> > > > > > source or is your test program really buggy?
> > > > >
> > > >
> > > > > Yes, it is a bug in the test code, so the test tool actually
> > > > > creates files without the O_SYNC flag. But even in this case, is
> > > > > the out-of-order behavior acceptable? Is it normal?
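
The bug described above (O_SYNC is set in openflg, but openflg never reaches open(2)) can be illustrated with a minimal sketch. The helper name write_file and its parameters are hypothetical and are not taken from the test tool; the sketch only assumes standard POSIX calls.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the test tool: write len bytes to
 * path and return 0 on success, -1 on error. When sync_mode is true the
 * O_SYNC bit is set in the flags that are really handed to open(2).
 */
int write_file(const char *path, const void *buf, size_t len, bool sync_mode)
{
    int openflg = O_WRONLY | O_CREAT | O_TRUNC;

    if (sync_mode)
        openflg |= O_SYNC;      /* write(2) returns only after the data is on the device */

    /* The bug discussed above amounts to computing openflg but not passing it here. */
    int fd = open(path, openflg, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) {           /* treat errors and zero-length writes as failure */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Without O_SYNC, an explicit fsync(2) is the other way to force the data out. */
    if (!sync_mode && fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Either path (O_SYNC or fsync) gives a durability guarantee for that one file only; neither orders writeback between different files.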
> > >
> > > Without fsync(2) or O_SYNC, it is perfectly possible that some files
> > > are written and others are not since nobody guarantees order of
> > > writeback of inodes. OTOH you shouldn't ever see uninitialized data
> > > in the inode (but so far it isn't clear to me whether you really see
> > > uninitialized data or whether we really wrote zeros to those blocks -
> > > ext4 can sometimes decide to do so).  Your traces and disk contents
> > > show that the problematic inode has an extent of length 128 blocks
> > > starting at block 0x12c00 and then an extent of length 1 block
> > > starting at block 0x1268e. What is the block size of the filesystem?
> > > Because the inode size is only 0x40010.
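
The block-size question above comes down to simple arithmetic: how many blocks does a file of 0x40010 bytes need at each candidate block size? The sketch below is only that arithmetic. The function name blocks_needed is hypothetical, and the candidate block sizes are assumptions, not values reported in this thread.

```c
#include <stdint.h>

/* Number of filesystem blocks needed to hold size bytes (rounds up). */
uint64_t blocks_needed(uint64_t size, uint64_t block_size)
{
    return (size + block_size - 1) / block_size;
}
```

For an inode size of 0x40010 (262160 bytes) this gives 257 blocks at 1024 bytes per block, 129 at 2048, and 65 at 4096. Only the 2048-byte case equals the 128 + 1 = 129 blocks covered by the two extents, but that alone does not confirm the real block size, since a file can have more blocks mapped than its size requires.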
> > >
> > > Some suggestions to try:
> > > 1) Print also length of a write request in addition to the starting
> > > block so that we can see how much actually got written
> >
> > Please see below failure analysis.
> >
> > > 2) Initialize the device to 0xff so that we can distinguish
> > > uninitialized blocks from zeroed-out blocks.
> >
> > Yes, I initialized the device to 0xff this time.
> >
> > > 3) Report exactly for which 512-byte blocks checksum matches and for
> > > which it is wrong.
> > The wrong contents are old file contents that were created in a previous
> > test round. This is caused by the "wrong" ordering between the inode
> > data (in the journal) and the file contents, so the file contents are
> > not updated.
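
Suggestions 2) and 3) above can be combined into a per-sector check once the device has been pre-filled with 0xff. The following is a minimal sketch; the names classify_sector, sector_state and SECTOR_SIZE are hypothetical and do not come from the test tool.

```c
#include <stddef.h>

#define SECTOR_SIZE 512

enum sector_state { SECTOR_UNWRITTEN, SECTOR_ZEROED, SECTOR_DATA };

/*
 * Classify one 512-byte sector read back from a device that was
 * pre-filled with 0xff before the test.
 */
enum sector_state classify_sector(const unsigned char *sector)
{
    int all_ff = 1, all_zero = 1;

    for (size_t i = 0; i < SECTOR_SIZE; i++) {
        if (sector[i] != 0xff)
            all_ff = 0;
        if (sector[i] != 0x00)
            all_zero = 0;
    }
    if (all_ff)
        return SECTOR_UNWRITTEN;  /* still holds the 0xff fill pattern */
    if (all_zero)
        return SECTOR_ZEROED;     /* zeros were written explicitly */
    return SECTOR_DATA;           /* anything else: file data, new or stale */
}
```

A sector classified as SECTOR_DATA still needs the checksum comparison to tell current file contents from stale contents left by an earlier test round.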
> 
> So this confuses me somewhat. You previously said that you always remove files
> after each test round and then new ones are created. Is it still the case? So the old
> file contents you speak about above is just some random contents that happened
> to be in disk blocks we freshly allocated to the file, am I right?

Yes, you are right.
By "old file contents" I mean contents that were generated in the previous test round and left in disk blocks which are then allocated to a new file in the new test round.


> 
> OK, so I was looking into the code and indeed, reality is correct and my mental
> model was wrong! ;) I thought that inode gets added to the list of inodes for which
> we need to wait for data IO completion during transaction commit during block
> allocation. And I was wrong. It used to happen in
> mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to
> ext4_jbd2_file_inode() from delalloc write path) where it got removed. And that was
> wrong because although we submit data writes before dropping handle for
> allocating transaction and updating i_size, nobody guarantees that data IO is not
> delayed in the block layer until transaction commit.
> Which seems to happen in your case. I'll send a fix. Thanks for your report and
> persistence!
> 

Thanks a lot for your feedback :-)
I am not familiar with the details of the ext4 internals, so I will try to understand the explanation you gave above and have a look at the related functions.
Could you send the fix in this mail thread?
And does kernel 3.14 also have this issue?

> 								Honza
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR