lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1217014002.6322.29.camel@mingming-laptop>
Date:	Fri, 25 Jul 2008 12:26:42 -0700
From:	Mingming Cao <cmm@...ibm.com>
To:	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
Cc:	tytso <tytso@....edu>, Shehjar Tikoo <shehjart@....unsw.edu.au>,
	linux-ext4@...r.kernel.org
Subject: Re: [RFC]Ext4: journal credits reservation fixes for DIO,
	fallocate and delalloc writepages


在 2008-07-23三的 13:12 +0530,Aneesh Kumar K.V写道:
> Hi Mingming,
> 
> Comments below

> On Tue, Jul 22, 2008 at 05:51:51PM -0700, Mingming Cao wrote:
> > Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages
> > 
> > From: Mingming Cao <cmm@...ibm.com>
> > 
> > With delalloc, at writepages() time, we need to reserve enough credits to start
> > a new handle, to allow possible multiple segment of block allocations under a
> > single call mapge_da_writepages(), so that metadata updates could fit into a single
> > transaction. This patch fixed this by calculating the needed credits for
> > write-out given number of dirty pages, with the consideration of discontinues
> > block allocations. It fixed both extent files and non extent files.
> > 
> > This patch also fixed the journal credit reservation for DIO. Currently the
> > estimated credits for DIO is only based on non extent format file. That credit
> > is not enough for mballoc a single extent on extent based file. This patch
> > fixed that.
> > 
> > The fallocate double booking credits for modifying super block etc, this patch
> > fixed that.
> > 
> > This also fix credit reservation in migration and defrag code.
> > 
> > Signed-off-by: Mingming Cao <cmm@...ibm.com>
> > ---
> >  fs/ext4/defrag.c  |    4 +-
> >  fs/ext4/ext4.h    |    4 +-
> >  fs/ext4/extents.c |   37 ++++++++++++---------
> >  fs/ext4/inode.c   |   93 ++++++++++++++++++++++++------------------------------
> >  fs/ext4/migrate.c |    4 +-
> >  5 files changed, 72 insertions(+), 70 deletions(-)
> > 
> > Index: linux-2.6.26-git6/fs/ext4/ext4.h
> > ===================================================================
> > --- linux-2.6.26-git6.orig/fs/ext4/ext4.h	2008-07-21 17:35:17.000000000 -0700
> > +++ linux-2.6.26-git6/fs/ext4/ext4.h	2008-07-21 17:36:17.000000000 -0700
> > @@ -1149,7 +1149,7 @@ extern void ext4_truncate (struct inode 
> >  extern void ext4_set_inode_flags(struct inode *);
> >  extern void ext4_get_inode_flags(struct ext4_inode_info *);
> >  extern void ext4_set_aops(struct inode *inode);
> > -extern int ext4_writepage_trans_blocks(struct inode *);
> > +extern int ext4_writepages_trans_blocks(struct inode *, int num);
> >  extern int ext4_block_truncate_page(handle_t *handle,
> >  		struct address_space *mapping, loff_t from);
> >  extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
> > @@ -1314,7 +1314,7 @@ extern const struct inode_operations ext
> > 
> >  /* extents.c */
> >  extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
> > -extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
> > +extern int ext4_ext_writepages_trans_blocks(struct inode *inode, int);
> >  extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> >  			ext4_lblk_t iblock,
> >  			unsigned long max_blocks, struct buffer_head *bh_result,
> > Index: linux-2.6.26-git6/fs/ext4/extents.c
> > ===================================================================
> > --- linux-2.6.26-git6.orig/fs/ext4/extents.c	2008-07-21 17:35:17.000000000 -0700
> > +++ linux-2.6.26-git6/fs/ext4/extents.c	2008-07-21 17:36:17.000000000 -0700
> > @@ -1887,11 +1887,12 @@ static int ext4_ext_rm_idx(handle_t *han
> > 
> >  /*
> >   * ext4_ext_calc_credits_for_insert:
> > - * This routine returns max. credits that the extent tree can consume.
> > + * This routine returns max. credits that needed to insert an extent
> > + * to the extent tree.
> >   * It should be OK for low-performance paths like ->writepage()
> >   * To allow many writing processes to fit into a single transaction,
> > - * the caller should calculate credits under i_data_sem and
> > - * pass the actual path.
> > + * When pass the actual path, the caller should calculate credits
> > + * under i_data_sem.
> >   */
> >  int ext4_ext_calc_credits_for_insert(struct inode *inode,
> >  						struct ext4_ext_path *path)
> > @@ -1930,9 +1931,6 @@ int ext4_ext_calc_credits_for_insert(str
> >  	 */
> >  	needed += (depth * 2) + (depth * 2);
> > 
> > -	/* any allocation modifies superblock */
> > -	needed += 1;
> > -
> 
> 
> Why are we dropping the super block modification credit ? An insert of
> an extent can result in super block modification also. If the goal is to
> use ext4_writepages_trans_blocks everywhere then this change is correct.
> But i see  many place not using ext4_writepages_trans_blocks.
> 
ext4_writepages_trans_blocks() will take care of the credits need for
updating superblock, for just once.ext4_ext_calc_credits_for_insert()
calculate the credits needed for inserting a single extent, You could
see in ext4_writepages_trans_blocks(), it will multiple it will the
total numberof blocks. If we account for superblock credit in
ext4_ext_calc_credits_for_insert(),  then super block updates for
multiple extents allocation will be overaccounted, and have to remove
that later in ext4_writepages_trans_blocks()

Other places calling  ext4_ext_calc_credits_for_insert() (mostly
migrate.c and defrag,c) are updated to add extra 4 credits, including
superblock, inode block and quota blocks.

> You also need to update the function comment saying that super block
> update is not accounted.Also it doesn't account for block bitmap,
> group descriptor and inode block update.
> 

Credits for block bitmap and group descriptors are accounted in
ext4_ext_calc_credits_for_insert()

inode block update is only accounted once per writepages, so it's
accounted in
ext4_writepages_trans_blocks()/ext4_ext_writepages_trans_blocks()

> >  	return needed;
> >  }
> > 
> > @@ -2940,8 +2938,8 @@ void ext4_ext_truncate(struct inode *ino
> >  	/*
> >  	 * probably first extent we're gonna free will be last in block
> >  	 */
> > -	err = ext4_writepage_trans_blocks(inode) + 3;
> > -	handle = ext4_journal_start(inode, err);
> > +	handle = ext4_journal_start(inode,
> > +				    ext4_writepages_trans_blocks(inode, 1) + 3);
> 
> 
> What is +3 for ? Can you add a comment on that.  I understand that it
> was there before. I guess +3 is not needed.?
> 
 I guess it was there for superblock +inode block + quota block, but
actually superblock and inode block and quota blocks are already
accounted, it probably could be removed. 
I did not pay a lot attention  to it, I guess  a little overbooking
probably safer, I could remove it if you feel strong about it.

> 
> 
> >  	if (IS_ERR(handle))
> >  		return;
> > 
> > @@ -2994,18 +2992,28 @@ out_stop:
> >  }
> > 
> >  /*
> > - * ext4_ext_writepage_trans_blocks:
> > + * ext4_ext_writepages_trans_blocks:
> >   * calculate max number of blocks we could modify
> >   * in order to allocate new block for an inode
> >   */
> > -int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
> > +int ext4_ext_writepages_trans_blocks(struct inode *inode, int num)
> >  {
> 
> I guess the name is misleading. @num above is number of blocks. how
> about ext4_ext_writeblocks_trans(struct inode *node, int nrblocks)
> 
> 

> Also add a comment stating we consider the worst case where each block
> can result in an extent.
> 

I will add a comment about the worse case,  and change num to nrblocks.

> >  	int needed;
> > 
> > +	/* cost of adding a single extent:
> > +	 * index blocks, leafs, bitmaps,
> > +	 * groupdescp
> > +	 */
> >  	needed = ext4_ext_calc_credits_for_insert(inode, NULL);
> > 
> > -	/* caller wants to allocate num blocks, but note it includes sb */
> > -	needed = needed * num - (num - 1);
> > +	/*
> > +	 * For data=journalled mode need to account for the data blocks
> > +	 * Also need to add super block and inode block
> > +	 */
> > +	if (ext4_should_journal_data(inode))
> > +		needed = num * (needed + 1)  + 2;
> > +	else
> > +		needed = num * needed  + 2;
> > 
> 
> 
> We are not accounting here for the bitmap and group descriptor.
> ext4_ext_calc_credits_for_insert is not accounting for the credit need
> to update bitmap and group descriptor.


>  We also need to account updating
> bitmap and group descriptor for new blocks that would be allocated
> as a part of extent insert.
> 
> 

No need for that. It's being accounted in
ext4_ext_calc_credits_for_insert(). In the worse case the tree depth is
5, inserting a extent would require a 2 updates (bitmap and group
descriptor) for each level (including the leaf, the new blocks), for old
tree and new tree.

/*
         * Index split can happen, we would need:
         *    allocate intermediate indexes (bitmap + group)
         *  + change two blocks at each level, but root (already
included)
         */
        needed += (depth * 2) + (depth * 2);



> >  #ifdef CONFIG_QUOTA
> >  	needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> > @@ -3074,10 +3082,9 @@ long ext4_fallocate(struct inode *inode,
> >  	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> >  							- block;
> >  	/*
> > -	 * credits to insert 1 extent into extent tree + buffers to be able to
> > -	 * modify 1 super block, 1 block bitmap and 1 group descriptor.
> > +	 * credits to insert 1 extent into extent tree
> >  	 */
> > -	credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
> > +	credits =  EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> 
> 
> 
> Why is this change ?

 modify for block bitmap and  group descriptor  is included in
EXT4_DATA_TRANS_BLOCKS(inode->i_sb)-> (EXT4_SINGLEDATA_TRANS_BLOCKS(sb) 

/* Define the number of blocks we need to account to a transaction to
 * modify one block of data.
 *
 * We may have to touch one inode, one bitmap buffer, up to three
 * indirection blocks, the group and superblock summaries, and the data
 * block to complete the transaction.
 *
 * For extents-enabled fs we may have to allocate and modify up to
 * 5 levels of tree + root which are stored in the inode. */

#define EXT4_SINGLEDATA_TRANS_BLOCKS(sb)
\
        (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)
\
                || test_opt(sb, EXTENTS) ? 27U : 8U)



others including superblock, inode block, quota and xattr blocks are
also included in 


/* Define the minimum size for a transaction which modifies data.  This
 * needs to take into account the fact that we may end up modifying two
 * quota files too (one for the group, one for the user quota).  The
 * superblock only gets updated once, of course, so don't bother
 * counting that again for the quota updates. */

#define EXT4_DATA_TRANS_BLOCKS(sb)
(EXT4_SINGLEDATA_TRANS_BLOCKS(sb) + \
                                         EXT4_XATTR_TRANS_BLOCKS - 2 + \
                                         2*EXT4_QUOTA_TRANS_BLOCKS(sb))


> 
> >  	mutex_lock(&inode->i_mutex);
> >  retry:
> >  	while (ret >= 0 && ret < max_blocks) {
> > Index: linux-2.6.26-git6/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.26-git6.orig/fs/ext4/inode.c	2008-07-21 17:35:17.000000000 -0700
> > +++ linux-2.6.26-git6/fs/ext4/inode.c	2008-07-21 17:44:45.000000000 -0700
> > @@ -1015,15 +1015,6 @@ static void ext4_da_update_reserve_space
> > 
> >  /* Maximum number of blocks we map for direct IO at once. */
> >  #define DIO_MAX_BLOCKS 4096
> > -/*
> > - * Number of credits we need for writing DIO_MAX_BLOCKS:
> > - * We need sb + group descriptor + bitmap + inode -> 4
> > - * For B blocks with A block pointers per block we need:
> > - * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
> > - * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
> > - */
> > -#define DIO_CREDITS 25
> > -
> > 
> >  /*
> >   *
> > @@ -1142,13 +1133,13 @@ int ext4_get_block(struct inode *inode, 
> >  	handle_t *handle = ext4_journal_current_handle();
> >  	int ret = 0, started = 0;
> >  	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > +	int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> 
> Again this should be
> 	int dio_credits = ext4_writeblocks_trans(inode, DIO_MAX_BLOCKS);
> 
No. DIO case is different than writepages(). When get_block() is called
from DIO path, the get_block()  is called in a loop, and the credit is
reserved inside the loop. Each time get_block(),will return a single
extent, so we should use   EXT4_DATA_TRANS_BLOCKS(inode->i_sb) which is
calculate a single chunk of allocation credits.

ext4_da_writepages() is different, we have to reserve the credits
outside the loop of calling get_block(), since the underlying
get_block() could be called multiple times, we need to worse case, so
ext4_writepages_trans_blocks() is called to handling the worse case.

> > 
> >  	if (create && !handle) {
> >  		/* Direct IO write... */
> >  		if (max_blocks > DIO_MAX_BLOCKS)
> >  			max_blocks = DIO_MAX_BLOCKS;
> > -		handle = ext4_journal_start(inode, DIO_CREDITS +
> > -			      2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
> > +		handle = ext4_journal_start(inode, dio_credits);
> >  		if (IS_ERR(handle)) {
> >  			ret = PTR_ERR(handle);
> >  			goto out;
> > @@ -1327,7 +1318,7 @@ static int ext4_write_begin(struct file 
> >  				struct page **pagep, void **fsdata)
> >  {
> >   	struct inode *inode = mapping->host;
> > -	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> > +	int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1);
> >  	handle_t *handle;
> >  	int retries = 0;
> >   	struct page *page;
> > @@ -2179,18 +2170,7 @@ static int ext4_da_writepage(struct page
> >  	return ret;
> >  }
> > 
> > -/*
> > - * For now just follow the DIO way to estimate the max credits
> > - * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> > - * todo: need to calculate the max credits need for
> > - * extent based files, currently the DIO credits is based on
> > - * indirect-blocks mapping way.
> > - *
> > - * Probably should have a generic way to calculate credits
> > - * for DIO, writepages, and truncate
> > - */
> >  #define EXT4_MAX_WRITEBACK_PAGES      DIO_MAX_BLOCKS
> > -#define EXT4_MAX_WRITEBACK_CREDITS    DIO_CREDITS
> > 
> >  static int ext4_da_writepages(struct address_space *mapping,
> >  				struct writeback_control *wbc)
> > @@ -2210,13 +2190,8 @@ static int ext4_da_writepages(struct add
> >  	if (!mapping->nrpages)
> >  		return 0;
> > 
> > -	/*
> > -	 * Estimate the worse case needed credits to write out
> > -	 * EXT4_MAX_BUF_BLOCKS pages
> > -	 */
> > -	needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
> > -
> >  	to_write = wbc->nr_to_write;
> > +
> >  	if (!wbc->range_cyclic) {
> >  		/*
> >  		 * If range_cyclic is not set force range_cont
> > @@ -2227,6 +2202,20 @@ static int ext4_da_writepages(struct add
> >  	}
> > 
> >  	while (!ret && to_write) {
> > +		/*
> > +		 * set the max dirty pages could be write at a time
> > +		 * to fit into the reserved transaction credits
> > +		 */
> > +		if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> > +			wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
> > +
> > +		/*
> > +		 * Estimate the worse case needed credits to write out
> > +		 * to_write pages
> > +		 */
> > +		needed_blocks = ext4_writepages_trans_blocks(inode,
> > +							     wbc->nr_to_write);
> > +
> >  		/* start a new transaction*/
> >  		handle = ext4_journal_start(inode, needed_blocks);
> >  		if (IS_ERR(handle)) {
> > @@ -2246,12 +2235,6 @@ static int ext4_da_writepages(struct add
> >  			}
> > 
> >  		}
> > -		/*
> > -		 * set the max dirty pages could be write at a time
> > -		 * to fit into the reserved transaction credits
> > -		 */
> > -		if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> > -			wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
> > 
> >  		to_write -= wbc->nr_to_write;
> >  		ret = mpage_da_writepages(mapping, wbc,
> > @@ -2612,7 +2595,8 @@ static int __ext4_journalled_writepage(s
> >  	 * references to buffers so we are safe */
> >  	unlock_page(page);
> > 
> > -	handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > +	handle = ext4_journal_start(inode,
> > +				    ext4_writepages_trans_blocks(inode, 1));
> >  	if (IS_ERR(handle)) {
> >  		ret = PTR_ERR(handle);
> >  		goto out;
> > @@ -4286,20 +4270,20 @@ int ext4_getattr(struct vfsmount *mnt, s
> >  /*
> >   * How many blocks doth make a writepage()?
> >   *
> > - * With N blocks per page, it may be:
> > - * N data blocks
> > + * With N blocks per page,  and P pages, it may be:
> > + * N*P data blocks
> >   * 2 indirect block
> >   * 2 dindirect
> >   * 1 tindirect
> > - * N+5 bitmap blocks (from the above)
> > - * N+5 group descriptor summary blocks
> > + * N*P+5 bitmap blocks (from the above)
> > + * N*P+5 group descriptor summary blocks
> >   * 1 inode block
> >   * 1 superblock.
> >   * 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
> >   *
> > - * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> > + * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> >   *
> > - * With ordered or writeback data it's the same, less the N data blocks.
> > + * With ordered or writeback data it's the same, less the N*P data blocks.
> >   *
> >   * If the inode's direct blocks can hold an integral number of pages then a
> >   * page cannot straddle two indirect blocks, and we can only touch one indirect
> > @@ -4310,19 +4294,15 @@ int ext4_getattr(struct vfsmount *mnt, s
> >   * block and work out the exact number of indirects which are touched.  Pah.
> >   */
> > 
> > -int ext4_writepage_trans_blocks(struct inode *inode)
> > +static int ext4_writepages_trans_blocks_old(struct inode *inode, int num)
> >  {
> 
> a better name would be
> 
> static int ext4_writeblocks_trans_old(struct inode *inode, int nrblocks)
> 
> 
> 
> > -	int bpp = ext4_journal_blocks_per_page(inode);
> > -	int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
> > +	int indirects = (EXT4_NDIR_BLOCKS % num) ? 5 : 3;
> >  	int ret;
> > 
> > -	if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> > -		return ext4_ext_writepage_trans_blocks(inode, bpp);
> > -
> >  	if (ext4_should_journal_data(inode))
> > -		ret = 3 * (bpp + indirects) + 2;
> > +		ret = 3 * (num + indirects) + 2;
> >  	else
> > -		ret = 2 * (bpp + indirects) + 2;
> > +		ret = 2 * (num + indirects) + 2;
> 
> 
> With non journalled moded we should just decrease num. I guess the above
> should be
> 	if (ext4_should_journal_data(inode))
> 		ret = 3 * (num + indirects) + 2;
> 	else
> 		ret = 3 * (num + indirects) + 2 - num;
> 
> 
> 

Well, I think in the journalled mode we need to journal the content of
the indirect/double indirect blocks also, no?


Mingming

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ