lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 21 Aug 2017 10:02:20 +0200
From:   Jan Kara <jack@...e.cz>
To:     Wang Shilong <wangshilong1991@...il.com>
Cc:     linux-ext4@...r.kernel.org, sihara@....com, tytso@....edu,
        adilger@...ger.ca, lixi@....com, wshilong@....com, jack@...e.cz
Subject: Re: [PATCH 3/3 v5] ext4: reduce lock contention in __ext4_new_inode

On Sun 20-08-17 19:54:37, Wang Shilong wrote:
> From: Wang Shilong <wshilong@....com>
> 
> While running number of creating file threads concurrently,
> we found heavy lock contention on group spinlock:
> 
> FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)
> ext4_create                    1707443399           1440000      1185.72
> _raw_spin_lock                 1317641501           180899929    7.28
> jbd2__journal_start            287821030            1453950      197.96
> jbd2_journal_get_write_access  33441470             73077185     0.46
> ext4_add_nondir                29435963             1440000      20.44
> ext4_add_entry                 26015166             1440049      18.07
> ext4_dx_add_entry              25729337             1432814      17.96
> ext4_mark_inode_dirty          12302433             5774407      2.13
> 
> most of cpu time blames to _raw_spin_lock, here is some testing
> numbers with/without patch.
> 
> Test environment:
> Server : SuperMicro Sever (2 x E5-2690 v3@...0GHz, 128GB 2133MHz
>          DDR4 Memory, 8GbFC)
> Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
>           Read Intensive SSD)
> 
> format command:
>         mkfs.ext4 -J size=4096
> 
> test command:
>         mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
>                 -r -i 1 -v -p 10 -u #first run to load inode
> 
>         mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
>                 -r -i 3 -v -p 10 -u
> 
> Kernel version: 4.13.0-rc3
> 
> Test  1,440,000 files with 48 directories by 48 processes:
> 
> Without patch:
> 
> File Creation   File removal
> 79,033          289,569 ops/per second
> 81,463          285,359
> 79,875          288,475
> 
> With patch:
> File Creation   File removal
> 810669		301694
> 812805		302711
> 813965		297670
> 
> Creation performance is improved more than 10X with large
> journal size. The main problem here is we test bitmap
> and do some check and journal operations which could be
> slept, then we test and set with lock hold, this could
> be racy, and make 'inode' steal by other process.
> 
> However, after first try, we could confirm handle has
> been started and inode bitmap journaled too, then
> we could find and set bit with lock hold directly, this
> will mostly gurateee success with second try.
> 
> Tested-by: Shuichi Ihara <sihara@....com>
> Signed-off-by: Wang Shilong <wshilong@....com>

I like the result! You can add:

Reviewed-by: Jan Kara <jack@...e.cz>

								Honza

> ---
> v4->v5: better refactor codes as Jan Kara suggested.
> ---
>  fs/ext4/ialloc.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 9e6eb1c..fb83a36 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -730,6 +730,27 @@ static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
>  	return ret;
>  }
>  
> +static int find_inode_bit(struct super_block *sb, ext4_group_t group,
> +			  struct buffer_head *bitmap, unsigned long *ino)
> +{
> +next:
> +	*ino = ext4_find_next_zero_bit((unsigned long *)
> +				       bitmap->b_data,
> +				       EXT4_INODES_PER_GROUP(sb), *ino);
> +	if (*ino >= EXT4_INODES_PER_GROUP(sb))
> +		return 0;
> +
> +	if ((EXT4_SB(sb)->s_journal == NULL) &&
> +	    recently_deleted(sb, group, *ino)) {
> +		*ino = *ino + 1;
> +		if (*ino < EXT4_INODES_PER_GROUP(sb))
> +			goto next;
> +		return 0;
> +	}
> +
> +	return 1;
> +}
> +
>  /*
>   * There are two policies for allocating an inode.  If the new inode is
>   * a directory, then a forward search is made for a block group with both
> @@ -910,21 +931,16 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>  		}
>  
>  repeat_in_this_group:
> -		ino = ext4_find_next_zero_bit((unsigned long *)
> -					      inode_bitmap_bh->b_data,
> -					      EXT4_INODES_PER_GROUP(sb), ino);
> -		if (ino >= EXT4_INODES_PER_GROUP(sb))
> +		ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
> +		if (!ret2)
>  			goto next_group;
> -		if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
> +
> +		if (group == 0 && (ino + 1) < EXT4_FIRST_INO(sb)) {
>  			ext4_error(sb, "reserved inode found cleared - "
>  				   "inode=%lu", ino + 1);
>  			goto next_group;
>  		}
> -		if ((EXT4_SB(sb)->s_journal == NULL) &&
> -		    recently_deleted(sb, group, ino)) {
> -			ino++;
> -			goto next_inode;
> -		}
> +
>  		if (!handle) {
>  			BUG_ON(nblocks <= 0);
>  			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
> @@ -944,11 +960,23 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>  		}
>  		ext4_lock_group(sb, group);
>  		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
> +		if (ret2) {
> +			/* Someone already took the bit. Repeat the search
> +			 * with lock held.
> +			 */
> +			ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
> +			if (ret2) {
> +				ext4_set_bit(ino, inode_bitmap_bh->b_data);
> +				ret2 = 0;
> +			} else {
> +				ret2 = 1; /* we didn't grab the inode */
> +			}
> +		}
>  		ext4_unlock_group(sb, group);
>  		ino++;		/* the inode bitmap is zero-based */
>  		if (!ret2)
>  			goto got; /* we grabbed the inode! */
> -next_inode:
> +
>  		if (ino < EXT4_INODES_PER_GROUP(sb))
>  			goto repeat_in_this_group;
>  next_group:
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR

Powered by blists - more mailing lists