Message-id: <016301d14f72$0a449d70$1ecdd850$@samsung.com>
Date: Fri, 15 Jan 2016 16:51:37 +0800
From: Chao Yu <chao2.yu@...sung.com>
To: 'Jaegeuk Kim' <jaegeuk@...nel.org>
Cc: linux-kernel@...r.kernel.org,
linux-f2fs-devel@...ts.sourceforge.net
Subject: RE: [f2fs-dev] [PATCH v2] f2fs: serialize block allocation of dio
writes to enhance multithread performance
Hi Jaegeuk,
> -----Original Message-----
> From: Jaegeuk Kim [mailto:jaegeuk@...nel.org]
> Sent: Friday, January 15, 2016 7:54 AM
> To: Chao Yu
> Cc: linux-kernel@...r.kernel.org; linux-f2fs-devel@...ts.sourceforge.net
> Subject: Re: [f2fs-dev] [PATCH v2] f2fs: serialize block allocation of dio writes to enhance
> multithread performance
>
> Hi Chao,
>
> On Wed, Jan 13, 2016 at 08:13:36PM +0800, Chao Yu wrote:
> > Hi Jaegeuk,
> >
> > > -----Original Message-----
> > > From: Chao Yu [mailto:chao2.yu@...sung.com]
> > > Sent: Tuesday, December 29, 2015 2:44 PM
> > > To: Jaegeuk Kim
> > > Cc: linux-kernel@...r.kernel.org; linux-f2fs-devel@...ts.sourceforge.net
> > > Subject: [f2fs-dev] [PATCH v2] f2fs: serialize block allocation of dio writes to enhance
> > > multithread performance
> > >
> > > When performing big dio writes concurrently, performance drops because
> > > Thread A's allocation of multiple contiguous blocks can be interrupted
> > > by Thread B. There are two cases:
> > > - Thread B may switch the current segment to a new segment for LFS
> > > allocation when it dio writes at the beginning of the file.
> > > - Thread B may allocate blocks in the middle of Thread A's allocation,
> > > making the blocks allocated by Thread A discontiguous.
> > >
> > > This patch takes the writepages mutex lock to make block allocation in
> > > dio writes atomic, avoiding the above issues.
> > >
> > > Test environment 1:
> > > Ubuntu with Linux kernel 4.4-rc4, Intel i7-3770, 16GB memory,
> > > 32GB Kingston SD card.
> > >
> > > fio --name seqw --ioengine=sync --invalidate=1 --rw=write --directory=/mnt/f2fs
> > > --filesize=256m --size=16m --bs=2m --direct=1
> > > --numjobs=10
> > >
> > > before:
> > > WRITE: io=163840KB, aggrb=5125KB/s, minb=512KB/s, maxb=776KB/s, mint=21105msec,
> > > maxt=31967msec
> > > patched:
> > > WRITE: io=163840KB, aggrb=10424KB/s, minb=1042KB/s, maxb=1172KB/s, mint=13975msec,
> > > maxt=15717msec
> > >
> > > Test environment 2:
> > > Note4 eMMC
> > >
> > > fio --name seqw --ioengine=sync --invalidate=1 --rw=write --directory=/data/test/
> > > --filesize=256m --size=64m --bs=2m --direct=1
> > > --numjobs=16
> > >
> > > before:
> > > WRITE: io=1024.0MB, aggrb=103583KB/s, minb=6473KB/s, maxb=8806KB/s, mint=7442msec,
> > > maxt=10123msec
> > > patched:
> > > WRITE: io=1024.0MB, aggrb=124860KB/s, minb=7803KB/s, maxb=9315KB/s, mint=7035msec,
> > > maxt=8398msec
> > >
> > > As Yunlei He reported when he tested with the current patch:
> > > "Does sharing the writepages mutex lock have an effect on cache write?
> > > Here is AndroBench result on my phone:
> > >
> > > Before patch:
> > >                    1R1W    8R8W    16R16W
> > > Sequential Write  161.31  163.85   154.67
> > > Random Write        9.48   17.66    18.09
> > >
> > > After patch:
> > >                    1R1W    8R8W    16R16W
> > > Sequential Write  159.61  157.24   160.11
> > > Random Write        9.17    8.51     8.8
> > >
> > > Unit: MB/s, File size: 64M, Buffer size: 4K"
> > >
> > > The truth is that Androbench uses a single thread with dio writes to
> > > test sequential write performance, and multiple threads with dio writes
> > > to test random write performance. So we cannot see any improvement in
> > > the sequential write test, since serializing dio block allocation can
> > > only improve performance in multi-threaded scenarios. And there is a
> > > regression in the multi-threaded test with 4k dio writes: grabbing the
> > > sbi->writepages lock to serialize block allocation stops the
> > > concurrency, so fewer small dio bios can be merged; moreover, when there
> > > is a huge number of small dio writes, taking the mutex per dio adds
> > > overhead.
> >
> > Since all the DIOs in Androbench are IPU, in the mutex_lock/allocate_block/
> > mutex_unlock path we don't actually allocate any blocks; we only block
> > concurrency by grabbing sbi->writepages.
>
> Hmm, DIO doesn't grab writepages.
Yep, the writepages mutex lock has not been added in ->direct_IO yet.
Previously, I intended to add this lock in ->direct_IO, but as Yunlei
reported, there would be a regression when he tested that patch with
Androbench.
So what I said above was an attempt to explain the reason for that
regression.
Below are results from testing the current v2 patch ("f2fs: serialize
block allocation of dio writes to enhance multithread performance") on a
Note4. In each table, 'threshold' is the serialized_dio_pages setting used
for that column: 'serialize all' means every DIO takes the lock,
'serialize partial' only the bigger ones, and 'serialize none' no DIO at
all.
a) DIO Append Write:
IO size: 4K
fio --name seqw --ioengine=sync --invalidate=1 --rw=write --directory=/data/test/ --filesize=16m --size=16m --bs=4k --direct=1
--numjobs=32
             serialize all    serialize none
threshold          1                 2
1              45637             42753
2              40101             39263
3              44423             34576
4              43426             38701
5              39730             39000
6              39927             40870
7              41845             38099
8              41936             37691
average    42128.125         38869.125
IO size: 128K
fio --name seqw --ioengine=sync --invalidate=1 --rw=write --directory=/data/test/ --filesize=32m --size=32m --bs=128k --direct=1
--numjobs=32
             serialize all    serialize none
threshold         32                33
1             133712            105660
2             134899            113765
3             134969            100669
4             134157            117711
5             134657            114988
6             134277            112871
7             134761            116651
8             133270            118764
average    134337.75        112634.875
IO size: 2M
fio --name seqw --ioengine=sync --invalidate=1 --rw=write --directory=/data/test/ --filesize=64m --size=64m --bs=2m --direct=1
--numjobs=32
             serialize all    serialize none
threshold        512               513
1             133542            128572
2             133934            130843
3             132714            130566
4             132437            130663
5             133068            130826
6             132362            132120
7             134149            129645
8             132975            130080
average   133147.625        130414.375
IO size: 4K-1024K
fio --name seqw --ioengine=sync --invalidate=1 --rw=write --directory=/data/test/ --filesize=32m --size=32m --bsrange=4k-1024k
--direct=1 --numjobs=32
             serialize all    serialize partial    serialize none
threshold          1                  33                 257
1             118724              101922              109226
2             114174              108503              108435
3             115673              109890              106605
4             111824              112762              114648
5             110144              102300              115253
6             108626              107063              113963
7             114824              113469              108234
8             102610              110609              109397
average   112074.875           108314.75          110720.125
b) DIO Random Overwrite:
IO size: 4K
fio --name randw --ioengine=sync --invalidate=1 --rw=randwrite --directory=/data/test/ --filesize=64m --size=16m --bs=4k --direct=1
--numjobs=32
             serialize all    serialize none
threshold          1                 2
1              17454             17340
2              16266             17829
3              17167             17286
4              16746             17578
5              17250             18148
6              17260             17589
7              17146             16568
8              17106             17725
average    17049.375         17507.875
IO size: 32K
fio --name randw --ioengine=sync --invalidate=1 --rw=randwrite --directory=/data/test/ --filesize=64m --size=16m --bs=32k --direct=1
--numjobs=32
             serialize all    serialize none
threshold          8                 9
1              47779             39845
2              34887             42638
3              46987             42380
4              45138             46500
5              41862             49044
6              40061             49643
7              42729             43647
8              40441             46030
average      42485.5         44965.875
IO size: 4K-1024K
fio --name randw --ioengine=sync --invalidate=1 --rw=randwrite --directory=/data/test/ --filesize=64m --size=16m --bsrange=4k-1024k
--direct=1 --numjobs=32
             serialize all    serialize partial    serialize none
threshold          1                  33                 257
1              91235               95528               80236
2              90882               86887               94293
3              83446               92117               89710
4              89083               84344               92902
5              94335               85906               94487
6              84371               95779               89399
7              95597               87964               89264
8              94352               94394               92374
average    90412.625           90364.875           90333.125
We can see that in all cases of append DIO write, serializing the DIOs
leads to better performance. So what do you think about taking the
writepages lock only when the write offset is beyond i_size?
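Something like this minimal sketch in ->direct_IO (only to illustrate the
idea; the append-only condition and the locking granularity here are my
assumption, not a posted patch):

	if (rw == WRITE) {
		/* only appending writes need to allocate new blocks */
		bool appending = offset + count > i_size_read(inode);

		if (appending)
			mutex_lock(&sbi->writepages);
		err = __allocate_data_blocks(inode, offset, count);
		if (appending)
			mutex_unlock(&sbi->writepages);
		if (err)
			goto out;
	}
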
> And, if we need to check every time, we can lose bandwidth on every new
> appending write.
>
> How about adding a condition to check whether we need to allocate or not?
>
> For example,
> if (i_size == calculated i_blocks && offset < i_size)
> skip allocation;
Yeah, that could be a faster way; I will take it. And I expect the
'fallocate beyond isize' and 'punch hole inside isize' cases to be rare
ones, so our check should be accurate most of the time. :)
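A rough sketch of the check (a hypothetical helper, and simplified: the
real i_blocks is kept in 512-byte sectors and also counts node blocks, so
the actual condition needs the proper conversion):

	/*
	 * If i_blocks matches what i_size implies (no holes) and the
	 * write stays inside i_size, every target block should already
	 * have an address, so the IPU path can skip allocation.
	 */
	static bool f2fs_dio_skip_allocation(struct inode *inode,
					loff_t offset, size_t count)
	{
		loff_t isize = i_size_read(inode);
		/* blocks implied by i_size, rounding up a partial block */
		blkcnt_t expected = F2FS_BYTES_TO_BLK(isize +
						F2FS_BLKSIZE - 1);

		return inode->i_blocks == expected &&
					offset + count <= isize;
	}

With such a helper returning true, ->direct_IO could skip both the
writepages mutex and __allocate_data_blocks() for pure overwrites.
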
Thanks,
>
> Thanks,
>
> >
> > How about grabbing sbi->writepages when it needs to allocate block(s) like
> > following patch?
> >
> > ---
> > fs/f2fs/data.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 file changed, 51 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> > index ac9e7c6..4439a85 100644
> > --- a/fs/f2fs/data.c
> > +++ b/fs/f2fs/data.c
> > @@ -458,6 +458,44 @@ got_it:
> >  	return page;
> >  }
> >
> > +/*
> > + * Return 1 if any block in [offset, offset + count) is unallocated,
> > + * 0 if all target blocks are already allocated, or a negative errno.
> > + */
> > +static int __need_allocate_blocks(struct inode *inode, loff_t offset,
> > +							size_t count)
> > +{
> > +	struct dnode_of_data dn;
> > +	pgoff_t start = F2FS_BYTES_TO_BLK(offset);
> > +	pgoff_t end = start + F2FS_BYTES_TO_BLK(count);
> > +	pgoff_t end_offset;
> > +
> > +	while (start < end) {
> > +		int ret;
> > +
> > +		set_new_dnode(&dn, inode, NULL, NULL, 0);
> > +		ret = get_dnode_of_data(&dn, start, LOOKUP_NODE);
> > +		if (ret == -ENOENT)
> > +			return 1;
> > +		else if (ret)
> > +			return ret;
> > +
> > +		end_offset = ADDRS_PER_PAGE(dn.node_page, F2FS_I(inode));
> > +
> > +		while (dn.ofs_in_node < end_offset && start < end) {
> > +			block_t blkaddr;
> > +
> > +			blkaddr = datablock_addr(dn.node_page, dn.ofs_in_node);
> > +			if (blkaddr == NULL_ADDR || blkaddr == NEW_ADDR) {
> > +				f2fs_put_dnode(&dn);
> > +				return 1;
> > +			}
> > +			start++;
> > +			dn.ofs_in_node++;
> > +		}
> > +
> > +		f2fs_put_dnode(&dn);
> > +	}
> > +	return 0;
> > +}
> > +
> >  static int __allocate_data_block(struct dnode_of_data *dn)
> >  {
> >  	struct f2fs_sb_info *sbi = F2FS_I_SB(dn->inode);
> > @@ -1620,7 +1658,9 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
> >  	struct file *file = iocb->ki_filp;
> >  	struct address_space *mapping = file->f_mapping;
> >  	struct inode *inode = mapping->host;
> > +	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> >  	size_t count = iov_iter_count(iter);
> > +	int rw = iov_iter_rw(iter);
> >  	int err;
> >
> >  	/* we don't need to use inline_data strictly */
> > @@ -1635,20 +1675,24 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
> >  	if (err)
> >  		return err;
> >
> > -	trace_f2fs_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
> > +	trace_f2fs_direct_IO_enter(inode, offset, count, rw);
> >
> > -	if (iov_iter_rw(iter) == WRITE) {
> > -		err = __allocate_data_blocks(inode, offset, count);
> > -		if (err)
> > -			goto out;
> > +	if (rw == WRITE) {
> > +		if (__need_allocate_blocks(inode, offset, count) > 0) {
> > +			mutex_lock(&sbi->writepages);
> > +			err = __allocate_data_blocks(inode, offset, count);
> > +			mutex_unlock(&sbi->writepages);
> > +			if (err)
> > +				goto out;
> > +		}
> >  	}
> >
> >  	err = blockdev_direct_IO(iocb, inode, iter, offset, get_data_block_dio);
> >  out:
> > -	if (err < 0 && iov_iter_rw(iter) == WRITE)
> > +	if (err < 0 && rw == WRITE)
> >  		f2fs_write_failed(mapping, offset + count);
> >
> > -	trace_f2fs_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), err);
> > +	trace_f2fs_direct_IO_exit(inode, offset, count, rw, err);
> >
> >  	return err;
> >  }
> > --
> > 2.6.3
> >
> > Thanks,
> >
> > >
> > > After all, serializing dio only helps in the concurrent big-dio
> > > scenario, so this patch also introduces a threshold in sysfs that gives
> > > the user an interface for defining 'a big dio' by page count, which
> > > controls whether a dio of that size is serialized or not.
> > >
> > > The optimization takes effect only in rare scenarios.
> > >
> > > Signed-off-by: Chao Yu <chao2.yu@...sung.com>
> > > ---
> > > v2:
> > > - merge another related patch into this one.
> > > ---
> > >  Documentation/ABI/testing/sysfs-fs-f2fs | 12 ++++++++++++
> > >  fs/f2fs/data.c                          | 17 +++++++++++++----
> > >  fs/f2fs/f2fs.h                          |  3 +++
> > >  fs/f2fs/super.c                         |  3 +++
> > >  4 files changed, 31 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs
> > > index 0345f2d..560a4f1 100644
> > > --- a/Documentation/ABI/testing/sysfs-fs-f2fs
> > > +++ b/Documentation/ABI/testing/sysfs-fs-f2fs
> > > @@ -92,3 +92,15 @@ Date: October 2015
> > >  Contact:	"Chao Yu" <chao2.yu@...sung.com>
> > >  Description:
> > >  		 Controls the count of nid pages to be readaheaded.
> > > +
> > > +What:		/sys/fs/f2fs/<disk>/serialized_dio_pages
> > > +Date:		December 2015
> > > +Contact:	"Chao Yu" <chao2.yu@...sung.com>
> > > +Description:
> > > +		 It is a threshold in units of the page size.
> > > +		 If the page count of a DIO is equal to or larger than
> > > +		 the threshold, the whole block address allocation for
> > > +		 the dio pages becomes atomic, as in buffered write.
> > > +		 It is used to maximize bandwidth utilization in
> > > +		 scenarios of concurrent dio vs buffered or dio vs dio
> > > +		 writes.
> > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> > > index 8a89810..d506a0e 100644
> > > --- a/fs/f2fs/data.c
> > > +++ b/fs/f2fs/data.c
> > > @@ -1619,7 +1619,9 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
> > >  	struct file *file = iocb->ki_filp;
> > >  	struct address_space *mapping = file->f_mapping;
> > >  	struct inode *inode = mapping->host;
> > > +	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> > >  	size_t count = iov_iter_count(iter);
> > > +	int rw = iov_iter_rw(iter);
> > >  	int err;
> > >
> > >  	/* we don't need to use inline_data strictly */
> > > @@ -1634,20 +1636,27 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
> > >  	if (err)
> > >  		return err;
> > >
> > > -	trace_f2fs_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
> > > +	trace_f2fs_direct_IO_enter(inode, offset, count, rw);
> > > +
> > > +	if (rw == WRITE) {
> > > +		bool serialized = (F2FS_BYTES_TO_BLK(count) >=
> > > +					sbi->serialized_dio_pages);
> > >
> > > -	if (iov_iter_rw(iter) == WRITE) {
> > > +		if (serialized)
> > > +			mutex_lock(&sbi->writepages);
> > >  		err = __allocate_data_blocks(inode, offset, count);
> > > +		if (serialized)
> > > +			mutex_unlock(&sbi->writepages);
> > >  		if (err)
> > >  			goto out;
> > >  	}
> > >
> > >  	err = blockdev_direct_IO(iocb, inode, iter, offset, get_data_block_dio);
> > >  out:
> > > -	if (err < 0 && iov_iter_rw(iter) == WRITE)
> > > +	if (err < 0 && rw == WRITE)
> > >  		f2fs_write_failed(mapping, offset + count);
> > >
> > > -	trace_f2fs_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), err);
> > > +	trace_f2fs_direct_IO_exit(inode, offset, count, rw, err);
> > >
> > >  	return err;
> > >  }
> > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > > index a339508..293dc4e 100644
> > > --- a/fs/f2fs/f2fs.h
> > > +++ b/fs/f2fs/f2fs.h
> > > @@ -333,6 +333,8 @@ enum {
> > >
> > >  #define MAX_DIR_RA_PAGES	4	/* maximum ra pages of dir */
> > >
> > > +#define DEF_SERIALIZED_DIO_PAGES	64	/* default serialized dio pages */
> > > +
> > >  /* vector size for gang look-up from extent cache that consists of radix tree */
> > >  #define EXT_TREE_VEC_SIZE	64
> > >
> > > @@ -784,6 +786,7 @@ struct f2fs_sb_info {
> > >  	unsigned int total_valid_inode_count;	/* valid inode count */
> > >  	int active_logs;			/* # of active logs */
> > >  	int dir_level;				/* directory level */
> > > +	int serialized_dio_pages;		/* serialized direct IO pages */
> > >
> > >  	block_t user_block_count;		/* # of user blocks */
> > >  	block_t total_valid_block_count;	/* # of valid blocks */
> > > diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> > > index a2e3a8f..4a2e51e 100644
> > > --- a/fs/f2fs/super.c
> > > +++ b/fs/f2fs/super.c
> > > @@ -218,6 +218,7 @@ F2FS_RW_ATTR(NM_INFO, f2fs_nm_info, ram_thresh, ram_thresh);
> > >  F2FS_RW_ATTR(NM_INFO, f2fs_nm_info, ra_nid_pages, ra_nid_pages);
> > >  F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, max_victim_search, max_victim_search);
> > >  F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, dir_level, dir_level);
> > > +F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, serialized_dio_pages, serialized_dio_pages);
> > >  F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, cp_interval, cp_interval);
> > >
> > >  #define ATTR_LIST(name) (&f2fs_attr_##name.attr)
> > > @@ -234,6 +235,7 @@ static struct attribute *f2fs_attrs[] = {
> > >  	ATTR_LIST(min_fsync_blocks),
> > >  	ATTR_LIST(max_victim_search),
> > >  	ATTR_LIST(dir_level),
> > > +	ATTR_LIST(serialized_dio_pages),
> > >  	ATTR_LIST(ram_thresh),
> > >  	ATTR_LIST(ra_nid_pages),
> > >  	ATTR_LIST(cp_interval),
> > > @@ -1125,6 +1127,7 @@ static void init_sb_info(struct f2fs_sb_info *sbi)
> > >  		atomic_set(&sbi->nr_pages[i], 0);
> > >
> > >  	sbi->dir_level = DEF_DIR_LEVEL;
> > > +	sbi->serialized_dio_pages = DEF_SERIALIZED_DIO_PAGES;
> > >  	sbi->cp_interval = DEF_CP_INTERVAL;
> > >  	clear_sbi_flag(sbi, SBI_NEED_FSCK);
> > >
> > > --
> > > 2.6.3
> > >
> > >
> > >
> > > ------------------------------------------------------------------------------
> > > _______________________________________________
> > > Linux-f2fs-devel mailing list
> > > Linux-f2fs-devel@...ts.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel