linux-ext4 - Re: [RFC PATCH] ext4: fix 50% disk write performance regression

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4C7BE4DD.1060208@redhat.com>
Date:	Mon, 30 Aug 2010 12:05:33 -0500
From:	Eric Sandeen <sandeen@...hat.com>
To:	Bill Fink <billfink@...dspring.com>
CC:	tytso@....edu, adilger@....com, linux-ext4@...r.kernel.org,
	bill.fink@...a.gov
Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

Bill Fink wrote:
> A 50% ext4 disk write performance regression was introduced
> in 2.6.32 and still exists in 2.6.35, although somewhat improved
> from 2.6.32.  Read performance was not affected).
> 
> 2.6.31 disk write performance (RAID5 with 8 disks):
> 
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 49.7106 s, 691 MB/s
> 
> 2.6.32 disk write performance (RAID5 with 8 disks):
> 
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 100.395 s, 342 MB/s
> 
> 2.6.35 disk write performance (RAID5 with 8 disks):
> 
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 75.7265 s, 454 MB/s
> 
> A git bisect targetted commit 55138e0bc29c0751e2152df9ad35deea542f29b3
> (ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks).
> Specifically the performance issue is caused by the use of the function
> ext4_num_dirty_pages.
> 
> The included patch avoids calling ext4_num_dirty_pages
> (and removes its definition) by unconditionally setting
> desired_nr_to_write to wbc->nr_to_write * 8.
> 
> With the patch, the disk write performance is back to
> approximately 2.6.31 performance levels.

Firstly, thanks very much for tracking that down.  I've had various &
sundry reports of slowdowns but I'd never really gotten to the bottom
of it with a simple testcase somehow.

When I get some time (soon I hope) I'll look into the ramifications
of this change (i.e. what if wbc->nr_to_write * 8 is more than the dirty
pages, do things work out ok?) but it seems pretty reasonable.

Since the commit was Ted's originally, perhaps he has some more
immediate comments.

Thanks a ton!

-Eric

> 2.6.35+patch disk write performance (RAID5 with 8 disks):
> 
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 50.7234 s, 677 MB/s
> 
> Since I'm no expert in this area, I'm submitting this
> RFC patch against 2.6.35.  I'm not sure what all the
> ramifications of my suggested change would be.  However,
> to my admittedly novice eyes, it doesn't seem to be an
> unreasonable change.  Also, subjectively from building
> kernels on a RAID5 ext4 filesystem using the patched
> 2.6.35 kernel (via make -j 8), I didn't notice any issues,
> and it actually seemed more responsive than when using
> the unpatched 2.6.35 kernel.
> 
> 					-Bill
> 
> P.S.  I am not subscribed to the linux-ext4 e-mail list,
>       plus this is my very first attempted linux kernel
>       patch submission.
> 
> 
> 
> Partially revert 55138e0bc29c0751e2152df9ad35deea542f29b3
> (ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks)
> to fix a 50% ext4 disk write performance regression introduced
> between 2.6.31 and 2.6.32.
> 
> Signed-off-by: Bill Fink <bill.fink@...a.gov>
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 42272d6..f6e639b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1143,64 +1143,6 @@ static int check_block_validity(struct inode *inode, const char *func,
>  }
>  
>  /*
> - * Return the number of contiguous dirty pages in a given inode
> - * starting at page frame idx.
> - */
> -static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> -				    unsigned int max_pages)
> -{
> -	struct address_space *mapping = inode->i_mapping;
> -	pgoff_t	index;
> -	struct pagevec pvec;
> -	pgoff_t num = 0;
> -	int i, nr_pages, done = 0;
> -
> -	if (max_pages == 0)
> -		return 0;
> -	pagevec_init(&pvec, 0);
> -	while (!done) {
> -		index = idx;
> -		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> -					      PAGECACHE_TAG_DIRTY,
> -					      (pgoff_t)PAGEVEC_SIZE);
> -		if (nr_pages == 0)
> -			break;
> -		for (i = 0; i < nr_pages; i++) {
> -			struct page *page = pvec.pages[i];
> -			struct buffer_head *bh, *head;
> -
> -			lock_page(page);
> -			if (unlikely(page->mapping != mapping) ||
> -			    !PageDirty(page) ||
> -			    PageWriteback(page) ||
> -			    page->index != idx) {
> -				done = 1;
> -				unlock_page(page);
> -				break;
> -			}
> -			if (page_has_buffers(page)) {
> -				bh = head = page_buffers(page);
> -				do {
> -					if (!buffer_delay(bh) &&
> -					    !buffer_unwritten(bh))
> -						done = 1;
> -					bh = bh->b_this_page;
> -				} while (!done && (bh != head));
> -			}
> -			unlock_page(page);
> -			if (done)
> -				break;
> -			idx++;
> -			num++;
> -			if (num >= max_pages)
> -				break;
> -		}
> -		pagevec_release(&pvec);
> -	}
> -	return num;
> -}
> -
> -/*
>   * The ext4_map_blocks() function tries to look up the requested blocks,
>   * and returns if the blocks are already mapped.
>   *
> @@ -2972,15 +2914,10 @@ static int ext4_da_writepages(struct address_space *mapping,
>  	 * contiguous.  Unfortunately this brings us to the second
>  	 * stupidity, which is that ext4's mballoc code only allocates
>  	 * at most 2048 blocks.  So we force contiguous writes up to
> -	 * the number of dirty blocks in the inode, or
> -	 * sbi->max_writeback_mb_bump whichever is smaller.
> +	 * sbi->max_writeback_mb_bump
>  	 */
>  	max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
> -	if (!range_cyclic && range_whole)
> -		desired_nr_to_write = wbc->nr_to_write * 8;
> -	else
> -		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> -							   max_pages);
> +	desired_nr_to_write = wbc->nr_to_write * 8;
>  	if (desired_nr_to_write > max_pages)
>  		desired_nr_to_write = max_pages;
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html