linux-kernel - Re: write-behind on streaming writes

Message-ID: <20120606170428.GB8133@redhat.com>
Date:	Wed, 6 Jun 2012 13:04:28 -0400
From:	Vivek Goyal <vgoyal@...hat.com>
To:	Fengguang Wu <fengguang.wu@...el.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	"Myklebust, Trond" <Trond.Myklebust@...app.com>,
	linux-fsdevel@...r.kernel.org,
	Linux Memory Management List <linux-mm@...ck.org>,
	Jens Axboe <axboe@...nel.dk>
Subject: Re: write-behind on streaming writes

On Wed, Jun 06, 2012 at 10:00:58PM +0800, Fengguang Wu wrote:
> On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:
> > On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> > > On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@...hat.com> wrote:
> > > >
> > > > I had expected a bigger difference as sync_file_range() is just driving
> > > > max queue depth of 32 (total 16MB IO in flight), while flushers are
> > > > driving queue depths up to 140 or so. So in this particular test, driving
> > > > much deeper queue depths is not really helping much. (I have seen higher
> > > > throughputs with higher queue depths in the past. Not sure why we don't
> > > > see it here).
> > > 
> > > How did interactivity feel?
> > > 
> > > Because quite frankly, if the throughput difference is 12.5 vs 12
> > > seconds, I suspect the interactivity thing is what dominates.
> > > 
> > > > And from my memory, the interactivity difference was absolutely
> > > *huge*. Even back when I used rotational media, I basically couldn't
> > > even notice the background write with the sync_file_range() approach.
> > > While the regular writeback without the writebehind had absolutely
> > > *huge* pauses if you used something like firefox that uses fsync()
> > > etc. And starting new applications that weren't cached was noticeably
> > > worse too - and then with sync_file_range it wasn't even all that
> > > noticeable.
> > > 
> > > NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
> > > the writeback on the same filesystem (and obviously disk) as your home
> > > directory is. If the big write is to another filesystem and another
> > > disk, I think you won't see the same issues.
> > 
> > Ok, I did the following test on my single SATA disk, and my root filesystem
> > is on this disk.
> > 
> > I dropped caches and launched firefox and monitored the time it takes
> > for firefox to start. (cache cold).
> > 
> > And my results are the reverse of what you have been seeing. With
> > sync_file_range() running, firefox takes roughly 30 seconds to start and
> > with flusher in operation, it takes roughly 20 seconds to start. (I have
> > approximated the average of 3 runs for simplicity).
> > 
> > I think it is happening because sync_file_range() will send all
> > the writes as SYNC and it will compete with firefox IO. On the other
> > hand, flusher's IO will show up as ASYNC and CFQ will penalize it
> > heavily and firefox's IO will be prioritized. And this effect should
> > just get worse as more processes do sync_file_range().
> > 
> > So write-behind should provide better interactivity if writes submitted
> > are ASYNC and not SYNC.
> 
> Hi Vivek, thanks for testing all of these out! The result is
> definitely interesting and a surprise: we overlooked the SYNC nature
> of sync_file_range().
> 
> I'd suggest using these calls to achieve the write-and-drop-behind
> behavior, *with* WB_SYNC_NONE:
> 
>         posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
>         sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);
> 
> The caveat is, the below bdi_write_congested() will never evaluate to
> true since we are only filling the request queue with 8MB data.
> 
> SYSCALL_DEFINE(fadvise64_64):
> 
>         case POSIX_FADV_DONTNEED:
>                 if (!bdi_write_congested(mapping->backing_dev_info))
>                         __filemap_fdatawrite_range(mapping, offset, endbyte,
>                                                    WB_SYNC_NONE);

Hi Fengguang,

Instead of the above, I modified sync_file_range() to call
__filemap_fdatawrite_range(WB_SYNC_NONE), and I now do see ASYNC writes
showing up at the elevator.
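
For readers following along, here is a simplified userspace model of why
the writeback mode changes the request class. It is only a sketch: the
enum and function names are illustrative stand-ins, not kernel
definitions, and it assumes the filesystem's writepage path picks the
write op from wbc->sync_mode (WB_SYNC_ALL gives a SYNC write, anything
else an ASYNC one), as __block_write_full_page() does.

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's writeback sync modes. */
enum model_sync_mode { MODEL_WB_SYNC_NONE, MODEL_WB_SYNC_ALL };

/* Illustrative stand-ins for the request classes the elevator sees. */
enum model_req_class { MODEL_REQ_ASYNC, MODEL_REQ_SYNC };

/*
 * Mode used by the SYNC_FILE_RANGE_WRITE branch in this model:
 * filemap_fdatawrite_range() implies WB_SYNC_ALL (unpatched), while
 * the test patch passes WB_SYNC_NONE explicitly (patched).
 */
static enum model_sync_mode model_sfr_write_mode(int patched)
{
	return patched ? MODEL_WB_SYNC_NONE : MODEL_WB_SYNC_ALL;
}

/* Assumed mapping from writeback mode to request class. */
static enum model_req_class model_write_class(enum model_sync_mode mode)
{
	assert(mode == MODEL_WB_SYNC_NONE || mode == MODEL_WB_SYNC_ALL);
	return mode == MODEL_WB_SYNC_ALL ? MODEL_REQ_SYNC : MODEL_REQ_ASYNC;
}
```

In this model, model_write_class(model_sfr_write_mode(0)) is
MODEL_REQ_SYNC and model_write_class(model_sfr_write_mode(1)) is
MODEL_REQ_ASYNC, which matches what shows up at the elevator before and
after the change.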

With 4 processes doing sync_file_range() now, the firefox start time test
clocks in at around 18-19 seconds, which is better than the 30-35 seconds
seen with 4 processes doing buffered writes. And the system looks pretty
good from an interactivity point of view.

Thanks
Vivek

The following is the patch I applied for testing.

---
 fs/sync.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c	2012-06-06 00:12:33.000000000 -0400
+++ linux-2.6/fs/sync.c	2012-06-06 23:11:17.050691776 -0400
@@ -342,7 +342,7 @@ SYSCALL_DEFINE(sync_file_range)(int fd, 
 	}
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
-		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
+		ret = __filemap_fdatawrite_range(mapping, offset, endbyte, WB_SYNC_NONE);
 		if (ret < 0)
 			goto out_put;
 	}