Message-ID: <56FD344F.70908@fb.com>
Date: Thu, 31 Mar 2016 08:29:35 -0600
From: Jens Axboe <axboe@...com>
To: Dave Chinner <david@...morbit.com>
CC: <linux-kernel@...r.kernel.org>, <linux-fsdevel@...r.kernel.org>,
<linux-block@...r.kernel.org>
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck
On 03/31/2016 02:24 AM, Dave Chinner wrote:
> On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
>> Hi,
>>
>> This patchset isn't so much a final solution as it is a demonstration
>> of what I believe is a huge issue. Since the dawn of time, our
>> background buffered writeback has sucked. When we do background
>> buffered writeback, it should have little impact on foreground
>> activity. That's the definition of background activity... But for as
>> long as I can remember, heavy buffered writers have not behaved like
>> that. For instance, if I do something like this:
>>
>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>
>> on my laptop, and then try and start chrome, it basically won't start
>> before the buffered writeback is done. Or, for server oriented
>> workloads, where installation of a big RPM (or similar) adversely
>> impacts database reads or sync writes. When that happens, I get people
>> yelling at me.
>>
>> Last time I posted this, I used flash storage as the example. But
>> this works equally well on rotating storage. Let's run a test case
>> that writes a lot. This test writes 50 files, each 100M, on XFS on
>> a regular hard drive. While this happens, we attempt to read
>> another file with fio.
>>
>> Writers:
>>
>> $ time (./write-files ; sync)
>> real 1m6.304s
>> user 0m0.020s
>> sys 0m12.210s
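For anyone wanting to reproduce this without write-files, something of this
shape should be close enough - the details of the actual script and of the
fio read job aren't in this posting, so the below is illustrative:

  # Writers: 50 files of 100M each through the page cache, then a sync.
  $ time (for i in $(seq 0 49); do dd if=/dev/zero of=file-$i bs=1M count=100; done; sync)

  # Reader: a buffered sequential read of a separate, pre-existing file
  # while the writers run (file name and size made up).
  $ fio --name=reader --filename=readfile --size=1G --rw=read --bs=4k \
        --ioengine=psync --direct=0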
>
> Great. So a basic IO test looks good - let's throw something more
> complex at it. Say, a benchmark I've been using for years to stress
> the IO subsystem, the filesystem and memory reclaim all at the same
> time: a concurrent fsmark inode creation test.
> (first google hit https://lkml.org/lkml/2013/9/10/46)
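Based on the totals below, a run of that shape would be roughly 16 fs_mark
workers each creating 100k zero-length files per iteration over 32 iterations
(1.6M inodes per sample point, 51.2M total). My guess at the invocation - the
exact arguments are in the linked post:

  # One worker per -d directory, zero-length files (-s 0), no syncing
  # (-S0), 100k files per worker per iteration, 32 iterations.
  $ fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
        -d /mnt/scratch/0  -d /mnt/scratch/1  -d /mnt/scratch/2  -d /mnt/scratch/3  \
        -d /mnt/scratch/4  -d /mnt/scratch/5  -d /mnt/scratch/6  -d /mnt/scratch/7  \
        -d /mnt/scratch/8  -d /mnt/scratch/9  -d /mnt/scratch/10 -d /mnt/scratch/11 \
        -d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 -d /mnt/scratch/15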
Is that how you are invoking it as well, with the same arguments?
> This generates thousands of REQ_WRITE metadata IOs every second, so
> if I understand correctly how the throttle works, these would be
> classified as background writeback by the block layer throttle.
> And....
>
> FSUse%        Count     Size    Files/sec   App Overhead
>      0      1600000        0     255845.0       10796891
>      0      3200000        0     261348.8       10842349
>      0      4800000        0     249172.3       14121232
>      0      6400000        0     245172.8       12453759
>      0      8000000        0     201249.5       14293100
>      0      9600000        0     200417.5       29496551
>>>>>  0     11200000        0      90399.6       40665397
>      0     12800000        0     212265.6       21839031
>      0     14400000        0     206398.8       32598378
>      0     16000000        0     197589.7       26266552
>      0     17600000        0     206405.2       16447795
>>>>>  0     19200000        0      99189.6       87650540
>      0     20800000        0     249720.8       12294862
>      0     22400000        0     138523.8       47330007
>>>>>  0     24000000        0      85486.2       14271096
>      0     25600000        0     157538.1       64430611
>      0     27200000        0     109677.8       47835961
>      0     28800000        0     207230.5       31301031
>      0     30400000        0     188739.6       33750424
>      0     32000000        0     174197.9       41402526
>      0     33600000        0     139152.0      100838085
>      0     35200000        0     203729.7       34833764
>      0     36800000        0     228277.4       12459062
>>>>>  0     38400000        0      94962.0       30189182
>      0     40000000        0     166221.9       40564922
>>>>>  0     41600000        0      62902.5       80098461
>      0     43200000        0     217932.6       22539354
>      0     44800000        0     189594.6       24692209
>      0     46400000        0     137834.1       39822038
>      0     48000000        0     240043.8       12779453
>      0     49600000        0     176830.8       16604133
>      0     51200000        0     180771.8       32860221
>
> real 5m35.967s
> user 3m57.054s
> sys 48m53.332s
>
> In those highlighted report points, the performance has dropped
> significantly. The typical range I expect to see once memory has
> filled (a bit over 8m inodes) is 180k-220k. Runtime on a vanilla
> kernel was 4m40s and there were no performance drops, so this
> workload runs almost a minute slower with the block layer throttling
> code.
>
> What I see in these performance dips is the XFS transaction
> subsystem stalling *completely* - instead of running at a steady
> state of around 350,000 transactions/s, there are *zero*
> transactions running for periods of up to ten seconds. This
> coincides with the CPU usage falling to almost zero as well.
> AFAICT, the only thing that is running when the filesystem stalls
> like this is memory reclaim.
I'll take a look at this; stalls should definitely not be occurring. How
much memory does the box have?
> Without the block throttling patches, the workload quickly finds a
> steady state of around 7.5-8.5 million cached inodes, and it doesn't
> vary much outside those bounds. With the block throttling patches,
> on every transaction subsystem stall that occurs, the inode cache
> gets 3-4 million inodes trimmed out of it (i.e. half the
> cache), and in a couple of cases I saw it trim 6+ million inodes from
> the cache before the transactions started up and the cache started
> growing again.
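One way to watch that while the test runs - not necessarily how you measured
it - is to sample the xfs_inode slab and the VFS inode counts:

  # /proc/slabinfo needs root; inode-nr shows total and unused inode counts.
  $ watch -n1 'grep xfs_inode /proc/slabinfo; cat /proc/sys/fs/inode-nr'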
>
>> The above was run without scsi-mq, and using the deadline scheduler;
>> results with CFQ are similarly depressing for this test. So IO scheduling
>> is in place for this test; it's not pure blk-mq without scheduling.
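For completeness, since this is easy to get wrong when comparing results: the
scheduler can be checked and switched per device through sysfs (device name
illustrative):

  # The active scheduler is shown in brackets, e.g. "noop deadline [cfq]".
  $ cat /sys/block/sda/queue/scheduler
  $ echo deadline > /sys/block/sda/queue/scheduler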
>
> virtio in guest, XFS direct IO -> no-op -> scsi in host.
That has writeback caching enabled on the guest, correct?
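If it's a qemu/KVM guest, the guest-visible write cache comes from the cache=
setting on the host-side drive; an illustrative invocation, not your actual
command line:

  # cache=none or cache=directsync bypass the host page cache;
  # cache=writeback leaves a volatile write cache visible to the guest.
  $ qemu-system-x86_64 -enable-kvm -m 16384 -smp 16 \
        -drive file=root.img,if=virtio,cache=none \
        -drive file=scratch.img,if=virtio,cache=writeback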
--
Jens Axboe