Date:	Thu, 17 Sep 2015 10:37:38 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Chris Mason <clm@...com>, Jan Kara <jack@...e.cz>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Josef Bacik <jbacik@...com>,
	LKML <linux-kernel@...r.kernel.org>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>,
	Neil Brown <neilb@...e.de>, Christoph Hellwig <hch@....de>
Cc:	Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

[cc Tejun]

On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > Dave's original.  I should have done the blktrace runs for 60 seconds
> > > > instead of 30, I suspect that would even out the average sizes between
> > > > the three patches.
> > > 
> > > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > > right?
> > 
> > Looks like it works well to me.
> 
> Graph looks good, though I'll confirm it on my test rig once I get
> out from under the pile of email and other stuff that is queued up
> after being away for a week...

I ran some tests in the background while reading other email.....

TL;DR: Results look really bad - not only is the plugging
problematic, but baseline writeback performance has regressed
significantly as well. We need to revert the plugging changes until
the underlying writeback performance regressions are sorted out.

In more detail, these tests were run on my usual 16p/16GB RAM
performance test VM with storage set up as described here:

http://permalink.gmane.org/gmane.linux.kernel/1768786

The test:

$ ~/tests/fsmark-10-4-test-xfs.sh
meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

#  ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7
#       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     0        80000         4096     106938.0           543310
     0       160000         4096     102922.7           476362
     0       240000         4096     107182.9           538206
     0       320000         4096     107871.7           619821
     0       400000         4096      99255.6           622021
     0       480000         4096     103217.8           609943
     0       560000         4096      96544.2           640988
     0       640000         4096     100347.3           676237
     0       720000         4096      87534.8           483495
     0       800000         4096      72577.5          2556920
     0       880000         4096      97569.0           646996

<RAM fills here, sustained performance is now dependent on writeback>

     0       960000         4096      80147.0           515679
     0      1040000         4096     100394.2           816979
     0      1120000         4096      91466.5           739009
     0      1200000         4096      85868.1           977506
     0      1280000         4096      89691.5           715207
     0      1360000         4096      52547.5           712810
     0      1440000         4096      47999.1           685282
     0      1520000         4096      47894.3           697261
     0      1600000         4096      47549.4           789977
     0      1680000         4096      40029.2           677885
     0      1760000         4096      16637.4         12804557
     0      1840000         4096      16883.6         24295975
     0      1920000         4096      18852.1         22264137
     0      2000000         4096      27107.3         10310408
     0      2080000         4096      23654.1         19823567
     0      2160000         4096      24588.9         19769484
     0      2240000         4096      13959.3         40168163
     0      2320000         4096      26239.2         17806813
     0      2400000         4096      17858.5         24596715
     0      2480000         4096      12072.9         38177650
     0      2560000         4096       9989.7         43413729

<This is Real Bad. It gets significantly worse as the test goes on:>

     0      6880000         4096      11773.4         16470400
     0      6960000         4096       5556.5         27325343
     0      7040000         4096      13816.6         22440991
     0      7120000         4096       7322.4         42086686
     0      7200000         4096       8033.0         31182907
     0      7280000         4096      10589.4         30876729
     0      7360000         4096      15844.5         19903105
     0      7440000         4096       9187.7         35041718
     0      7520000         4096      11800.9         12254920
     0      7600000         4096      20049.9         13052020
     0      7680000         4096      17444.9         12055262
     0      7760000         4096      25900.6         10633885
     0      7840000         4096       6398.8         12014901
     0      7920000         4096      12147.4         14655678

<...>

     0      9360000         4096      19406.0          5874193
     0      9440000         4096       7261.3         14542806
     0      9520000         4096      14978.7         21921186
     0      9600000         4096      12656.7         10539894

real    10m9.240s
user    0m51.817s
sys     13m23.874s

A quick reading of the IO behaviour shows an average of about 1000
IOPS and 100MB/s, which roughly matches the average files/s being
created. There are no obvious peaks or gaps in the IO throughput,
but CPU usage has regular gaps where it falls to nothing and the
only "usage" is IO wait time.

Reverting just Linus' patch, so no plugging at all in the code
(i.e. the baseline numbers):

FSUse%        Count         Size    Files/sec     App Overhead
     0        80000         4096     109017.5           586268
     0       160000         4096      93548.1           553333
     0       240000         4096     104141.7           817182
     0       320000         4096     112846.2           597494
     0       400000         4096      98628.6           641524
     0       480000         4096      86586.1          1274195
     0       560000         4096      95965.0           776961
     0       640000         4096      84585.9           583330
     0       720000         4096      93787.1           695658
     0       800000         4096      94112.7           645925
     0       880000         4096      92259.2           792950

<RAM fills here, sustained performance is now dependent on writeback>

     0       960000         4096      62936.9           763642
     0      1040000         4096      22484.8           737471
     0      1120000         4096      22038.6           706791
     0      1200000         4096      24096.5           680878
     0      1280000         4096      23973.1           839826
     0      1360000         4096      24761.9           779486
     0      1440000         4096      25467.7           976022
     0      1520000         4096      25338.7           935552
     0      1600000         4096      27484.8           683351
     0      1680000         4096      24383.2          3039781
     0      1760000         4096      24857.0          2657008

<maintains 13-26k files/s all the way to the end .... >

     0      9040000         4096      21438.6          4869317
     0      9120000         4096      25387.1           642198
     0      9200000         4096      26815.9           727763
     0      9280000         4096      25175.5          4458811
     0      9360000         4096      19832.8           982122
     0      9440000         4096      17643.8           805599
     0      9520000         4096      13923.4          7819152
     0      9600000         4096      18101.3          1071163

real    7m7.926s
user    0m57.976s
sys     14m14.620s

The no-plugging baseline is a full 3 minutes faster than the
plugging behaviour of Linus' patch. The IO behaviour bears that out,
sustaining 25,000-30,000 IOPS and throughput of 130-150MB/s.  Hence,
while Linus' patch does change the IO patterns, it does not result
in a performance improvement the way the original plugging patch
did.
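
For anyone not following the earlier threads: the "high level
plugging" being argued about is a block plug held across the whole
background writeback pass, so the many small IOs can be merged
before dispatch. Something of this shape (a from-memory sketch of
the idea only, not the exact committed diff; names as in
fs/fs-writeback.c of that era):

	/* illustrative sketch - not the committed code */
	static long wb_writeback(struct bdi_writeback *wb,
				 struct wb_writeback_work *work)
	{
		struct blk_plug plug;
		long progress = 0;

		blk_start_plug(&plug);	/* batch IO across the whole pass */
		for (;;) {
			/* ... queue and write back dirty inodes ... */
			if (work->sb)
				progress = writeback_sb_inodes(work->sb, wb, work);
			else
				progress = __writeback_inodes_wb(wb, work);
			if (!progress)
				break;
		}
		blk_finish_plug(&plug);	/* unplug: submit the batched IO */

		return progress;
	}

The point of contention in this thread is where blk_finish_plug()
runs relative to wb->list_lock and how much IO actually gets merged
before the plug is flushed.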

So I went back and had a look at my original patch, which I've been
using locally for a couple of years and which is similar to the
original commit. It carries this description from when I last
updated the perf numbers, based on testing done on 3.17:

| Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
| metadata CRCs enabled.
| 
| Test:
| 
| $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
| /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
| /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
| /mnt/scratch/6  -d  /mnt/scratch/7
| 
| Result:
|                 wall    sys     create rate     Physical write IO
|                 time    CPU     (avg files/s)    IOPS   Bandwidth
|                 -----   -----   -------------   ------  ---------
| unpatched       5m54s   15m32s  32,500+/-2200   28,000  150MB/s
| patched         3m19s   13m28s  52,900+/-1800    1,500  280MB/s
| improvement     -43.8%  -13.3%    +62.7%        -94.6%  +86.6%

IOWs, what we are seeing here is that baseline writeback
performance has regressed quite significantly since I took those
numbers back on 3.17.  I'm running on exactly the same test setup;
the only difference is the kernel, and the current kernel's baseline
is ~20% slower than the baseline numbers recorded in my patch.
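
(Rough arithmetic behind that figure, using the numbers above: the
old unpatched baseline was 5m54s = 354s of wall time, while the
current no-plugging baseline run took 7m08s = 428s, i.e. roughly
20% more wall time for the same workload. The create rates tell the
same story: ~32,500 files/s then vs roughly 22-27k files/s
sustained now.)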

So, let's put the original [broken] plugging back in:

.....
     0      8960000         4096      28420.0          2129129
     0      9040000         4096      26863.0          6893094
     0      9120000         4096      46721.4           705927
     0      9200000         4096      46384.0          1081627
     0      9280000         4096      39692.3          2148938
     0      9360000         4096      41976.0          3540489
     0      9440000         4096      28025.1         10364440
     0      9520000         4096      38580.2          2457019
     0      9600000         4096      33526.1          7661316

real    5m12.114s
user    0m50.294s
sys     12m54.425s

While that gets back the regression and then some, it's still much
slower than the high level plugging was on slightly older kernels.
And the performance is, again, highly variable through the midpoint
of the test run, e.g.:

....
     0      4720000         4096      39224.4          9401401
     0      4800000         4096      25342.9         18889665
     0      4880000         4096      14841.7         36668633
     0      4960000         4096      27624.6         17731873
     0      5040000         4096      37874.8         11390831
     0      5120000         4096      29009.2         12331965
     0      5200000         4096      49568.0          4346746
     0      5280000         4096      83074.5           877142
     0      5360000         4096      32498.8          5401470
     0      5440000         4096      55468.1          3188117
     0      5520000         4096      19156.0          4112120
     0      5600000         4096      44894.0          3186559
....

We have a low of ~15k files/s and a high of 83k files/s whilst
averaging about 38k files/s.

IOWs, plugging aside, the recent changes to the writeback code have
really screwed up writeback performance and made its throughput far
less consistent.  I'd suggest we start by reverting all the
plugging changes for now, because they have exposed other problems
in the current writeback code that need to be fixed first...

Cheers

Dave.

PS: I was going to run this on an ext4 filesystem to check it
wasn't just XFS seeing this, but that reminded me why I don't run
this test on ext4:

FSUse%        Count         Size    Files/sec     App Overhead
     5        80000         4096     211668.3           534504
     5       160000         4096     191731.8           532701
     5       240000         4096     196251.2           489772
     5       320000         4096     207597.1           464032
     5       400000         4096     121616.0           601156
     5       480000         4096     197729.9           573925

<Really, really fast in memory, but background writeback starts here>

     5       560000         4096     128698.4           546362
     5       640000         4096     137904.3           543718
     5       720000         4096      63680.9           498110
     5       800000         4096      25825.9          1538213
     5       880000         4096      32708.7           632235

<dirty page threshold (20% RAM) is hit here. performance sucks from
 here on in (still 10GB of free RAM) because the writeback IO
 pattern is a random 4k write pattern>

     5       960000         4096       6940.2          7250666
     5      1040000         4096       7474.8           711872
     5      1120000         4096       4127.5         14034167
     5      1200000         4096       4822.1           752398
....

Hence it takes around an hour to do a single run of this test on
ext4, because its delayed allocation algorithm does not linearise
writes to small files with similar inode locality the way XFS does,
and that's exactly the behaviour that high level plugging helps.
IOWs, this specific workload does not demonstrate any significant
performance difference on ext4 between the different plugging
algorithms because it does random IO rather than sequential IO.

I also ran it on btrfs, but it pushes peaks of 15,000-25,000 IOPS
every 30s regardless of whether there is plugging at a high level or
not, so again btrfs behaviour is pretty much unaffected by the high
level plugging on this workload....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
