linux-kernel - Re: [RFC] Block IO Controller V2

Message-ID: <20091116211412.GJ13235@redhat.com>
Date:	Mon, 16 Nov 2009 16:14:12 -0500
From:	Vivek Goyal <vgoyal@...hat.com>
To:	"Alan D. Brunelle" <Alan.Brunelle@...com>
Cc:	linux-kernel@...r.kernel.org, jens.axboe@...cle.com
Subject: Re: [RFC] Block IO Controller V2 - some results

On Mon, Nov 16, 2009 at 03:51:00PM -0500, Alan D. Brunelle wrote:
> Hi Vivek: 
> 
> I'm finding some things that don't quite seem right - executive
> summary: 

Hi Alan,

Thanks a lot for such extensive testing and for the test results. I am
still digesting the results, but I thought I would make a quick note about
writes. This patchset works only for sync IO. If you are performing
buffered writes then you will not see any service differentiation.
Support for the buffered write path is on the TODO list.

> 
> o  I think the apportionment algorithm doesn't work consistently well
> for writes.
> 
> o  I think there are problems with significant performance loss when
> doing random I/Os.

This concerns me. I had a quick look and, as per your results, you are
seeing this regression even with group_idle=0. I guess this might be coming
from the fact that we idle on the sync-noidle workload per group, and that
idling becomes significant as the number of groups increases.

Thanks
Vivek

> 
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> 
> Test configuration: HP dl585 (32-way quad-core AMD Opteron processors +
> 128GB RAM + 4 FC HBAs + 4 MSA 1000's (each exporting 3 multi-disk
> striped LUNs)). Running with Jens Axboe's remotes/origin/for-2.6.33
> branch (at commit 8721c81f6480e2c9acbf92078383953f825d1057) w/out and
> w/ your V2 patch.
> 
> The test: 12 Ext3 file systems (1 per disk), each file system has eight
> 8GB files on it. Doing simple fio runs in various modes and I/O
> directions: random or sequential, read or write or read/write (80%/20%).
> Using 2, 4 or 8 processes per file system (each process working on a
> different file). Here is a sample fio command file:
> 
> [global]
> ioengine=sync
> size=8g
> overwrite=0
> runtime=120
> bs=256k
> readwrite=write
> [/mnt/sdl/data.7]
> filename=/mnt/sdl/data.7
> 
> I'm then using cgroups that have IO weights as follows:
> 
> /cgroup/test0/blkio.weight 100
> /cgroup/test1/blkio.weight 200
> /cgroup/test2/blkio.weight 300
> /cgroup/test3/blkio.weight 400
> /cgroup/test4/blkio.weight 500
> /cgroup/test5/blkio.weight 600
> /cgroup/test6/blkio.weight 700
> /cgroup/test7/blkio.weight 800
> 
> There were 12 X N total processes running in the system for each test,
> and each file system would have N processes working on a different file in
> that file system. The N processes would be assigned to increasing test
> groups: process 0 will be in test0's group and working on file 0 in a
> file system; process 1 will be in test1's group and working on file 1 in
> a file system; and so on.
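
The process-to-group layout described above can be sketched in Python.
This is only an illustration of the mapping, not the script that was
actually used: the function name is hypothetical, the weight formula
100*(i+1) is inferred from the blkio.weight list above, and the only
mount point taken from the mail is /mnt/sdl.

```python
def assignment_plan(n_procs, mounts):
    """Return (file, cgroup, weight) tuples, one per process.

    Process i on every file system works on data.<i> and is placed in
    /cgroup/test<i>, whose weight is assumed to be 100 * (i + 1).
    """
    plan = []
    for mnt in mounts:
        for i in range(n_procs):
            plan.append((f"{mnt}/data.{i}", f"/cgroup/test{i}", 100 * (i + 1)))
    return plan


# One file system shown; the tests used 12, giving 12 x N processes.
for entry in assignment_plan(4, ["/mnt/sdl"]):
    print(entry)
```

With 12 mount points and N=8 this yields 96 entries, matching the
"12 X N" process count quoted above.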
> 
> Before each test I drop caches & umount/mount the filesystem anew.
> 
> In the following tables:
> 
> 'base' - means a kernel generated from Jens' branch (-no- patching)
> 
> 'ioc off' - means a kernel generated w/ your patches added but -no-
> other settings (no CGROUP stuff mounted or enabled)
> 
> 'ioc no idle' - means the ioc kernel w/ CGROUP stuff enabled
> -but- /sys/block/sd*/queue/iosched/cgroup_idle = 0
> 
> 'ioc idle' - means the ioc kernel w/ CGROUP stuff enabled
> -and- /sys/block/sd*/queue/iosched/cgroup_idle = 1
> 
> Modes: random or sequential
> 
> RdWr: rd==read, wr==write, rdwr==80%read & 20%write
> 
> N: Number of processes per disk
> 
> testX: Processes sharing a task group (when enabled)
> 
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> 
> The first thing to do is to check for correctness: when the I/O
> controller is enabled do we see correctly apportioned I/O?
> 
> At the tail end of the e-mail I've placed three (3) tables showing the
> state where -no- differences should be seen between the various "task"
> groups in terms of performance ("level playing field"), and sure enough
> no differences were seen. These were done basically as a "control" set
> of tests - the script being used didn't have any inherent biases in
> it.[1]
> 
> This table shows the cases where we should see a difference based upon
> weights:
> 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
>        Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
>    ioc idle  rnd   rd 2   2.8   6.3 
>    ioc idle  rnd   rd 4   0.7   1.5   2.5   3.5 
>    ioc idle  rnd   rd 8   0.2   0.4   0.5   0.8   0.9   1.2   1.4   1.7 
> 
>    ioc idle  rnd   wr 2  38.2 192.7 
>    ioc idle  rnd   wr 4   1.0  17.7  38.1 204.5 
>    ioc idle  rnd   wr 8   0.3   0.6   0.9   1.5   2.2  16.3  16.6 208.3 
> 
>    ioc idle  rnd rdwr 2   4.9  11.3 
>    ioc idle  rnd rdwr 4   0.9   2.4   4.3   6.2 
>    ioc idle  rnd rdwr 8   0.2   0.5   0.8   1.1   1.4   1.8   2.2   2.7 
> 
> 
>    ioc idle  seq   rd 2 221.0 386.4 
>    ioc idle  seq   rd 4  69.8 128.1 183.2 226.8 
>    ioc idle  seq   rd 8  21.4  40.0  55.6  70.8  85.2  98.3 111.6 121.9 
> 
>    ioc idle  seq   wr 2 398.6 391.6 
>    ioc idle  seq   wr 4 219.0 214.5 214.1 214.5 
>    ioc idle  seq   wr 8 107.6 106.8 104.7 102.5  99.5  99.5 100.5 100.8 
> 
>    ioc idle  seq rdwr 2 196.8 340.9 
>    ioc idle  seq rdwr 4  64.0 109.6 148.7 183.5 
>    ioc idle  seq rdwr 8  22.6  36.6  48.8  61.1  70.3  78.5  84.9  94.3 
> 
> In general, we do see weights associated in correctly increasing order,
> but I don't think the proportions are done correctly in all cases.
> 
> In the random tests for example, the read distribution looks pretty
> decent, but random writes are all off - for some reason the highest
> priority (most heavily weighted) is getting a disproportionately large
> percentage of the I/O bandwidth.
> 
> For the sequential loads, the reads look "OK" - not quite correctly fair
> when we have 8 processes running against the devices, but on the whole
> things look ok. Sequential writes are not working well at all:
> relatively flat distribution. 
> 
> I _think_ this is pointing to some real problems in both the write cases
> for both random & sequential I/Os.
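
As a rough cross-check of the apportionment, the share each group would
get under strictly proportional sharing can be compared with the share
observed in the table. The sketch below assumes that "correct" means
bandwidth split exactly in proportion to blkio.weight; the helper names
are made up for illustration, and the rows are copied from the N=4
random read and random write lines above.

```python
def expected_shares(weights):
    """Fraction of total bandwidth per group if sharing is strictly proportional."""
    total = sum(weights)
    return [w / total for w in weights]


def observed_shares(throughputs):
    """Fraction of total measured throughput that each group actually received."""
    total = sum(throughputs)
    return [t / total for t in throughputs]


weights = [100, 200, 300, 400]          # test0..test3
rows = {
    "rnd rd 4": [0.7, 1.5, 2.5, 3.5],
    "rnd wr 4": [1.0, 17.7, 38.1, 204.5],
}

print("expected", ["%.2f" % s for s in expected_shares(weights)])
for name, row in rows.items():
    print(name, ["%.2f" % s for s in observed_shares(row)])
```

Under that assumption the random read row stays close to the expected
0.10/0.20/0.30/0.40 split, while in the random write row test3 receives
roughly 78% of the total instead of 40%.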
> 
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> 
> The next thing to look at is to see what the "penalty" is for the
> additional code: see how much bandwidth we lose for the capability
> added. Here we see the sum of the system's throughput for the various
> tests:
> 
> ---- ---- - ----------- ----------- ----------- ----------- 
> Mode RdWr N    base       ioc off   ioc no idle  ioc idle   
> ---- ---- - ----------- ----------- ----------- ----------- 
>  rnd   rd 2        17.3        17.1         9.4         9.1 
>  rnd   rd 4        27.1        27.1         8.1         8.2 
>  rnd   rd 8        37.1        37.1         6.8         7.1 
> 
>  rnd   wr 2       296.5       243.7       290.2       230.9 
>  rnd   wr 4       287.3       280.7       270.4       261.3 
>  rnd   wr 8       272.5       273.1       237.7       246.5 
> 
>  rnd rdwr 2        27.4        27.7        16.1        16.2 
>  rnd rdwr 4        38.3        39.3        13.5        13.9 
>  rnd rdwr 8        62.0        61.5        10.0        10.7 
> 
>  seq   rd 2       610.2       608.1       610.7       607.4 
>  seq   rd 4       608.4       601.5       609.3       608.0 
>  seq   rd 8       605.7       603.7       605.0       604.8 
> 
>  seq   wr 2       840.3       850.2       836.8       790.2 
>  seq   wr 4       886.8       891.6       868.2       862.2 
>  seq   wr 8       865.1       887.1       832.1       822.0 
> 
>  seq rdwr 2       536.2       550.0       538.1       537.7 
>  seq rdwr 4       595.3       605.7       512.9       505.8 
>  seq rdwr 8       617.3       628.5       526.6       497.1
> 
> The sequential runs look very good - not much variance across the board.
> 
> The random results look horrible, especially when reads are involved:
> The first two columns (base & ioc off) are very similar, however note
> the significant drop in overall system performance once the
> io-controller CGROUP stuff gets involved - the more processes involved
> the more performance is lost. 
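
To put a number on the drop, the loss relative to the base kernel can
be computed directly from the random read rows of the table above. This
is a minimal sketch; the function name is hypothetical and the values
are the "base" and "ioc idle" columns as quoted.

```python
def loss_pct(base, value):
    """Throughput lost relative to the base kernel, in percent."""
    return 100.0 * (base - value) / base


# N -> (base, ioc idle) aggregate throughput for random reads
rnd_rd = {2: (17.3, 9.1), 4: (27.1, 8.2), 8: (37.1, 7.1)}

for n, (base, ioc_idle) in sorted(rnd_rd.items()):
    print("N=%d: %.1f%% lost" % (n, loss_pct(base, ioc_idle)))
```

By these figures the loss grows from a little under half at N=2 to
about four fifths at N=8.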
> 
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> 
> I'm going to spend some time drilling down into three specific tests:
> 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
>        Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
>    ioc idle  rnd   wr 2  38.2 192.7 
>    ioc idle  seq   wr 2 398.6 391.6 
> 
> This test I can use to see why random writes are so disproportionately
> apportioned - it should be 2-to-1 but we are seeing something like
> 6-to-1. And then I can look at why sequential writes are flat.
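
The ratios for these two rows can be checked with a few lines of
Python. The sketch assumes the target ratio is simply the ratio of the
two weights (200/100); the variable names are illustrative only.

```python
def ratio(a, b):
    """Throughput (or weight) of one group relative to another."""
    return a / b


expected = ratio(200, 100)       # test1 weight / test0 weight
rnd_wr = ratio(192.7, 38.2)      # test1 / test0, random writes, N=2
seq_wr = ratio(391.6, 398.6)     # test1 / test0, sequential writes, N=2

print("expected %.2f, rnd wr %.2f, seq wr %.2f" % (expected, rnd_wr, seq_wr))
```

With the numbers as quoted, the random write ratio works out to roughly
5-to-1 and the sequential write ratio to just under 1-to-1.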
> 
> and:
> 
> ---- ---- - ----------- ----------- ----------- ----------- 
> Mode RdWr N    base       ioc off   ioc no idle  ioc idle   
> ---- ---- - ----------- ----------- ----------- ----------- 
>  rnd   rd 2        17.3        17.1         9.4         9.1 
> 
> I will try to find out why we are seeing such a loss in system
> performance...
> 
> Regards,
> Alan D. Brunelle
> Hewlett-Packard / Linux Kernel Technology Team
> 
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> [1] Three tables showing the I/O load distributed when either there was
> no I/O controller code or when it was turned off or when cgroup_idle was
> turned off. All looks sane - with the exception of the ioc-enabled
> kernel with no-idle set - for random writes it appears that there are
> some differences, but not an appreciable amount?
> 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
>        Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
>        base  rnd   rd 2   8.6   8.6 
>        base  rnd   rd 4   6.8   6.8   6.8   6.7 
>        base  rnd   rd 8   4.7   4.6   4.6   4.6   4.6   4.6   4.6   4.6 
> 
>        base  rnd   wr 2 150.4 146.1 
>        base  rnd   wr 4  75.2  74.8  68.1  69.2 
>        base  rnd   wr 8  36.2  39.3  29.6  35.9  32.9  37.0  29.6  32.2 
> 
>        base  rnd rdwr 2  13.7  13.7 
>        base  rnd rdwr 4   9.6   9.6   9.6   9.6 
>        base  rnd rdwr 8   7.8   7.8   7.7   7.8   7.8   7.7   7.7   7.8 
> 
> 
>        base  seq   rd 2 306.2 304.0 
>        base  seq   rd 4 150.1 152.4 151.9 154.0 
>        base  seq   rd 8  77.2  75.9  75.9  73.9  77.0  75.7  75.0  74.9 
> 
>        base  seq   wr 2 420.2 420.1 
>        base  seq   wr 4 220.5 222.5 221.9 221.9 
>        base  seq   wr 8 108.2 108.8 107.8 107.7 108.7 108.5 108.1 107.2 
> 
>        base  seq rdwr 2 268.4 267.8 
>        base  seq rdwr 4 148.9 150.6 147.8 148.0 
>        base  seq rdwr 8  78.0  77.7  76.3  76.0  79.1  77.9  74.3  77.9 
> 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
>        Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
>     ioc off  rnd   rd 2   8.6   8.6 
>     ioc off  rnd   rd 4   6.8   6.8   6.7   6.7 
>     ioc off  rnd   rd 8   4.7   4.6   4.6   4.7   4.6   4.6   4.6   4.6 
> 
>     ioc off  rnd   wr 2 112.6 131.1 
>     ioc off  rnd   wr 4  64.9  67.8  79.9  68.1 
>     ioc off  rnd   wr 8  35.1  39.5  31.5  32.0  36.1  34.5  30.8  33.5 
> 
>     ioc off  rnd rdwr 2  13.8  13.8 
>     ioc off  rnd rdwr 4   9.8   9.8   9.9   9.8 
>     ioc off  rnd rdwr 8   7.7   7.7   7.7   7.7   7.7   7.7   7.7   7.7 
> 
> 
>     ioc off  seq   rd 2 303.1 305.0 
>     ioc off  seq   rd 4 150.8 151.6 149.0 150.2 
>     ioc off  seq   rd 8  77.0  76.3  74.5  74.0  77.9  75.5  74.0  74.6 
> 
>     ioc off  seq   wr 2 424.6 425.5 
>     ioc off  seq   wr 4 223.0 222.4 223.9 222.3 
>     ioc off  seq   wr 8 110.8 112.0 111.3 109.6 111.7 111.3 110.8 109.7 
> 
>     ioc off  seq rdwr 2 274.3 275.8 
>     ioc off  seq rdwr 4 151.3 154.8 149.0 150.6 
>     ioc off  seq rdwr 8  81.1  80.6  77.8  74.8  81.0  78.5  77.0  77.7
> 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
>        Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
> ioc no idle  rnd   rd 2   4.7   4.7 
> ioc no idle  rnd   rd 4   2.0   2.0   2.0   2.0 
> ioc no idle  rnd   rd 8   0.9   0.9   0.8   0.8   0.8   0.8   0.9   0.9 
> 
> ioc no idle  rnd   wr 2 144.8 145.4 
> ioc no idle  rnd   wr 4  73.2  65.9  65.5  65.8 
> ioc no idle  rnd   wr 8  35.5  52.5  26.2  31.0  25.5  19.3  25.1  22.6 
> 
> ioc no idle  rnd rdwr 2   8.1   8.1 
> ioc no idle  rnd rdwr 4   3.4   3.4   3.4   3.4 
> ioc no idle  rnd rdwr 8   1.3   1.3   1.3   1.2   1.2   1.3   1.2   1.3 
> 
> 
> ioc no idle  seq   rd 2 304.1 306.6 
> ioc no idle  seq   rd 4 152.1 154.5 149.8 153.0 
> ioc no idle  seq   rd 8  75.8  75.8  75.2  75.1  75.5  75.3  75.7  76.5 
> 
> ioc no idle  seq   wr 2 418.6 418.2 
> ioc no idle  seq   wr 4 217.7 217.7 215.4 217.4 
> ioc no idle  seq   wr 8 105.5 105.8 105.8 103.4 102.9 103.1 102.7 102.8 
> 
> ioc no idle  seq rdwr 2 269.2 269.0 
> ioc no idle  seq rdwr 4 130.0 126.4 127.8 128.6 
> ioc no idle  seq rdwr 8  67.2  66.6  65.4  65.0  65.3  64.8  65.7  66.5 
> 
> 
> 