linux-ext4 - streaming read and write

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <4278.1214242602@alphaville.zko.hp.com>
Date:	Mon, 23 Jun 2008 13:36:42 -0400
From:	Nick Dokos <nicholas.dokos@...com>
To:	linux-ext4@...r.kernel.org
Cc:	nicholas.dokos@...com
Subject: streaming read and write - test results

Here are some preliminary results from streaming reads and writes
on ext4. The test rig consists of the following:

o an 8-cpu Opteron box (4 slots, dual-core) with 64GB of memory.

o 8 MSA1000 RAID controllers, each with four back-end SCSI busses, each
bus with 7 300GB 15K disks. Only one bus on each controller is used for
the tests below (the rest are going to be used for larger filesystem
testing). The 7 disks are striped ("horizontally" @ 128KB stripe size) at
the hardware level and exported as a single 2TB LUN (that's the current
hardware LUN size limit).

o the 8 LUNs are striped ("vertically" @ 128KB also) at the LVM level to
produce a 16TB logical volume.

o the kernels used were 2.6.26-rc3 and 2.6.26-rc5, each patched with the
then current ext4 patch queue.

o e2fsprogs were 1.41-WIP.

o the test used was aiod (http://sourceforge.net/projects/aiod/) with the
  following command line:

  aiod -S -B -b4M -I -v -w 3500000 -W|-R

  (aiod uses AIO by default, but reverts back to ordinary read/write with -S
  - note that the documentation calls this "sync-io" but that's a
  misnomer: there is nothing synchronous about it, it just means non-AIO;
  aiod uses directIO by default, but reverts to buffered IO with -B;
  -b4M makes aiod issue 4MB IOs and -w <N> makes it issue that many IOs.)

  The test first sequentially writes a ~14TB (4MB*3500000) file,
  unmounts the fs, remounts it and then sequentially reads the file back.

o There were two write tests: in the first,  the fs was mounted with default options;
  in the second with mntopts=delalloc,journal_async_commit.

o The readahead setting was changed by doing a blockdev --setra on the LVM special
  file, /dev/mapper/<vgname>-<lvname>. The default RA was 256.


-------------------------------------------------------------------------------
Throughput results (in MB/s):
                
Ext4         Read   Read (RA=4K)    Read (RA=64K)    Write     Write(+mntopts)
----         ----   ------------    -------------    -----     ---------------
2.6.26-rc3    122                                      144

2.6.26-rc5    117            509 [1]          570                      295 [2]
                                                                       281

Ext3         Read                                    Write
----         ----                                    -----
2.6.26-rc3    131                                      155 [3]



Footnotes:
-----
[1] Typical iostat output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   14.50    0.57    0.00   84.92

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              56.04     0.00 449.05  0.00 129304.30     0.00   287.95     1.51    3.37   2.00  89.59
sde              61.14     0.00 443.96  0.00 129304.30     0.00   291.25     1.51    3.41   2.02  89.63
sdi              60.54     0.00 444.56  0.00 129304.30     0.00   290.86     1.50    3.38   2.02  89.71
sdm              61.14     0.00 443.96  0.00 129304.30     0.00   291.25     1.50    3.39   2.02  89.75
sdq              62.94     0.00 442.16  0.00 129304.30     0.00   292.44     1.49    3.36   2.03  89.67
sdu              64.24     0.00 440.86  0.00 129304.30     0.00   293.30     1.49    3.37   2.03  89.51
sdy              63.64     0.00 441.56  0.00 129304.30     0.00   292.84     1.49    3.37   2.03  89.83
sdac             61.84     0.00 443.36  0.00 129304.30     0.00   291.65     1.50    3.37   2.03  89.99
dm-0              0.00     0.00 2020.38  0.00 1034434.37     0.00   512.00     7.05    3.48   0.46  92.11


and top output (only %CPU > 0 shown):


top - 15:07:20 up 2 days, 20:08,  0 users,  load average: 1.12, 1.38, 1.33
Tasks: 189 total,   1 running, 188 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 13.1%sy,  0.0%ni, 84.8%id,  0.6%wa,  0.2%hi,  1.2%si,  0.0%st
Mem:  66114888k total, 66005660k used,   109228k free,     4196k buffers
Swap:  2040212k total,     3884k used,  2036328k free, 19773724k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
  80046 root      20   0 18808 4796  564 S   86  0.0 186:21.78 aiod             
    391 root      15  -5     0    0    0 S   10  0.0 196:19.05 kswapd1          
    392 root      15  -5     0    0    0 S    8  0.0 178:58.83 kswapd2          
    393 root      15  -5     0    0    0 S    7  0.0 166:29.31 kswapd3          
    135 root      15  -5     0    0    0 S    1  0.0   1:06.74 kblockd/6        
    ....

-----
[2] Two runs made: it's not clear why the difference is so large (~5%).

During both runs, iostat showed fairly constant and even spread among
the disks (sort-of: sda and sde form one group and the others form
another - the difference between the two groups stayed pretty constant
through the run). Here is a typical entry:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   13.01    2.65    0.00   84.34

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00   252.15  0.00 95.80     0.00 75623.58   789.36     0.29    3.04   3.01  28.81
sde               0.00   252.35  0.00 95.60     0.00 75623.58   791.01     0.29    3.07   2.98  28.53
sdi               0.00   185.61  0.00 162.24     0.00 75623.58   466.13     0.51    3.11   1.80  29.21
sdm               0.00   185.31  0.00 162.64     0.00 75623.58   464.98     0.50    3.05   1.77  28.73
sdq               0.00   187.71  0.00 162.54     0.00 75642.76   465.39     0.52    3.17   1.81  29.45
sdu               0.00   185.61  0.00 162.44     0.00 75624.38   465.56     0.51    3.12   1.79  29.09
sdy               0.00   188.41  2.50 167.73    19.98 75689.11   444.75     0.53    3.09   1.79  30.45
sdac              0.00   188.21  0.00 162.54     0.00 75645.95   465.41     0.52    3.21   1.81  29.37
dm-0              0.00     0.00  2.50 1622.18    19.98 604636.16   372.17     4.40    2.70   0.20  31.89



Typical top output looked like this (%CPU > 0 shown only):

top - 01:17:38 up  6:19,  0 users,  load average: 1.38, 1.31, 1.23
Tasks: 189 total,   1 running, 188 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 12.3%sy,  0.0%ni, 84.4%id,  2.6%wa,  0.1%hi,  0.6%si,  0.0%st
Mem:  66114888k total, 66006468k used,   108420k free,   137832k buffers
Swap:  2040212k total,     4120k used,  2036092k free, 39614736k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
   9965 root      20   0 18808 4808  580 S   72  0.0 243:32.36 aiod             
    391 root      15  -5     0    0    0 S    8  0.0  28:23.96 kswapd1          
    392 root      15  -5     0    0    0 S    7  0.0  25:28.88 kswapd2          
    393 root      15  -5     0    0    0 S    7  0.0  23:53.41 kswapd3          
   9860 root      15  -5     0    0    0 S    2  0.0   6:20.61 kjournald2    
   ....
                  


-----
[3] File size in this case is only 2TB (ext3 limit).

During read, typical iostat output looked like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    3.80   10.81    0.00   85.39

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              44.86     0.00 267.83  0.00 33553.65     0.00   125.28     0.35    1.31   1.07  28.73
sde              48.85     0.00 268.03  0.00 33528.07     0.00   125.09     0.36    1.33   1.04  27.85
sdi              23.58     0.00 262.04  0.00 33553.65     0.00   128.05     0.33    1.26   1.03  27.01
sdm              25.07     0.00 262.54  0.00 33553.65     0.00   127.81     0.33    1.24   1.04  27.29
sdq              46.45     0.00 266.63  0.00 33483.32     0.00   125.58     0.35    1.31   1.06  28.29
sdu              23.98     0.00 263.14  0.00 33554.45     0.00   127.52     0.36    1.36   1.07  28.13
sdy              44.76     0.00 267.73  0.00 33553.65     0.00   125.33     0.33    1.24   1.01  27.13
sdac             46.85     0.00 268.33  0.00 33553.65     0.00   125.05     0.32    1.19   0.99  26.57
dm-0              0.00     0.00 1427.57  0.00 268334.07     0.00   187.97     2.02    1.42   0.69  98.74


and typical top output:

top - 18:09:24 up  6:58,  0 users,  load average: 1.11, 1.12, 1.04
Tasks: 196 total,   1 running, 195 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.5%sy,  0.0%ni, 85.0%id, 11.0%wa,  0.0%hi,  0.4%si,  0.0%st
Mem:  66115496k total, 66007072k used,   108424k free,    79140k buffers
Swap:  2040212k total,     3524k used,  2036688k free, 44337588k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
  10748 root      20   0 18808 4796  560 S   22  0.0  33:16.84 aiod             
    382 root      15  -5     0    0    0 S    2  0.0  13:49.95 kswapd1          
    383 root      15  -5     0    0    0 S    2  0.0  12:54.08 kswapd2          
    384 root      15  -5     0    0    0 S    2  0.0  12:03.61 kswapd3          
    ...

During write, iostat looked like this:


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   14.01    1.61    0.00   84.38

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  4997.10  0.00 96.40     0.00 40747.25   422.67     0.57    5.92   3.45  33.25
sde               0.00  4995.10  0.00 98.30     0.00 40747.25   414.51     0.55    5.54   3.15  30.93
sdi               0.00  4996.20  0.00 97.60     0.00 40724.88   417.25     0.54    5.56   3.12  30.49
sdm               0.00  4996.60  0.00 96.60     0.00 40720.08   421.52     0.54    5.57   3.14  30.29
sdq               0.00  4987.21  1.30 97.90    10.39 40655.34   409.93     0.55    5.52   3.10  30.77
sdu               0.00  4997.80  0.00 97.50     0.00 40736.86   417.80     0.58    5.96   3.39  33.05
sdy               0.00  5035.46  0.00 58.04     0.00 40748.05   702.04     0.34    5.80   5.33  30.93
sdac              0.00  5032.87  0.00 57.84     0.00 40725.67   704.08     0.36    6.16   5.66  32.77
dm-0              0.00     0.00  1.30 40737.06    10.39 325896.50     8.00   237.23    5.75   0.01  38.96

and top looked like this:

top - 13:58:25 up  2:47,  1 user,  load average: 1.40, 1.42, 1.37
Tasks: 200 total,   5 running, 195 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 11.9%sy,  0.0%ni, 83.6%id,  2.2%wa,  0.0%hi,  2.4%si,  0.0%st
Mem:  66115496k total, 66004292k used,   111204k free,   146224k buffers
Swap:  2040212k total,     4212k used,  2036000k free, 43686528k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
   9476 root      20   0 18808 4788  560 S   76  0.0  96:40.63 aiod             
   9373 root      15  -5     0    0    0 S    6  0.0   7:53.28 kjournald        
    382 root      15  -5     0    0    0 R    4  0.0   6:44.07 kswapd1          
    383 root      15  -5     0    0    0 R    4  0.0   6:08.69 kswapd2          
    384 root      15  -5     0    0    0 R    3  0.0   5:48.96 kswapd3          
    128 root      15  -5     0    0    0 S    1  0.0   0:39.71 kblockd/6        
    ...
-------------------------------------------------------------------------------


I'll be filling in some of the holes in the table above (but move everything forward
to 2.6.26-rc7 plus the current patch queue). Note also that these are preliminary
results and the test rig is subject to change. If you have suggestions for
improvements, I'd be very interested in hearing them.


Thanks,
Nick



----
PS. One additional tidbit: I ran fsck on the ext4 filesystem - it took about an
hour and a half (I presume uninit_bg would speed that up substantially since
there are only a handful of inodes in use). But I got an interesting question
from it:

    # time fsck.ext4 /dev/mapper/bigvg-bigvol
    e2fsck 1.41-WIP (17-Jun-2008)
    /dev/mapper/bigvg-bigvol primary superblock features different from backup, check forced.
    Pass 1: Checking inodes, blocks, and sizes
    Inode 49153, i_size is 14680064000000, should be 14680064000000.  Fix<y>? y
    yes

    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information

    /dev/mapper/bigvg-bigvol: ***** FILE SYSTEM WAS MODIFIED *****
    /dev/mapper/bigvg-bigvol: 12/2050768896 files (8.3% non-contiguous), 3648395645/4101537792 blocks

    real	197m45.245s
    user	24m56.602s
    sys	9m16.335s


The real time includes about 30 mins of waiting at the ? while I was not
looking, so it's not as bad as it looks.  But why was there a question
since the reported sizes are the same? and why are the primary
superblock features different from backup?


Also, during pass1, every disk was being hit:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.79    0.00    1.70    9.72    0.00   86.79

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda             882.20     0.00 61.80  0.00  7552.00     0.00   122.20     0.29    4.64   4.27  26.36
sde             898.60     0.00 45.50  0.00  7552.00     0.00   165.98     0.26    5.78   4.84  22.00
sdi             880.10     0.00 63.90  0.00  7552.00     0.00   118.18     0.29    4.58   4.18  26.72
sdm             900.60     0.00 43.40  0.00  7552.00     0.00   174.01     0.25    5.73   4.87  21.12
sdq            1195.10     0.00 193.60  0.00 11101.60     0.00    57.34     0.69    3.56   2.07  40.00
sdu            1273.20     0.00 55.60  0.00 10630.40     0.00   191.19     0.30    5.45   5.00  27.80
sdy             880.70     0.00 63.90  0.00  7556.80     0.00   118.26     0.28    4.44   3.96  25.28
sdac            899.20     0.00 48.00  0.00  7577.60     0.00   157.87     0.26    5.49   4.57  21.92
dm-0              0.00     0.00 8385.60  0.00 67084.80     0.00     8.00    44.06    5.25   0.11  95.16


During pass5, only one disk was being hit:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    0.14   12.39    0.00   87.46

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdm               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdq             391.20     0.00 260.70  0.00  5212.80     0.00    20.00     1.00    3.99   3.83  99.76
sdu               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdy               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdac              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00 651.90  0.00  5215.20     0.00     8.00     2.66    4.14   1.53  99.72


Perhaps this indicates a less-than-optimal set-up of the storage?
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html