Message-Id: <1215049766.2840.43.camel@dwillia2-linux.ch.intel.com>
Date: Wed, 02 Jul 2008 18:49:26 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Mel Gorman <mel@....ul.ie>
Cc: Andy Whitcroft <apw@...dowen.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, NeilBrown <neilb@...e.de>,
babydr@...y-dragons.com, cl@...ux-foundation.org,
lee.schermerhorn@...com
Subject: Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
On Tue, 2008-07-01 at 22:18 -0700, Mel Gorman wrote:
> > Let us take commit 8b3e6cdc out of the equation and just look at raid0
> > performance:
> >
> > revision     2.6.25.8-fc8  54a6eb5c  54a6eb5c-nohighmem  2.6.26-rc8
> >              279           278       273                 277
> >              281           278       275                 277
> >              281           113       68.7                66.8
> >              279           69.2      277                 73.7
> >              278           75.6      62.5                80.3
> > MB/s (avg)   280           163       191                 155
> > % change     0%            -42%      -32%                -45%
> > result       base          bad       bad                 bad
> >
>
> Ok, based on your other mail, 54a6eb5c here is a bisection point. The good
> figures are on par with the "good" kernel with some disastrous runs leading
> to a bad average. The thing is, the bad results are way worse than could be
> accounted for by two-zonelist alone. In fact, the figures look suspiciously
> like only 1 disk is in use as they are roughly quartered. Can you think of
> anything that would explain that?
raid0, in contrast to raid5, is fairly simple: it just remaps the bio.
It is then up to the block layer to do some merging before the request
hits the block device driver. There is not much room for error,
especially since single-drive performance seems to be unaffected. The
data suggests that the vm is not queuing pages for writeback fast
enough.
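To make "just remaps the bio" concrete, here is a rough sketch of the
striping arithmetic, assuming a 64KiB chunk (128 sectors) and 4 member
disks. This is illustrative arithmetic only, not the actual
drivers/md/raid0.c code:

```shell
# Map an array sector to (member disk, disk sector) for a simple stripe.
chunk_sectors=128                                   # 64KiB chunks
ndisks=4
sector=1000                                         # example array sector
chunk=$(( sector / chunk_sectors ))                 # chunk index in the array
disk=$(( chunk % ndisks ))                          # round-robin member disk
offset=$(( (chunk / ndisks) * chunk_sectors + sector % chunk_sectors ))
echo "array sector $sector -> disk $disk, sector $offset"
```

Since the remap is a few integer operations per bio, raid0 itself is an
unlikely place for a 4x throughput loss to hide.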
> Can you also confirm that using a bisection
> point before two-zonelist runs steadily and with high performance as expected
> please? This is to rule out some other RAID patch being a factor.
The immediately preceding commit 18ea7e71 always runs with high
performance, and raid0 was virtually untouched during the merge window:
$ git shortlog v2.6.25..v2.6.26-rc8 drivers/md/raid0.c
Neil Brown (1):
      Remove blkdev warning triggered by using md
> It would be worth running vmstat during the tests so we can see if IO
> rates are dropping from an overall system perspective. If possible,
> oprofile data from the same time would be helpful to see does it show up
> where we are getting stuck.
I watched /proc/vmstat during a dd run. In the good case nr_writeback
never drops below 21,000 pages; in the bad case it only briefly rises
above a few hundred pages.
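For reference, sampling the writeback depth is just a loop over
/proc/vmstat along these lines (the one-second interval and iteration
count are arbitrary; run it alongside the dd):

```shell
# Print the number of pages currently under writeback, once a second.
# The trailing space in the pattern avoids matching nr_writeback_temp.
for i in 1 2 3; do
    awk '/^nr_writeback /{print $2}' /proc/vmstat
    sleep 1
done
```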
This picture generated by seekwatcher shows that there are gaps in i/o
submission:
http://userweb.kernel.org/~djbw/md0-bad2.png
Compared to a good run:
http://userweb.kernel.org/~djbw/md0-good.png
What is interesting is that I was only able to get this data with the
following command:
./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb
When I tried to watch all four disks, i.e. 5 blktrace tasks running
instead of 2:
./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb -d /dev/sdc -d /dev/sdd -d /dev/sde
...it always measured good performance, as if having more tasks running
stimulates the vm to keep writeback going at a high rate. Could
writeback be getting stalled on one or more CPUs?
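One quick way to check that theory (an assumption on my part, I have no
data for it yet) would be to snapshot the per-CPU counters in /proc/stat
before and after a run and compare how idle and iowait time are
distributed across CPUs:

```shell
# Per-CPU jiffies counters: user nice system idle iowait irq softirq ...
# Diffing two snapshots taken around a dd run would show whether the
# idle/iowait time is concentrated on a single CPU.
grep '^cpu[0-9]' /proc/stat
```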
Oprofile only seems to show that we are idling in the failure case:
---good---
samples  %        image name        symbol name
18229    20.7171  vmlinux-18ea7e71  __copy_from_user_ll_nocache_nozero
4518     5.1347   vmlinux-18ea7e71  clear_user
2137     2.4287   vmlinux-18ea7e71  get_page_from_freelist
1929     2.1923   vmlinux-18ea7e71  constant_test_bit
1745     1.9832   vmlinux-18ea7e71  clear_bit
1320     1.5002   vmlinux-18ea7e71  kmem_cache_alloc
1290     1.4661   vmlinux-18ea7e71  blk_rq_map_sg
1287     1.4627   vmlinux-18ea7e71  bio_put
1022     1.1615   raid0.ko          raid0_make_request
938      1.0660   vmlinux-18ea7e71  __make_request
938      1.0660   vmlinux-18ea7e71  test_clear_page_writeback
899      1.0217   vmlinux-18ea7e71  __percpu_counter_add
---bad---
samples  %        image name        symbol name
17844    16.3714  vmlinux-54a6eb5c  __copy_from_user_ll_nocache_nozero
4565     4.1883   vmlinux-54a6eb5c  clear_user
2229     2.0450   vmlinux-54a6eb5c  get_page_from_freelist
2202     2.0203   vmlinux-54a6eb5c  mwait_idle
2129     1.9533   vmlinux-54a6eb5c  schedule
1882     1.7267   vmlinux-54a6eb5c  constant_test_bit
1730     1.5872   vmlinux-54a6eb5c  clear_bit
1157     1.0615   vmlinux-54a6eb5c  kmem_cache_alloc
> > These numbers are taken from the results of:
> > for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
> >
> > Where md0 is created by:
> > mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0
> >
> > I will try your debug patch next Mel, and then try to collect more data
> > with blktrace.
> >
>
> I tried reproducing this but I don't have the necessary hardware to even
> come close to reproducing your test case :( . I got some dbench results
> with oprofile but found no significant differences between 2.6.25 and
> 2.6.26-rc8. However, I did find the results of dbench varied less between
> runs with the "repork" patch that made next_zones_zonelist() an inline
> function. Have you tried that patch yet?
I gave the patch a shot but observed no change. As you predicted, we
are looking at something more pronounced than some lost CPU cycles.
--
Dan