Message-ID: <BANLkTimPhB9wBuNt+CLMv2pvYkJ27t1o+w@mail.gmail.com>
Date: Thu, 12 May 2011 08:36:05 +0800
From: Shaohua Li <shaohua.li@...el.com>
To: "Shi, Alex" <alex.shi@...el.com>, jaxboe@...ionio.com
Cc: "James.Bottomley@...senpartnership.com"
<James.Bottomley@...senpartnership.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
tim.c.chen@...el.com, akpm@...ux-foundation.org
Subject: Re: Performance drop on SCSI hard disk
2011/5/10 Shaohua Li <shaohua.li@...el.com>:
> On Tue, May 10, 2011 at 02:40:00PM +0800, Shi, Alex wrote:
>> commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
>> scsi_run_queue() to punt all requests on starved_list devices to
>> kblockd. Yes, as Jens mentioned, performance on slow SCSI disks is
>> hurt by this. :) (Intel SSDs aren't affected here.)
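>>
>> Roughly, the change turned the synchronous queue run for starved devices
>> into an asynchronous one punted to kblockd (a simplified sketch, not the
>> exact diff):
>>
>>     /* before: run the starved device's queue synchronously,
>>      * with its queue_lock held */
>>     spin_lock(sdev->request_queue->queue_lock);
>>     __blk_run_queue(sdev->request_queue);
>>     spin_unlock(sdev->request_queue->queue_lock);
>>
>>     /* after: hand the queue run off to the kblockd workqueue */
>>     blk_run_queue_async(sdev->request_queue);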
>>
>> In our testing on a 12-disk SAS JBOD, fio write with the sync ioengine drops
>> about 30~40% of throughput, fio randread/randwrite with the aio ioengine drops
>> about 20%/50% of throughput, and fio mmap testing is hurt as well.
>>
>> With the following debug patch, the performance is fully recovered
>> in our testing. But without the REENTER flag, in some corner cases, e.g.
>> a device that keeps getting blocked and then unblocked repeatedly,
>> __blk_run_queue() may recursively call scsi_run_queue() and then cause
>> a kernel stack overflow.
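>>
>> The recursion we worry about looks roughly like this (the call chain is
>> illustrative only):
>>
>>     scsi_run_queue()
>>       -> __blk_run_queue(q)
>>         -> q->request_fn(q)           /* scsi_request_fn() */
>>           -> scsi_run_queue()         /* device unblocked again */
>>             -> __blk_run_queue(q)
>>               -> ...                  /* until the stack overflows */
>>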
>> I don't know the details of the block layer; I'm just wondering why SCSI
>> needs the REENTER flag here. :)
> Hi Jens,
> I want to add more analysis of the problem to help understand the issue.
> This is a severe problem; hopefully we can solve it before the 2.6.39 release.
>
> Basically the offending patch has several problems:
> a. more context switches
> b. __blk_run_queue() loses the recursion detection. In some cases it can
>    be called recursively, for example via blk_run_queue() in scsi_run_queue().
> c. fairness issue. Say we have sdev1 and sdev2 in starved_list, then run
>    scsi_run_queue():
>    1. both sdev1 and sdev2 are removed from starved_list
>    2. the async queue run dispatches sdev1's request; the host becomes busy again
>    3. sdev1 is added back to starved_list. Since starved_list is empty,
>       sdev1 ends up at the head
>    4. the async queue run checks sdev2 and adds sdev2 at the tail of starved_list
> In this scenario sdev1 is serviced first next time, so sdev2 is starved
> (see the sketch below).
> In our test, 12 disks are connected to one HBA card. Each disk's queue depth
> is 64, while the HBA's queue depth is 128. Our test does sync writes, so the
> block size is big and just a few requests can occupy one disk's bandwidth.
> Saturating one disk while starving the others hurts total throughput.
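>
> To illustrate the ordering, a simplified sketch (not the exact scsi code;
> list helpers as in include/linux/list.h):
>
>     /* scsi_run_queue(): take every device off starved_list up front */
>     list_splice_init(&shost->starved_list, &starved_list);
>
>     /* the async run for sdev1 dispatches a request, the host becomes
>      * busy, and sdev1 is put back; the list is empty, so it lands at
>      * the head */
>     list_add_tail(&sdev1->starved_entry, &shost->starved_list);
>
>     /* the async run for sdev2 finds the host busy and queues it behind
>      * sdev1, even though sdev2 never got to dispatch anything */
>     list_add_tail(&sdev2->starved_entry, &shost->starved_list);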
>
> problem a isn't a big deal in our test (we did observe higher context
> switching, about 4x more), but I guess it will hurt high-end systems.
>
> problem b is easy to fix for SCSI: just replace blk_run_queue() with
> blk_run_queue_async() in scsi_run_queue().
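>
> i.e. something like this one-liner (an untested sketch, assuming the call
> site at the end of scsi_run_queue()):
>
>     -	blk_run_queue(q);
>     +	blk_run_queue_async(q);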
>
> problem c is the root cause of the regression. I have a patch for it.
> Basically, with my patch we don't remove an sdev from starved_list in
> scsi_run_queue(); instead we delay the removal until scsi_request_fn(),
> when a starved device really dispatches a request. My patch fully fixes
> the regression.
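>
> In rough pseudo-code, the idea only (not the actual patch):
>
>     /* scsi_run_queue(): kick the starved devices but leave them on
>      * starved_list for now */
>     list_for_each_entry(sdev, &shost->starved_list, starved_entry)
>             blk_run_queue_async(sdev->request_queue);
>
>     /* scsi_request_fn(): drop the device off the list only once it
>      * really dispatches a request */
>     if (!list_empty(&sdev->starved_entry))
>             list_del_init(&sdev->starved_entry);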
>
> But given problem a, we should revert the patch (or take Alex's patch if
> stack overflow isn't a big concern here), so I didn't post my patch here.
> Problem c actually exists even if we revert the patch (we can still hit the
> async execution path, with a small probability), but it is not as severe
> there. I can post a fix after the patch is reverted.
Hi,
Ping again. I hope this issue isn't being missed. The regression is big,
from 20% to 50% of IO throughput.
Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/