Message-ID: <50B93D3D.3030106@iskon.hr>
Date:	Sat, 01 Dec 2012 00:11:57 +0100
From:	Zlatko Calusic <zlatko.calusic@...on.hr>
To:	Tejun Heo <tj@...nel.org>
CC:	linux-kernel@...r.kernel.org
Subject: Re: High context switch rate, ksoftirqd's chewing cpu

On 30.11.2012 23:52, Tejun Heo wrote:
> Hello, Zlatko.
>
> Sorry about the delay.  Your message was in my spam folder.  The
> attachment seems to have confused the filter.
>
> On Sat, Nov 17, 2012 at 02:01:29PM +0100, Zlatko Calusic wrote:
>> This week I spent some hours tracking a regression in the 3.7 kernel
>> that was producing a high context switch rate on one of my machines.
>> I carefully bisected between 3.6 and 3.7-rc1 and eventually found
>> this commit to be the culprit:
>>
>> commit e7c2f967445dd2041f0f8e3179cca22bb8bb7f79
>> Author: Tejun Heo <tj@...nel.org>
>> Date:   Tue Aug 21 13:18:24 2012 -0700
>>
>>      workqueue: use mod_delayed_work() instead of __cancel + queue
> ...
>>
>> Then I carefully reverted chunk by chunk to find out what exact
>> change is responsible for the regression. You can find it attached
>> as wq.patch (to preserve whitespace). Very simple modification with
>> wildly different behavior on only one of my machines, weird. I'm
>> also attaching ctxt/s graph that shows the impact nicely. I'll
>> gladly provide any additional info that could help you resolve this.
>>
>> Please Cc: on reply (not subscribed to lkml).
>>
>> Regards,
>> --
>> Zlatko
>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 4b4dbdf..4b8b606 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -319,10 +319,8 @@ EXPORT_SYMBOL(__blk_run_queue);
>>    */
>>   void blk_run_queue_async(struct request_queue *q)
>>   {
>> -	if (likely(!blk_queue_stopped(q))) {
>> -		__cancel_delayed_work(&q->delay_work);
>> -		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
>> -	}
>> +	if (likely(!blk_queue_stopped(q)))
>> +		mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);
>>   }
>>   EXPORT_SYMBOL(blk_run_queue_async);
>
> That's interesting.  Is there anything else noticeably different
> besides the ctxsw counts?  e.g. CPU usage, IO throughput / latency, etc...
> Also, can you please post the kernel boot log from the machine?  I
> assume that the issue is readily reproducible?  Are you up for trying
> some debug patches?
>
> Thanks.
>

Hey Tejun! Thanks for replying.

It's an older C2D machine; I've attached the kernel boot log. The funny
thing is that I don't observe any problems on the other half dozen
machines, only on this one, and it's reproducible every time. I don't
see any other anomalies besides the two I already mentioned: a high
context switch rate and the ksoftirqd daemons eating more CPU, probably
as a consequence.
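
For anyone who wants to compare numbers on their own machine, here is a
minimal sketch of how a ctxt/s figure can be sampled. It assumes a Linux
/proc/stat that contains a cumulative "ctxt" line; the function names and
the one-second interval are made up for illustration, and this is not
necessarily how the attached graph was produced.

```python
import time


def parse_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The line of interest looks like: "ctxt 123456789"
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")


def ctxt_rate(before, after, seconds):
    """Context switches per second between two cumulative samples."""
    return (after - before) / seconds


def sample_rate(interval=1.0, path="/proc/stat"):
    """Read the counter twice, `interval` seconds apart (hypothetical helper)."""
    with open(path) as f:
        before = parse_ctxt(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_ctxt(f.read())
    return ctxt_rate(before, after, interval)
```

Calling sample_rate() on a kernel with and without the change quoted
above should show the difference, if the regression reproduces there.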

I'll gladly try your patch and send my observations tomorrow, as I've
just started an md resync on the machine, which will take a couple of
hours.

Regards,
-- 
Zlatko

View attachment "dmesg.txt" of type "text/plain" (23703 bytes)
