linux-kernel - Re: [PATCH] sched/fair: change where we report sched stats V2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <548F0025.4040203@fb.com>
Date:	Mon, 15 Dec 2014 10:37:09 -0500
From:	Josef Bacik <jbacik@...com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	<bmaurer@...com>, <rkroll@...com>, <kernel-team@...com>,
	<mingo@...hat.com>, <linux-kernel@...r.kernel.org>,
	<umgwanakikbuti@...il.com>
Subject: Re: [PATCH] sched/fair: change where we report sched stats V2

On 12/15/2014 05:16 AM, Peter Zijlstra wrote:
> On Thu, Dec 11, 2014 at 10:59:55AM -0500, Josef Bacik wrote:
>> The schedule stats currently spit out once the entity is being queued, which
>> means if we have stack traces enabled we will get the stack trace of the waker,
>> not of the task being woken.  This makes the backtrace completely useless when
>> trying to track down latency spikes, as we want to know why we were put to sleep
>> for as long as we were.
>>
>> This patch moves the trace_sched_stat* calls to after the task has been put back
>> on the CPU so that the backtrace we get is the one that put us to sleep.  Mike
>> Galbraith suggested that I leave the actual stat collection stuff where it is so
>> that we don't add the scheduler latency to the sleep stats, so I've added 3 new
>> fields in the scheduler stats to keep track of the last sleep time so that if
>> any of these are set we know to trigger the trace points.  Thanks,
>>
>
> Yeah, so I don't like this, it adds overhead for everyone.
>

Only if SCHEDSTATS is enabled tho, and it's no more overhead in the 
SCHEDSTATS case than before.  Would it be more acceptable to move the 
entire callback under SCHEDSTATS?

> Also, I'm very sure this issue has been raised before, and I'm pretty
> sure the answer back then was why don't you look at the
> trace_sched_switch() site matching this wakeup?
>

This is fine for discrete problems, but when trying to find a random 
latency spike in a production workload it's impossible. If I do

trace-cmd record -e sched:sched_switch -T sleep 5

on just one of our random web servers I end up with this

du -h trace.dat
62M     trace.dat

thats 62 megs in 5 seconds.  I ran the following command for almost 2 
hours when searching for a latency spike

trace-cmd record -B latency -e sched:sched_stat_blocked -f \"delay >= 
100000\" -T -o /root/latency.dat

and got the following .dat file

du -h latency.dat
48M     latency.dat

That right there is a huge reason for this change, so I'd really like to 
do whatever it takes to get this fixed.  I'd be happy to move the 
callback under CONFIG_SCHEDSTATS to reduce any possible overhead you 
would be concerned about.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/