Message-ID: <20071116221404.GC31527@vmware.com>
Date: Fri, 16 Nov 2007 14:14:04 -0800
From: Micah Dowty <micah@...are.com>
To: Dmitry Adamushko <dmitry.adamushko@...il.com>
Cc: Ingo Molnar <mingo@...e.hu>, Christoph Lameter <clameter@....com>,
Kyle Moffett <mrmacman_g4@....com>,
Cyrus Massoumi <cyrusm@....net>,
LKML Kernel <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...l.org>, Mike Galbraith <efault@....de>,
Paul Menage <menage@...gle.com>,
Peter Williams <pwil3058@...pond.net.au>
Subject: Re: High priority tasks break SMP balancer?

On Fri, Nov 16, 2007 at 11:48:50AM +0100, Dmitry Adamushko wrote:
> could you try to change either :
>
> cat /proc/sys/kernel/sched_stat_granularity
>
> set it to a value equal to one tick on your system

This didn't seem to have any effect.

> or just clear bit #3 (the bit with value 8, i.e. binary 1000) here:
>
> cat /proc/sys/kernel/sched_features
>
> (this one is enabled by default in 2.6.23.1)

Aha. Turning off bit 3 appears to instantly fix my problem while it's
occurring in an existing process, and I can't reproduce it on any new
processes afterward.
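
(For anyone following along: since bit #3 has value 8, turning it off just
means writing the current mask back with that bit cleared. E.g. if
/proc/sys/kernel/sched_features happened to read 14 (binary 1110), echoing 6
(binary 0110) back into it would disable that feature. The 14 is only an
illustrative value, not the actual default mask.)
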
> anyway, when it comes to calculating rq->cpu_load[], a nice(0) cpu-hog
> task (on cpu_0) may generate a similar load (contribute to
> rq->cpu_load[]) as e.g. some negatively reniced task (on cpu_1) which
> runs only periodically (say, once per tick for N ms, etc.) [*]
>
> The thing is that the higher the prio of a task, the bigger 'weight'
> it has (the prio_to_weight[] table in sched.c) ... and roughly, the load it
> generates is 'proportional' not only to its 'run-time per fixed interval
> of time' but also to its 'weight'. That's why the [*] above.

Right. I gathered from reading the scheduler source earlier that the
load average is intended to be proportional to the priority of the
task, but I was really confused by the fairly nondeterministic effect
on the cpu_load average that my test process is having.

> so you may have a situation :
>
> cpu_0 : e.g. a nice(-20) task running periodically every tick and
> generating, say ~10% cpu load ;

Part of the problem may be that my high-priority task can run much
more often than every tick. In my test case, and in the VMware code
where I originally observed the problem, the thread can wake up
based on /dev/rtc or on a device IRQ. Either of these can happen much
more frequently than the scheduler tick, if I understand correctly.
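
The kind of /dev/rtc wakeup loop I'm talking about looks roughly like this.
It's a simplified sketch rather than my actual test code; the 1024 Hz rate is
just an example, and the thread is assumed to have been reniced to -20
already:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

int main(void)
{
        unsigned long data;
        int fd = open("/dev/rtc", O_RDONLY);

        if (fd < 0) {
                perror("open /dev/rtc");
                return 1;
        }

        /* Ask for periodic interrupts well above the scheduler tick. */
        if (ioctl(fd, RTC_IRQP_SET, 1024) < 0 ||
            ioctl(fd, RTC_PIE_ON, 0) < 0) {
                perror("rtc ioctl");
                return 1;
        }

        for (;;) {
                /*
                 * Each read() blocks until the next RTC interrupt, so the
                 * thread wakes, does a tiny burst of work, and sleeps again,
                 * all in between scheduler ticks.
                 */
                if (read(fd, &data, sizeof(data)) < 0)
                        break;

                /* ... small amount of high-priority work here ... */
        }

        close(fd);
        return 0;
}

Depending on HZ, that's anywhere from a handful to dozens of wakeups per
scheduler tick, each of them far shorter than a tick.
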
> cpu_1 : 2-3 nice(0) cpu-hog tasks ;
>
> both cpus may be seen with similar rq->cpu_load[]...

When I try this, cpu0 has a cpu_load[] of over 10000 and cpu1 has a
load of 2048 or so.
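
(Those numbers roughly fit the weights, if I'm reading prio_to_weight[] right:
nice(0) is 1024 and nice(-20) is about 87x that, so a nice(-20) task that
happens to be caught runnable at roughly 1 in 8 of the tick samples would
average out to something like 88000 / 8 ~= 11000 of load, while two nice(0)
hogs together contribute only 2 * 1024 = 2048. The "1 in 8" is just
back-of-the-envelope to match what I'm seeing, not measured.)
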
> yeah, one would
> argue that one of the cpu hogs could be migrated to cpu_0 and consume
> the remaining 'time slots', and it would not "disturb" the nice(-20) task
> as :
> it's able to preempt the lower prio task whenever it wants (given
> fine-grained kernel preemption), and we don't care that much about
> thrashing of caches here.

Yes, that's the behaviour I expected to see (and what my application
would prefer).

> btw., without precise load balancing, there can be situations where
> the nice(-20) task (or, say, an RT periodic task) is not even seen (i.e.
> doesn't contribute to cpu_load[]) on cpu_0...
> we do sampling every tick (sched.c :: update_cpu_load()) and consider
> this_rq->ls.load.weight at that particular moment (that is, the sum of
> 'weights' of all runnable tasks on this rq)... and it may well be
> that the aforementioned high-priority task is just never (or, more likely,
> rarely) runnable at that particular moment (it runs for short intervals
> of time in between ticks).

Indeed. I think this is the major contributor to the nondeterminism
I'm seeing.
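
To convince myself of that, I sketched the sampling in plain C below. It's
only a schematic model of the decaying-average idea in update_cpu_load(), not
the kernel's actual code; the struct, the decay factors, and the 88000 /
1-in-8 numbers are all made up for illustration:

#include <stdio.h>

#define CPU_LOAD_IDX_MAX 5

struct rq_sketch {
        /* Sum of the weights of the tasks runnable at this instant. */
        unsigned long runnable_weight;
        /* Decaying load averages; a larger index means a longer "memory". */
        unsigned long cpu_load[CPU_LOAD_IDX_MAX];
};

/* One sample per scheduler tick, in this simplified model. */
static void sample_cpu_load(struct rq_sketch *rq)
{
        unsigned long this_load = rq->runnable_weight;
        int i;

        for (i = 0; i < CPU_LOAD_IDX_MAX; i++) {
                unsigned long scale = 1UL << i;   /* 1, 2, 4, 8, 16 */
                unsigned long old = rq->cpu_load[i];

                /* Decaying average of the *instantaneous* runqueue weight. */
                rq->cpu_load[i] = (old * (scale - 1) + this_load) / scale;
        }
}

int main(void)
{
        struct rq_sketch rq = { 0, { 0 } };
        int tick;

        /* Pretend a weight-88000 task is caught runnable at 1 in 8 ticks. */
        for (tick = 0; tick < 1000; tick++) {
                rq.runnable_weight = (tick % 8 == 0) ? 88000 : 0;
                sample_cpu_load(&rq);
        }

        printf("cpu_load[0..4]: %lu %lu %lu %lu %lu\n",
               rq.cpu_load[0], rq.cpu_load[1], rq.cpu_load[2],
               rq.cpu_load[3], rq.cpu_load[4]);
        return 0;
}

In this toy model the short-horizon entries swing between zero and the full
weight while the longer-horizon ones hover somewhere around the time-averaged
value; in the real case, whether the nice(-20) task's weight gets sampled at
all depends on whether it happens to be runnable at the instant the tick
fires.
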
Thanks much,
--Micah