linux-kernel - Re: High priority tasks break SMP balancer?

Message-ID: <20071116221404.GC31527@vmware.com>
Date:	Fri, 16 Nov 2007 14:14:04 -0800
From:	Micah Dowty <micah@...are.com>
To:	Dmitry Adamushko <dmitry.adamushko@...il.com>
Cc:	Ingo Molnar <mingo@...e.hu>, Christoph Lameter <clameter@....com>,
	Kyle Moffett <mrmacman_g4@....com>,
	Cyrus Massoumi <cyrusm@....net>,
	LKML Kernel <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...l.org>, Mike Galbraith <efault@....de>,
	Paul Menage <menage@...gle.com>,
	Peter Williams <pwil3058@...pond.net.au>
Subject: Re: High priority tasks break SMP balancer?

On Fri, Nov 16, 2007 at 11:48:50AM +0100, Dmitry Adamushko wrote:
> could you try to change either :
> 
> cat /proc/sys/kernel/sched_stat_granularity
> 
> put it to the value equal to a tick on your system

This didn't seem to have any effect.

> or just clear bit #3 (value 8, binary 1000) here:
> 
> cat /proc/sys/kernel/sched_features
> 
> (this one is enabled by default in 2.6.23.1)

Aha. Turning off bit 3 appears to instantly fix my problem while it's
occurring in an existing process, and I can't reproduce it on any new
processes afterward.
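
For reference, a minimal sketch of what "turning off bit 3" amounts to. The helper names are hypothetical, and the file helper assumes the sysctl file holds a single decimal integer and that the caller may write to it (normally root); neither detail is confirmed in this thread.

```python
# Sketch only: the helper names below are made up for illustration.
SCHED_FEATURES_PATH = "/proc/sys/kernel/sched_features"  # path quoted in the thread


def clear_bit(value: int, bit: int) -> int:
    """Return value with the given bit (0-based) cleared."""
    return value & ~(1 << bit)


def disable_feature_bit(path: str, bit: int) -> int:
    """Read an integer bitmask from path, clear one bit, write it back.

    Assumes the file contains one decimal integer and is writable by
    the caller.
    """
    with open(path) as f:
        old = int(f.read().strip())
    new = clear_bit(old, bit)
    with open(path, "w") as f:
        f.write(f"{new}\n")
    return new
```

Calling disable_feature_bit(SCHED_FEATURES_PATH, 3) would clear the bit worth 8 and leave every other feature bit as it was.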

> anyway, when it comes to calculating rq->cpu_load[], a nice(0) cpu-hog
> task (on cpu_0) may generate a similar load (contribute to
> rq->cpu_load[]) as e.g. some negatively reniced task (on cpu_1) which
> runs only periodically (say, once in a tick for N ms., etc.) [*]
> 
> The thing is that the higher the prio of a task, the bigger 'weight'
> it has (prio_to_weight[] table in sched.c) ... and roughly, the load it
> generates is not only 'proportional' to 'run-time per fixed interval
> of time' but also to its 'weight'. That's why the [*] above.

Right. I gathered from reading the scheduler source earlier that the
load average is intended to be proportional to the priority of the
task, but I was really confused by the fairly nondeterministic effect
on the cpu_load average that my test process is having.

> so you may have a situation :
> 
> cpu_0 : e.g. a nice(-20) task running periodically every tick and
> generating, say ~10% cpu load ;

Part of the problem may be that my high-priority task can run much
more often than every tick. In my test case and in the VMware code
that I originally observed the problem in, the thread can wake up
based on /dev/rtc or on a device IRQ. Either of these can happen much
more frequently than the scheduler tick, if I understand correctly.

> cpu_1 : 2-3 nice(0) cpu-hog tasks ;
> 
> both cpus may be seen with similar rq->cpu_load[]...

When I try this, cpu0 has a cpu_load[] of over 10000 and cpu1 has a
load of 2048 or so.
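
A rough sketch of the "weight times run-time" arithmetic for this scenario. The two weights are the nice(-20) and nice(0) entries of the 2.6.23 prio_to_weight[] table; they are not quoted in this thread, so treat them as an assumption to check against your tree. The model also ignores how cpu_load[] is actually sampled and averaged.

```python
# Weights for nice(-20) and nice(0) as found in the 2.6.23 prio_to_weight[]
# table; not quoted in this thread, so verify them against your kernel tree.
WEIGHT_NICE_MINUS_20 = 88761
WEIGHT_NICE_0 = 1024


def average_weighted_load(tasks):
    """Sum weight * runnable_fraction over (weight, runnable_fraction) pairs."""
    return sum(weight * fraction for weight, fraction in tasks)


# cpu_0: one nice(-20) task that is runnable about 10% of the time.
cpu0 = average_weighted_load([(WEIGHT_NICE_MINUS_20, 0.10)])
# cpu_1: two nice(0) cpu hogs that are always runnable.
cpu1 = average_weighted_load([(WEIGHT_NICE_0, 1.0)] * 2)
```

Under this simplified model cpu_0 comes out near 8876 and cpu_1 at 2048, so a mostly idle CPU with one nice(-20) task can look heavier than a CPU running two hogs.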

> yeah, one would
> argue that one of the cpu hogs could be migrated to cpu_0 and consume
> the remaining 'time slots', and it would not "disturb" the nice(-20) task,
> as it's able to preempt the lower prio task whenever it wants (provided
> fine-grained kernel preemption) and we don't care that much about
> thrashing of caches here.

Yes, that's the behaviour I expected to see (and what my application
would prefer).

> btw., without the precise load balancing, there can be situations when
> the nice(-20) task (or say, a RT periodic task) may not even be seen (i.e.
> doesn't contribute to cpu_load[]) on cpu_0...
> we do sampling every tick (sched.c :: update_cpu_load()) and consider
> this_rq->ls.load.weight at this particular moment (that is the sum of
> 'weights' for all runnable tasks on this rq)... and it may well be
> that the aforementioned high-priority task is just never (or more likely,
> rarely) runnable at this particular moment (it runs for a short interval
> of time in between ticks).

Indeed. I think this is the major contributor to the nondeterminism
I'm seeing.
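
A toy model of the tick-sampling effect described above. It only reproduces the instantaneous sample taken at each tick, not the averaging that update_cpu_load() then applies to cpu_load[]. The function name, the 1 ms tick (HZ=1000) and the nice(-20) weight of 88761 are assumptions for illustration.

```python
def tick_samples(weight, tick_us, period_us, runtime_us, offset_us, n_ticks):
    """Runqueue weight seen at each tick for one periodic task.

    The task becomes runnable at offset_us, offset_us + period_us, ... and
    stays runnable for runtime_us each time. All times are in microseconds.
    """
    samples = []
    for k in range(n_ticks):
        t = k * tick_us
        runnable = t >= offset_us and (t - offset_us) % period_us < runtime_us
        samples.append(weight if runnable else 0)
    return samples


WEIGHT = 88761  # nice(-20) weight in 2.6.23's table; an assumption here

# Same task, 10% duty cycle, 1 ms tick (assumes HZ=1000); only the phase differs.
aligned = tick_samples(WEIGHT, 1000, 1000, 100, 0, 10)
between = tick_samples(WEIGHT, 1000, 1000, 100, 500, 10)
```

With the wakeups aligned to the tick, every sample sees the full weight; shifted by half a tick, every sample sees zero, even though the task uses the same 10% of the CPU in both cases.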

Thanks much,
--Micah