Message-ID: <464DA61A.4040406@bigpond.net.au>
Date: Fri, 18 May 2007 23:11:54 +1000
From: Peter Williams <pwil3058@...pond.net.au>
To: Ingo Molnar <mingo@...e.hu>
CC: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [patch] CFS scheduler, -v12
Ingo Molnar wrote:
> * Peter Williams <pwil3058@...pond.net.au> wrote:
>
>> I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1
>> with and without CFS; and the problem is always present. It's not
>> "nice" related as the all four tasks are run at nice == 0.
>
> could you try -v13 and did this behavior get better in any way?
It's still there, but I've got a theory about what the problem is that
is supported by some other tests I've done.
What I'd forgotten is that I had gkrellm running as well as top (to
observe which CPU tasks were on) at the same time as the spinners were
running. This meant that between them top, gkrellm and X were using
about 2% of the CPU -- not much but enough to make it possible that at
least one of them was running when the load balancer was trying to do
its thing.
This raises two possibilities: 1. the system looked balanced and 2. the
system didn't look balanced but one of top, gkrellm or X was moved
instead of one of the spinners.
If it's 1 then there's not much we can do about it except say that it
only happens in these strange circumstances. If it's 2 then we may have
to modify the way move_tasks() selects which tasks to move (if we think
that the circumstances warrant it -- I'm not sure that this is the case).
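The two possibilities can be illustrated with a toy model. This is only a
sketch: the task names, the "move one task if the queues differ by two or
more" rule and the pick policies are all made up for illustration and do
not reproduce what move_tasks() actually does.

```python
# Toy model only: names, threshold and pick policies are hypothetical.
def balance_once(cpu0, cpu1, pick):
    """Move one task from the longer run queue to the shorter one."""
    if abs(len(cpu0) - len(cpu1)) < 2:
        # Queues look balanced enough, so nothing is moved.
        return list(cpu0), list(cpu1)
    if len(cpu0) > len(cpu1):
        task = pick(cpu0)
        return [t for t in cpu0 if t != task], cpu1 + [task]
    task = pick(cpu1)
    return cpu0 + [task], [t for t in cpu1 if t != task]


def spinners(queue):
    """Count the CPU-bound spinner tasks on a run queue."""
    return sum(1 for t in queue if t.startswith("spin"))


# Possibility 1: top happens to be runnable on the lightly loaded CPU,
# so the queues look nearly balanced and the 1/3 split is left alone.
looks_balanced = balance_once(["spin1", "top"],
                              ["spin2", "spin3", "spin4"],
                              pick=lambda q: q[0])

# Possibility 2: the imbalance is seen, but top is moved, not a spinner.
moved_top = balance_once(["spin1"],
                         ["spin2", "spin3", "spin4", "top"],
                         pick=lambda q: q[-1])

# The outcome we want: a spinner is moved and the split becomes 2/2.
moved_spinner = balance_once(["spin1"],
                             ["spin2", "spin3", "spin4", "top"],
                             pick=lambda q: q[0])
```

In both of the first two cases the spinners stay split 1/3, even though in
the second case the balancer did move a task.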
To examine these possibilities I tried two variations of the test.
a. run the spinners at nice == -10 instead of nice == 0. When I did
this the load balancing was perfect on 10 consecutive runs which
according to my calculations makes it 99.9999997% certain that this
didn't happen by chance. This supports theory 2 above.
b. run the tests without gkrellm running but use nice == 0 for the
spinners. When I did this the load balancing was mostly perfect, though
quite volatile (switching between a 2/2 and a 1/3 allocation of spinners
to CPUs), and the %CPU allocation was quite good, with the spinners all
getting approximately 49% of a CPU each. This also supports theory 2
above and gives weak support to theory 1 above.
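The certainty figure in variation (a) depends on the chance of a single
run balancing perfectly by accident, which isn't stated above. A minimal
sketch of that kind of calculation follows. The per-run probability p is
a purely hypothetical input, and the runs are assumed to be independent.

```python
def chance_of_streak(p, runs):
    """Probability that `runs` independent runs all balance by chance.

    p is an assumed per-run probability, not a measured value.
    """
    return p ** runs


def certainty(p, runs):
    """Complement: how sure we can be that the streak was not chance."""
    return 1.0 - chance_of_streak(p, runs)


# Example with a made-up p of 0.5 and the 10 runs from variation (a).
example = certainty(0.5, 10)
```

The smaller the assumed p, the faster the certainty approaches 1 as the
number of consecutive perfect runs grows.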
This leaves the question of what to do about it. Given that most CPU
intensive tasks on a real system probably only run for a few tens of
milliseconds, it probably won't matter much in practice, except that
a malicious user could exploit it to disrupt a system.
So my opinion is that we probably do need to do something about it but
that it's not urgent.
One thing that might work is to jitter the load balancing interval a
bit. The reason I say this is that one of the characteristics of top
and gkrellm is that they run at a more or less constant interval (and,
in this case, X would also be following this pattern as it's doing
screen updates for top and gkrellm) and this means that it's possible
for the load balancing interval to synchronize with their intervals
which in turn causes the observed problem. A jittered load balancing
interval should break the synchronization. This would certainly be
simpler than trying to change the move_tasks() logic for selecting which
tasks to move.
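To show why jitter should help, here is a small simulation. It is a
sketch under assumptions, not kernel code: the periodic task is a
stand-in for top/gkrellm that wakes every 1000 ms and runs for 20 ms
(about 2% of a CPU), and the balancer interval and jitter range are
arbitrary values chosen for illustration.

```python
import random


def busy_at(t_ms, period_ms=1000.0, run_ms=20.0):
    """True if the (hypothetical) periodic task is on a CPU at time t_ms."""
    return (t_ms % period_ms) < run_ms


def hit_fraction(interval_ms, jitter_ms, samples=10000, seed=1):
    """Fraction of balancer runs that coincide with the periodic task.

    Each balancer run is scheduled interval_ms after the previous one,
    plus a uniform random offset in [-jitter_ms, +jitter_ms].
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    t = 0.0
    hits = 0
    for _ in range(samples):
        if busy_at(t):
            hits += 1
        t += interval_ms + rng.uniform(-jitter_ms, jitter_ms)
    return hits / samples


if __name__ == "__main__":
    # Fixed interval locked to the task's period: every run collides.
    print("no jitter:  ", hit_fraction(1000.0, 0.0))
    # Jittered interval: collisions become far less frequent.
    print("with jitter:", hit_fraction(1000.0, 100.0))
```

With no jitter and the interval equal to the task's period, every
balancer run sees the periodic task on a CPU. With jitter the phase
drifts, so the collision rate should drop toward the task's share of CPU
time.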
What do you think?
Peter
--
Peter Williams pwil3058@...pond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce