linux-kernel - Re: [patch] CFS (Completely Fair Scheduler), v2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <46256102.7030104@bigpond.net.au>
Date:	Wed, 18 Apr 2007 10:06:26 +1000
From:	Peter Williams <pwil3058@...pond.net.au>
To:	Willy Tarreau <w@....eu>
CC:	Gene Heskett <gene.heskett@...il.com>, Ingo Molnar <mingo@...e.hu>,
	linux-kernel@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Con Kolivas <kernel@...ivas.org>,
	Nick Piggin <npiggin@...e.de>, Mike Galbraith <efault@....de>,
	Arjan van de Ven <arjan@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>, caglar@...dus.org.tr,
	Dmitry Adamushko <dmitry.adamushko@...il.com>
Subject: Re: [patch] CFS (Completely Fair Scheduler), v2

Willy Tarreau wrote:
> Hi Gene,
> 
> On Tue, Apr 17, 2007 at 12:53:56AM -0400, Gene Heskett wrote:
>> On Monday 16 April 2007, Ingo Molnar wrote:
>>> this is the second release of the CFS (Completely Fair Scheduler)
>>> patchset, against v2.6.21-rc7:
>>>
>>>   http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>>>
>>> i'd like to thank everyone for the tremendous amount of feedback and
>>> testing the v1 patch got - i could hardly keep up with just reading the
>>> mails! Some of the stuff people addressed i couldnt implement yet, i
>>> mostly concentrated on bugs, regressions and debuggability.
>>>
>>> there's a fair amount of churn:
>>>
>>>   15 files changed, 456 insertions(+), 241 deletions(-)
>>>
>>> But it's an encouraging sign that there was no crash bug found in v1,
>>> all the bugs were related to scheduling-behavior details. The code was
>>> tested on 3 architectures so far: i686, x86_64 and ia64. Most of the
>>> code size increase in -v2 is due to debugging helpers, they'll be
>>> removed later. (The new /proc/sched_debug file can be used to see the
>>> fine details of CFS scheduling.)
>>>
>>> Changes since -v1:
>>>
>>> - make nice levels less starvable. (reported by Willy Tarreau)
>>>
>>> - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
>>>   flag can be used to turn it on/off. (This might fix the Kaffeine bug
>>>   reported by S.Ça??lar Onur <)
>>>
>>> - changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas)
>>>
>>> - UP build fix. (reported by Gabriel C)
>>>
>>> - timer tick micro-optimization (Dmitry Adamushko)
>>>
>>> - preemption fix: sched_class->check_preempt_curr method to decide
>>>   whether to preempt after a wakeup (or at a timer tick). (Found via a
>>>   fairness-test-utility written for CFS by Mike Galbraith)
>>>
>>> - start forked children with neutral statistics instead of trying to
>>>   inherit them from the parent: Willy Tarreau reported that this
>>>   results in better behavior on extreme workloads, and it also
>>>   simplifies the code quite nicely. Removed sched_exit() and the
>>>   ->task_exit() methods.
>>>
>>> - make nice levels independent of the sched_granularity value
>>>
>>> - new /proc/sched_debug file listing runqueue details and the rbtree
>>>
>>> - new SCH-* fields in /proc/<NR>/status to see scheduling details
>>>
>>> - new cpu-hog feature (off by default) and sysctl tunable to set it:
>>>   /proc/sys/kernel/sched_max_hog_history_ns tunable defaults to
>>>   0 (off). Positive values are meant the maximum 'memory' that the
>>>   scheduler has of CPU hogs.
>>>
>>> - various code cleanups
>>>
>>> - added more statistics temporarily: sum_exec_runtime,
>>>   sum_wait_runtime.
>>>
>>> - added -CFS-v2 to EXTRAVERSION
>>>
>>> as usual, any sort of feedback, bugreports, fixes and suggestions are
>>> more than welcome,
>>>
>>> 	Ingo
>> This one (v2-rc2) is not a keeper I'm sorry to say, Ingo.  v2-rc0 was much 
>> better.  Watching amanda run with htop, kmails composer is being subjected to 
>> 5 to 10 second pauses, and htop says that gzip -best isn't getting more that 
>> 15% of the cpu, and the /amandatapes drive is being written to in a regular 
>> pattern that seems to be the cause of the pauses  according to gkrellm, which 
>> also seems to track the size of the writes, and can show anything from 4.3k 
>> to 54 megs as being written in one cycle of its screen update.
> 
> Have you tried previous version with the fair-fork patch ? It might be possible
> that your workload is sensible to the fork()'s child getting much CPU upon
> startup.
> 
> Ingo, maybe I'm saying something stupid, but in my userland scheduler, when
> new tasks are "forked", they are queued at the end of the run queue with a
> fixed priority. In our case, this would translate into assigning them the
> same prio and timeslice as their parent, but queuing them at the end so that
> they don't make existing tasks starve during huge fork() loads.
> 
> I don't know how that would be possible (nor if that would help in anything),
> but I found it was a good compromise over sharing the timeslice with the
> parent. Perhaps we should have some absolute timeslice and some relative
> timeslice (eg: X percent of total time divided by the number of tasks) ?

One way of handling forked tasks is to give them a high priority but a 
small chunk (i.e. give them a relatively short time to do some work and 
surrender the CPU voluntarily before you boot them off).  If you choose 
the size of this reduced chunk well the vast majority of tasks will 
never be booted off and will do a small bit of work and either exit or 
sleep and will suffer no penalty as a result of this mechanism.  But it 
gives you a chance to move any newly forked process that turns out to be 
a CPU hog to a lower priority before it gets its next chunk of CPU at 
which time it can revert to getting normal size chunks as pre-emption 
will stop it hogging the CPU from then on.

I've trialled this mechanism in some of my schedulers and it works well.

I found that 10 milliseconds was a good value for the initial chunk of 
CPU for a newly forked process.

Peter
-- 
Peter Williams                                   pwil3058@...pond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/