linux-kernel - Re: scheduler scalability - cgroups, cpusets and load-balancing

Message-Id: <20080129102836.be614579.pj@sgi.com>
Date:	Tue, 29 Jan 2008 10:28:36 -0600
From:	Paul Jackson <pj@....com>
To:	"Gregory Haskins" <ghaskins@...ell.com>
Cc:	a.p.zijlstra@...llo.nl, mingo@...e.hu, dmitry.adamushko@...il.com,
	rostedt@...dmis.org, menage@...gle.com, rientjes@...gle.com,
	tong.n.li@...el.com, tglx@...utronix.de, akpm@...ux-foundation.org,
	dhaval@...ux.vnet.ibm.com, vatsa@...ux.vnet.ibm.com,
	sgrubb@...hat.com, linux-kernel@...r.kernel.org,
	ebiederm@...ssion.com, nickpiggin@...oo.com.au
Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Gregory wrote:
>   I am a bit confused as to why you disable load-balancing in the
>   RT cpuset?  It shouldn't be strictly necessary in order for the
>   RT scheduler to do its job (unless I am misunderstanding what you
>   are trying to accomplish?).  Do you do this because you *have*
>   to in order to make real-time deadlines, or because it's just a
>   further optimization?

My primary motivation for cpusets originally, and for the
sched_load_balance flag now, was not realtime, but "soft partitioning"
of big NUMA systems, especially for batch schedulers.  They sometimes
have large cpusets which are only being used to hold smaller, per-job,
cpusets.  It is a waste of time (CPU cycles in the kernel sched code)
to load balance those large cpusets.  Load balancing doesn't scale
easily to high CPU counts, and it's nice to avoid doing that where
not needed.

See the following lkml message for a fuller explanation:

  http://lkml.org/lkml/2008/1/29/85
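
As a rough illustration of the soft-partitioning case above, here is a
minimal sketch of how a batch scheduler might turn off load balancing on
a large holding cpuset. It assumes the cpuset filesystem is mounted at
/dev/cpuset and that the holding cpuset is named "batch"; both are
assumptions, not taken from this message, and on some mounts the file
may carry a "cpuset." prefix. The helper name cpuset_write_flag is
hypothetical; it only writes 0 or 1 to a per-cpuset flag file such as
'sched_load_balance'.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write an integer flag value (0 or 1) to a
 * cpuset control file named 'flag' inside the directory 'cpuset_dir'.
 * Returns 0 on success, -1 on any failure.
 *
 * Assumed usage (paths are illustrative only):
 *     cpuset_write_flag("/dev/cpuset/batch", "sched_load_balance", 0);
 */
int cpuset_write_flag(const char *cpuset_dir, const char *flag, int value)
{
    char path[512];
    FILE *f;
    int n;

    /* Build "<cpuset_dir>/<flag>" and reject paths that do not fit. */
    n = snprintf(path, sizeof(path), "%s/%s", cpuset_dir, flag);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    /* Control files of this kind take the value as a short text line. */
    if (fprintf(f, "%d\n", value) < 0) {
        fclose(f);
        return -1;
    }

    /* fclose() can report a deferred write error, so check it too. */
    return fclose(f) == 0 ? 0 : -1;
}
```

The smaller per-job cpusets underneath would keep their own flag set to
1, so that balancing still happens within each job's CPUs.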

As a secondary motivation, I thought that disabling load balancing on
the RT cpuset was the right thing to do for RT needs, but I make no
claim to knowing much about RT.

I just now realized that you added a 'root_domain' in a patch in
late Nov and early Dec.   I was on the road then, moving from
California to Texas, and not paying much attention to Linux.

A couple of questions on that patch, both involving a comment it adds
to kernel/sched.c:

/*
 * We add the notion of a root-domain which will be used to define per-domain
 * variables. Each exclusive cpuset essentially defines an island domain by
 * fully partitioning the member cpus from any other cpuset. Whenever a new
 * exclusive cpuset is created, we also create and attach a new root-domain
 * object.
 */

1) What are 'per-domain' variables?

2) The mention of 'exclusive cpuset' is no longer correct.

   With the patch 'remove sched domain hooks from cpusets', cpusets
   no longer defines sched domains using the cpu_exclusive flag.

   With the subsequent sched_load_balance patch (see
   http://lkml.org/lkml/2007/10/6/19) cpusets uses a new per-cpuset
   flag 'sched_load_balance' to define sched domains.

The following revised comment might be more accurate:

/*
 * We add the notion of a root-domain which will be used to define per-domain
 * variables.  Each non-overlapping sched domain defines an island domain by
 * fully partitioning the member cpus from any other cpuset. Whenever such
 * a sched domain is created, we also create and attach a new root-domain
 * object.  These non-overlapping sched domains are determined by the cpuset
 * configuration, via a call to partition_sched_domains().
 */

It sounds like you (Gregory, others) want your RT CPUs to be in a sched
domain, unlike the current way things are, where my cpuset code
carefully avoids setting up a sched domain for those CPUs.  However I
still have need, in the batch scheduler case explained above, to have
some CPUs not in any sched domain.
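
To make the "non-overlapping sched domains" idea concrete, here is a
simplified user-space model, not the kernel's actual code. It assumes
each cpuset is just a 64-bit CPU mask plus its sched_load_balance flag;
the names model_cpuset and model_partition are hypothetical. Cpusets
with the flag set become candidate domains, and candidates that share a
CPU are merged until all are disjoint. CPUs covered only by cpusets with
the flag cleared end up in no domain at all, which is the batch
scheduler case described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cpuset: a CPU bitmask plus the balance flag. */
struct model_cpuset {
    uint64_t cpus;            /* bit i set => CPU i is in this cpuset */
    int sched_load_balance;   /* 1 => balance load across these CPUs */
};

/*
 * Compute non-overlapping domains from the load-balanced cpusets.
 * Writes at most 'max' masks into 'doms' and returns how many result.
 * Candidates beyond 'max' are silently ignored in this sketch.
 */
size_t model_partition(const struct model_cpuset *cs, size_t n,
                       uint64_t *doms, size_t max)
{
    size_t ndoms = 0;
    size_t i, j;

    /* One candidate domain per non-empty cpuset with the flag set. */
    for (i = 0; i < n && ndoms < max; i++) {
        if (cs[i].sched_load_balance && cs[i].cpus)
            doms[ndoms++] = cs[i].cpus;
    }

    /* Merge any two candidates that share a CPU, then rescan. */
    for (i = 0; i < ndoms; ) {
        int merged = 0;

        for (j = i + 1; j < ndoms; j++) {
            if (doms[i] & doms[j]) {
                doms[i] |= doms[j];
                doms[j] = doms[--ndoms];  /* drop slot j, keep array dense */
                merged = 1;
                break;
            }
        }
        if (!merged)
            i++;  /* doms[i] is now disjoint from every later candidate */
    }
    return ndoms;
}
```

In this model, a top cpuset with the flag cleared, holding two per-job
cpusets on CPUs 0-1 and 2-3 with the flag set, yields two island
domains, and any remaining CPUs belong to none.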

If you require these RT sched domains to be set up differently somehow,
in some way that is visible to partition_sched_domains, then that
apparently means we need a per-cpuset flag to mark those RT cpusets.

If you just want an ordinary sched domain setup (just so long as it
contains only the intended RT CPUs, not others) then I guess we don't
technically need any more per-cpuset flags, but I'm worried, because
the API we're presenting to users for this has just gone from subtle to
bizarre.  I suspect I'll want to add a flag anyway, if by doing so, I
can make the kernel-user API, via cpusets, easier to understand.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@....com> 1.940.382.4214