Date:	Mon, 21 Dec 2009 14:11:46 +0100
From:	Peter Zijlstra <peterz@...radead.org>
To:	venkatesh.pallipadi@...el.com
Cc:	Gautham R Shenoy <ego@...ibm.com>,
	Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...e.hu>,
	Thomas Gleixner <tglx@...utronix.de>,
	Arjan van de Ven <arjan@...radead.org>,
	linux-kernel@...r.kernel.org,
	Suresh Siddha <suresh.b.siddha@...el.com>
Subject: Re: [patch 2/2] sched: Scale the nohz_tracker logic by making it
 per NUMA node

On Thu, 2009-12-10 at 17:27 -0800, venkatesh.pallipadi@...el.com wrote:
> plain text document attachment
> (0002-sched-Scale-the-nohz_tracker-logic-by-making-it-per.patch)
> Having one idle CPU doing the rebalancing for all the idle CPUs in
> nohz mode does not scale well with increasing number of cores and
> sockets. Make the nohz_tracker per NUMA node. This results in multiple
> idle load balancing happening at NUMA node level and idle load balancer
> only does the rebalance domain among all the other nohz CPUs in that
> NUMA node.
> 
> This addresses the below problem with the current nohz ilb logic
> * The lone balancer may end up spending a lot of time doing the
>   balancing on behalf of nohz CPUs, especially with increasing number
>   of sockets and cores in the platform.

Right, so I think the whole NODE idea here is wrong; it all seems to
work out properly if you simply pick a sched domain larger than the one
spanning the current socket, one that contains an idle unit.

Except that the sched domain stuff is not properly aware of bigger
topology things atm.

The sched domain tree should not treat the node as the largest
structure, and we should remove the current random node-split crap we
have.

Instead the sched domains should continue to express the topology, like
nodes within 1 hop, nodes within 2 hops, etc.

Then this nohz idle balancing should pick the socket level (which might
be larger than the node level) and walk up the domain tree, until we
reach a level that contains a wholly idle group.

This means that we'll always span at least 2 sockets, which means we'll
gracefully deal with the overload scenario.
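The walk described above can be sketched in user-space C. This is a
simplified, hypothetical model, not the kernel's actual sched_domain
API: each domain points at its parent (the next-larger span) and holds
a set of groups, and we climb from the socket level until some group is
entirely idle.

```c
#include <stddef.h>

/* Hypothetical model of a sched domain level: a set of CPU groups plus
 * a pointer to the next-larger span (NULL at the top of the tree). */
struct group {
    int ncpus;
    const int *cpu_idle;     /* per CPU in the group: 1 = idle, 0 = busy */
};

struct domain {
    struct domain *parent;   /* next-larger topology span */
    int ngroups;
    const struct group *groups;
};

static int group_all_idle(const struct group *g)
{
    for (int i = 0; i < g->ncpus; i++)
        if (!g->cpu_idle[i])
            return 0;
    return 1;
}

/* Start at the socket-level domain and walk toward larger spans until
 * we reach a level containing at least one wholly idle group. */
static const struct domain *find_idle_span(const struct domain *socket)
{
    for (const struct domain *d = socket; d; d = d->parent)
        for (int i = 0; i < d->ngroups; i++)
            if (group_all_idle(&d->groups[i]))
                return d;
    return NULL;  /* no fully idle group anywhere: system is loaded */
}
```

Because the starting level already spans a whole socket, any level the
walk settles on covers at least two sockets' worth of CPUs, which is
what gives the graceful behaviour in the overload case.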


