Date:	Fri, 29 Jun 2012 22:11:35 +0800
From:	Nai Xia <nai.xia@...il.com>
To:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
CC:	Andrea Arcangeli <aarcange@...hat.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	Hillf Danton <dhillf@...il.com>, Dan Smith <danms@...ibm.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...e.hu>, Paul Turner <pjt@...gle.com>,
	Suresh Siddha <suresh.b.siddha@...el.com>,
	Mike Galbraith <efault@....de>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Bharata B Rao <bharata.rao@...il.com>,
	Lee Schermerhorn <Lee.Schermerhorn@...com>,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>,
	Christoph Lameter <cl@...ux.com>,
	Alex Shi <alex.shi@...el.com>,
	Mauricio Faria de Oliveira <mauricfo@...ux.vnet.ibm.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Don Morris <don.morris@...com>,
	Benjamin Herrenschmidt <benh@...nel.crashing.org>
Subject: Re: [PATCH 13/40] autonuma: CPU follow memory algorithm



On 06/28/2012 22:46, Peter Zijlstra wrote:
> On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
>> +/*
>> + * This function sched_autonuma_balance() is responsible for deciding
>> + * which is the best CPU each process should be running on according
>> + * to the NUMA statistics collected in mm->mm_autonuma and
>> + * tsk->task_autonuma.
>> + *
>> + * The core math that evaluates the current CPU against the CPUs of
>> + * all _other_ nodes is this:
>> + *
>> + *     if (w_nid > w_other && w_nid > w_cpu_nid)
>> + *             weight = w_nid - w_other + w_nid - w_cpu_nid;
>> + *
>> + * w_nid: NUMA affinity of the current thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_other: NUMA affinity of the other thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
>> + * the current CPU.
>> + *
>> + * weight: combined NUMA affinity benefit in moving the current
>> + * thread/process to the other CPU, taking into account both the
>> + * higher NUMA affinity of the current process if run on the other
>> + * CPU, and the increase in NUMA affinity on the other CPU from
>> + * replacing the other process.
>
> A lot of words, all meaningless without a proper definition of w_*
> stuff. How are they calculated and why.
>
>> + * We run the above math on every CPU not part of the current NUMA
>> + * node, and we compare the current process against the other
>> + * processes running on the other CPUs in the remote NUMA nodes. The
>> + * objective is to select the CPU (in selected_cpu) with the biggest
>> + * "weight". The bigger the "weight", the bigger the gain we'll get
>> + * by moving the current process to the selected_cpu (not only the
>> + * biggest immediate CPU gain but also the fewer async memory
>> + * migrations that will be required to reach full convergence
>> + * later). If we select a cpu we migrate the current process to it.
>
> So you do something like:
>
> 	max_(i, node(i) != curr_node) { weight_i }
>
> That is, you have this weight, then what do you do?
>
>> + * Checking that the current process has higher NUMA affinity than
>> + * the other process on the other CPU (w_nid > w_other), and not only
>> + * that the current process has higher NUMA affinity on the other CPU
>> + * than on the current CPU (w_nid > w_cpu_nid), completely avoids
>> + * ping-pongs and ensures (temporary) convergence of the algorithm
>> + * (at least from a CPU standpoint).
>
> How does that follow?
>
>> + * It's then up to the idle balancing code that will run as soon as
>> + * the current CPU goes idle to pick the other process and move it
>> + * here (or in some other idle CPU if any).
>> + *
>> + * By only evaluating running processes against running processes we
>> + * avoid interfering with the CFS stock active idle balancing, which
>> + * is critical to optimal performance with HT enabled. (getting HT
>> + * wrong is worse than running on remote memory so the active idle
>> + * balancing has priority)
>
> what?
>
>> + * Idle balancing and all other CFS load balancing become NUMA
>> + * affinity aware through the introduction of
>> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
>> + * autonuma_node first when it needs to find idle CPUs during idle
>> + * balancing or tasks to pick during load balancing.
>
> You talk a lot about idle balance, but there's zero mention of fairness.
> This is worrisome.
>
>> + * The task's autonuma_node is the node selected by
>> + * sched_autonuma_balance() when it migrates a task to the
>> + * selected_cpu in the selected_nid
>
> I think I already said that strict was out of the question and hard
> movement like that simply didn't make sense.
>
>> + * Once a process/thread has been moved to another node, closer to the
>> + * much of memory it has recently accessed,
>
> closer to the recently accessed memory you mean?
>
>>   any memory for that task
>> + * not in the new node moves slowly (asynchronously in the background)
>> + * to the new node. This is done by the knuma_migratedN (where the
>> + * suffix N is the node id) daemon described in mm/autonuma.c.
>> + *
>> + * One non-trivial bit of this logic that deserves an explanation is
>> + * how the three crucial variables of the core math
>> + * (w_nid/w_other/w_cpu_nid) are going to change depending on whether
>> + * the other CPU is running a thread of the current process, or a
>> + * thread of a different process.
>
> No no no,.. its not a friggin detail, its absolutely crucial. Also, if
> you'd given proper definition you wouldn't need to hand wave your way
> around the dynamics either because that would simply follow from the
> definition.
>
> <snip terrible example>
>
>> + * Before scanning all other CPUs' runqueues to compute the above
>> + * math,
>
> OK, lets stop calling the one isolated conditional you mentioned 'math'.
> On its own its useless.
>
>>   we also verify that the current CPU is not already in the
>> + * preferred NUMA node from the point of view of both the process
>> + * statistics and the thread statistics. In such case we can return to
>> + * the caller without having to check any other CPUs' runqueues
>> + * because full convergence has been already reached.
>
> Things being in the 'preferred' place don't have much to do with
> convergence. Does your model have local minima/maxima where it can get
> stuck, or does it always find a global min/max?
>
>
>> + * This algorithm might be expanded to take all runnable processes
>> + * into account but examining just the currently running processes is
>> + * a good enough approximation because some runnable processes may run
>> + * only for a short time so statistically there will always be a bias
>> + * on the processes that use most of the CPU. This is ideal
>> + * because it doesn't matter if NUMA balancing isn't optimal for
>> + * processes that run only for a short time.
>
> Almost, but not quite.. it would be so if the sampling could be proven
> to be unbiased. But its quite possible for a task to consume most cpu
> time and never show up as the current task in your load-balance run.

Same here, I have a similar question regarding sampling:
if one process makes very intensive visits to a small set of pages on
this node, but only occasional visits to a large set of pages on another
node, will this algorithm make a very bad judgment? I guess the answer
would be: it's possible, and the judgment depends on the racing pattern
between the process and your knuma_scand.

Usually, when we use sampling, we work on the assumption that if the
sampling is not accurate, we only lose the chance to optimize better,
but do NOT make a bad/false judgment.

Andrea, sorry, I don't have enough time to look into all your patch
details (and also since I'm not on the CCs ;-) ),
but my intuition tells me that your current sampling and weight
algorithm is far from optimal.

>
>
>
> As it stands you wrote a lot of words.. but none of them were really
> helpful in understanding what you do.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
