Date:	Fri, 13 Jun 2014 16:29:19 -0400
From:	Rik van Riel <riel@...hat.com>
To:	Linux kernel Mailing List <linux-kernel@...r.kernel.org>
CC:	Peter Zijlstra <peterz@...radead.org>,
	Mel Gorman <mgorman@...e.de>,
	"Vinod, Chegu" <chegu_vinod@...com>, Karen Noel <knoel@...hat.com>,
	Hai Huang <hhuang@...hat.com>,
	Andrea Arcangeli <aarcange@...hat.com>
Subject: NUMA: untangling workloads on undersubscribed systems

I am still running into a long-standing problem with the NUMA code,
and I am out of obvious ideas on how to fix it...

The scenario:
- a larger NUMA system, in this case an 8 node system with
  15-core/30-thread CPUs (ns->capacity == 18)
- 8 16-warehouse SPECjbb2005 instances
- two SPECjbb2005 instances getting stuck largely on the same node

Per-node process memory usage (in MBs):

PID             Node 0   Node 1   Node 2   Node 3   Node 4   Node 5   Node 6   Node 7    Total
------------   -------  -------  -------  -------  -------  -------  -------  -------  -------
42765 (java)     16.90   580.37     6.95     5.44  2632.76     1.88     7.12     3.46  3254.89
42761 (java)      8.72    23.09    46.19    12.64  3126.64    14.61     2.96     3.76  3238.60


The latter process is nicely concentrated on node 5. The first
process also has most of its memory on node 5, but keeps a good
amount on node 1 as well.

The total number of threads that would like to run on node 5 is 32,
which exceeds both the number of hardware threads node 5 has (30)
and ns->capacity for the node (18).

Node 1 is mostly idle, running only about 4 of the task's 16 threads.
Numatop reports a remote-to-local memory access ratio of around
0.4-0.5.

The question is, how do we decide to move more tasks from node 5 to
node 1, especially ones that have a decent group/task_score elsewhere?

We can detect some things:
1) ns->nr_running > ns->capacity on node 5
2) ns->nr_running < ns->capacity on node 1
3) ns->load on node 5 >> ns->load on node 1
4) group/task_score on node 5 >> group/task_score on node 1
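
Roughly, as a standalone sketch (not actual kernel code; the struct
is only loosely modeled on the numa_stats in kernel/sched/fair.c,
and the factor of 2 standing in for ">>" is a guess):

#include <stdbool.h>

/* Illustrative stand-in for the scheduler's per-node numa_stats. */
struct numa_stats_sketch {
	unsigned long nr_running;	/* runnable tasks on the node */
	unsigned long capacity;		/* ns->capacity analogue */
	unsigned long load;		/* ns->load analogue */
	long score;			/* group/task NUMA score */
};

/* Conditions 1-4 above: src is overloaded, dst has room, and both
 * load and NUMA score are much higher on src than on dst. */
static bool looks_untangleable(const struct numa_stats_sketch *src,
			       const struct numa_stats_sketch *dst)
{
	return src->nr_running > src->capacity &&	/* 1 */
	       dst->nr_running < dst->capacity &&	/* 2 */
	       src->load > 2 * dst->load &&		/* 3 */
	       src->score > 2 * dst->score;		/* 4 */
}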

A few quick things I can see are:

Node 5 is overloaded by a ratio of (16+14)/18, or about 1.7

Node 5 has about a 4.5x higher group/task_score than node 1

Node 5 has about a 7.5x higher load than node 1

Maybe task_numa_compare() could take the load into account not just
to prohibit moves between nodes, but to actively encourage them when
the load difference significantly outweighs the difference in NUMA
score between the nodes?
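
Purely as a sketch of that idea (the names and NUMA_BIAS_SCALE are
made up; the real task_numa_compare() works on long "imp" values
computed from the group/task weights):

#define NUMA_BIAS_SCALE 1024

/*
 * Bias the NUMA score improvement of a move by the load imbalance
 * between the nodes, so a big enough load difference (7.5x in the
 * example above) can win even when the destination's NUMA score is
 * somewhat worse.
 */
static long load_biased_imp(long imp,			/* dst score - src score */
			    unsigned long src_load,
			    unsigned long dst_load)
{
	if (dst_load == 0)
		dst_load = 1;		/* idle destination node */
	if (src_load <= dst_load)
		return imp;		/* no imbalance, score decides */

	return imp + (long)((src_load - dst_load) *
			    NUMA_BIAS_SCALE / dst_load);
}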

Would it make sense to compare these things?

score(node5)    score(node1)
------------ vs ------------
load(node5)     load(node1)

Maybe only if one node is overloaded?
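
In code, that comparison can avoid the division altogether by cross
multiplying (again just a sketch, with illustrative names):

/* True if node 1 offers a better score-per-load than node 5.
 * score5/load5 < score1/load1  <=>  score5*load1 < score1*load5 */
static bool node1_wins(long score5, unsigned long load5,
		       long score1, unsigned long load1,
		       bool node5_overloaded)
{
	if (!node5_overloaded)
		return false;

	return score5 * (long)load1 < score1 * (long)load5;
}

Plugging in the ratios above (score 4.5x and load 7.5x higher on
node 5) gives 4.5 vs 7.5, so this comparison would indeed push tasks
toward node 1.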

Do you guys have any other ideas?



