linux-kernel - NUMA, migrate/N, and tuned-adm

Message-ID: <CAKz8sYWmmmdwKTKUGdoyaMD_v9_7yQydG+UNAb2r-KoO4L+6Ng@mail.gmail.com>
Date:	Tue, 17 Dec 2013 10:10:42 -0800
From:	David Timothy Strauss <david@...idstrauss.net>
To:	Mel Gorman <mgorman@...e.de>, Ingo Molnar <mingo@...nel.org>,
	Rik van Riel <riel@...hat.com>
Cc:	linux-kernel@...r.kernel.org
Subject: NUMA, migrate/N, and tuned-adm

Our system gets storms of migrate/N (and sometimes kswapd) kernel tasks,
based on what we've seen in top [1]. This issue is unique to our
bare-metal application servers; we run hundreds of application servers
on Xen virtual hardware with the same kernel and without this issue. We
also see no issues with identical kernels and hardware servers while
running databases.

System specs:
 * Fedora 19 with the 3.11.10-200.fc19.x86_64 kernel (just the stock RPM)
 * Bare-metal servers with 128GB RAM split between two NUMA regions,
each region with one hex-core processor
 * More than 700 processes, a couple hundred of which are active
fairly frequently. The systems were at 7000 processes, but we've
reduced the count while we investigate this issue.
 * Many of the processes are short-lived. The long-lived ones
experience spikes in CPU and memory usage while processing requests.

Here's what we've tried, to no avail:
 * tuned-adm with the latency-performance and virtual-host profiles;
this places the system on the deadline I/O scheduler, but the problem
occurred on the default one too
 * kernel.sched_migration_cost_ns=5000000 (which tuned will do for
those profiles in v3.3/Fedora 20)
 * numad to balance between regions
 * Global use of sched_relax_domain_level=1 and sched_relax_domain_level=2
 * Splitting the system with cpuset into management tasks (6 virtual
cores) and workload tasks (18 virtual cores) with
sched_relax_domain_level=2. This is based on recommendations for NUMA
systems in the cpuset man page.
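
For reference, the management/workload split above comes down to writing
CPU lists into each cpuset's cpus file. The following is only a minimal
sketch: it assumes the 24 logical CPUs are numbered 0-23 and that the
first six go to management tasks, neither of which is stated above. The
helper names to_cpu_list and split_cpus are hypothetical.

```python
def to_cpu_list(cpus):
    """Format CPU numbers in the list format cpuset files accept, e.g. "0-5,8"."""
    cpus = sorted(set(cpus))
    ranges = []
    start = prev = None
    for cpu in cpus:
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current contiguous run
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def split_cpus(total, management_count):
    """Split logical CPUs 0..total-1 into (management, workload) CPU lists."""
    all_cpus = list(range(total))
    return (to_cpu_list(all_cpus[:management_count]),
            to_cpu_list(all_cpus[management_count:]))


# Assumed layout: 24 logical CPUs, the first 6 reserved for management tasks.
management, workload = split_cpus(24, 6)
print(management, workload)
```

On a real two-node machine the CPU numbers belonging to each NUMA region
should be taken from the actual topology rather than assumed contiguous.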

Here's what we've used for analysis:
 * powertop
 * top/htop
 * perf record -a -g
 * SystemTap with code to print out migrations occurring
 * numatop
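
As an illustration of the top-based observation, here is a rough sketch
of how the CPU share of the migration and kswapd kernel threads could be
totalled from batch-mode top output (top -b -n 1). It assumes the default
column layout with a header line containing PID, %CPU and COMMAND, and it
assumes the threads show up as migration/N (or migrate/N) and kswapdN.
The function name kthread_cpu is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: thread names look like "migration/3", "migrate/3" or "kswapd0".
KTHREAD = re.compile(r"^(migrat\w*)/\d+$|^(kswapd)\d+$")


def kthread_cpu(top_output):
    """Sum %CPU per kernel-thread family from `top -b -n 1` style text."""
    totals = defaultdict(float)
    cpu_col = cmd_col = None
    for line in top_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Locate the column positions from the header line.
        if fields[0] == "PID" and "%CPU" in fields and "COMMAND" in fields:
            cpu_col = fields.index("%CPU")
            cmd_col = fields.index("COMMAND")
            continue
        # Skip the summary lines before the header and any short lines.
        if cpu_col is None or len(fields) <= cmd_col:
            continue
        match = KTHREAD.match(fields[cmd_col])
        if match:
            family = match.group(1) or match.group(2)
            totals[family] += float(fields[cpu_col])
    return dict(totals)
```

Feeding it the captured output of top -b -n 1 at regular intervals would
give a coarse time series of how much CPU these threads consume; it does
not replace the perf or SystemTap data.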

[1] https://gist.github.com/davidstrauss/3ff0b29c4d3766bedd49

David Strauss
Pantheon Systems
Fedora Server Working Group

P.S. Josh Boyer (jwb) referred me here from the Fedora kernel side.