Message-ID: <94D0CD8314A33A4D9D801C0FE68B40295A8CE70F@G9W0745.americas.hpqcorp.net>
Date: Sat, 2 May 2015 11:52:18 +0000
From: "Elliott, Robert (Server Storage)" <Elliott@...com>
To: Daniel J Blueman <daniel@...ascale.com>, nzimmer <nzimmer@....com>,
"Mel Gorman" <mgorman@...e.de>
CC: Pekka Enberg <penberg@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Dave Hansen <dave.hansen@...el.com>,
"Long, Wai Man" <waiman.long@...com>,
"Norton, Scott J" <scott.norton@...com>,
Linux-MM <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
'Steffen Persvold' <sp@...ascale.com>,
"Boaz Harrosh (boaz@...xistor.com)" <boaz@...xistor.com>,
"dan.j.williams@...el.com" <dan.j.williams@...el.com>,
"linux-nvdimm@...ts.01.org" <linux-nvdimm@...ts.01.org>
Subject: RE: [PATCH 0/13] Parallel struct page initialisation v4
> -----Original Message-----
> From: linux-kernel-owner@...r.kernel.org [mailto:linux-kernel-
> owner@...r.kernel.org] On Behalf Of Daniel J Blueman
> Sent: Thursday, April 30, 2015 11:10 AM
> Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4
...
> On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're
> seeing stock 4.0 boot in 7136s. This drops to 2159s, or a 70% reduction
> with this patchset. Non-temporal PMD init [1] drops this to 1045s.
>
> Nathan, what do you guys see with the non-temporal PMD patch [1]? Do
> add a sfence at the ende label if you manually patch.
>
...
> [1] https://lkml.org/lkml/2015/4/23/350
From that post:
> +loop_64:
> + decq %rcx
> + movnti %rax,(%rdi)
> + movnti %rax,8(%rdi)
> + movnti %rax,16(%rdi)
> + movnti %rax,24(%rdi)
> + movnti %rax,32(%rdi)
> + movnti %rax,40(%rdi)
> + movnti %rax,48(%rdi)
> + movnti %rax,56(%rdi)
> + leaq 64(%rdi),%rdi
> + jnz loop_64
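For anyone who would rather experiment from C than patch the assembly, the quoted loop maps roughly onto the SSE2 intrinsic _mm_stream_si64, which compilers emit as movnti. This is only a minimal sketch, assuming x86-64 and a compiler that ships <emmintrin.h>. The name fill_nt_64 is made up for illustration and is not part of the patch, and other architectures fall back to ordinary stores. The trailing _mm_sfence() plays the role of the sfence requested at the ende label.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>  /* _mm_stream_si64 (movnti) and _mm_sfence */
#endif

/*
 * Hypothetical helper, not taken from the patch: fill `lines` blocks of
 * 64 bytes starting at dst with `value`, using eight 8-byte non-temporal
 * stores per block, as loop_64 does.
 */
void fill_nt_64(void *dst, unsigned long long value, size_t lines)
{
    unsigned long long *p = dst;

    /* Keep the 8-byte stores naturally aligned. */
    assert(((uintptr_t)dst & 7) == 0);

    while (lines--) {
        for (int i = 0; i < 8; i++) {
#if defined(__x86_64__)
            _mm_stream_si64((long long *)&p[i], (long long)value);
#else
            p[i] = value;  /* fallback: ordinary store */
#endif
        }
        p += 8;  /* advance 64 bytes, like leaq 64(%rdi),%rdi */
    }

#if defined(__x86_64__)
    /* Non-temporal stores are weakly ordered; fence before others read. */
    _mm_sfence();
#endif
}
```

Calling fill_nt_64(buf, 0, 64) on an 8-byte-aligned buffer would zero 4 KiB. Whether the compiler keeps the inner loop unrolled the way the hand-written assembly does is up to the optimizer.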
x86 also has wider non-temporal store instructions, which should be even
more efficient, depending on the CPU features:
 * movnti          8 bytes   SSE2
 * movntdq  %xmm   16 bytes  SSE2
 * vmovntdq %ymm   32 bytes  AVX
 * vmovntdq %zmm   64 bytes  AVX-512 (forthcoming)
The last will transfer a full cache line at a time.
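To put the widths above in perspective, here is a small sketch of how many store instructions each one needs per cache line and per page. It assumes a 64-byte cache line and a 4 KiB page; stores_needed and the two constants are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u    /* assumed x86 cache line size */
#define PAGE_BYTES       4096u  /* assumed 4 KiB base page */

/*
 * Hypothetical helper: number of stores of width store_bytes needed to
 * cover region_bytes, assuming the region is a whole multiple of the width.
 */
size_t stores_needed(size_t region_bytes, size_t store_bytes)
{
    assert(store_bytes != 0 && region_bytes % store_bytes == 0);
    return region_bytes / store_bytes;
}
```

With these assumptions, stores_needed(PAGE_BYTES, 8) is 512 for movnti, while stores_needed(PAGE_BYTES, 64) is 64 for the 64-byte form. Fewer instructions per page does not by itself guarantee proportionally higher throughput.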
For NVDIMMs, the nd pmem driver also needs memcpy functions that
use these non-temporal instructions, both for performance and reliability.
We also need to speed up __clear_page and copy_user_enhanced_string so
userspace accesses through the page cache can keep up.
https://lkml.org/lkml/2015/4/2/453 is one of the threads on that topic.
Some results I've gotten there under different cache attributes
(in terms of 4 KiB IOPS):

16-byte movntdq:
  UC write iops=697872    (697.872 K) (0.697872 M)
  WB write iops=9745800   (9745.8 K)  (9.7458 M)
  WC write iops=9801800   (9801.8 K)  (9.8018 M)
  WT write iops=9812400   (9812.4 K)  (9.8124 M)

32-byte vmovntdq:
  UC write iops=1274400   (1274.4 K)  (1.2744 M)
  WB write iops=10259000  (10259 K)   (10.259 M)
  WC write iops=10286000  (10286 K)   (10.286 M)
  WT write iops=10294000  (10294 K)   (10.294 M)
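For readers who think in bandwidth rather than IOPS, the figures above convert as sketched below, assuming each I/O is exactly 4096 bytes. The helper names are invented for this illustration. Under that assumption the 16-byte UC case works out to about 2.86 GB/s and the 16-byte WB case to about 39.9 GB/s (decimal gigabytes).

```c
#include <assert.h>
#include <stdint.h>

#define IO_BYTES 4096u  /* assumed size of one 4 KiB I/O */

/* Hypothetical helper: convert an IOPS figure to bytes per second. */
uint64_t iops_to_bytes_per_sec(uint64_t iops, uint64_t io_bytes)
{
    assert(io_bytes != 0);
    return iops * io_bytes;
}

/* Hypothetical helper: ratio between two IOPS figures. */
double speedup(uint64_t wide_iops, uint64_t narrow_iops)
{
    assert(narrow_iops != 0);
    return (double)wide_iops / (double)narrow_iops;
}
```

Applying speedup() to the two UC rows gives a ratio a little above 1.8 in favor of the 32-byte stores, while the WB, WC and WT rows differ by only about 5%.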
---
Robert Elliott, HP Server Storage