Open Source and information security mailing list archives
Date: Wed, 13 Mar 2019 19:39:26 -0700
From: Zi Yan <ziy@...dia.com>
To: Anshuman Khandual <anshuman.khandual@....com>,
    Matthew Wilcox <willy@...radead.org>,
    Vlastimil Babka <vbabka@...e.cz>
CC: <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
    Dave Hansen <dave.hansen@...ux.intel.com>,
    Michal Hocko <mhocko@...nel.org>,
    "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
    Andrew Morton <akpm@...ux-foundation.org>,
    Mel Gorman <mgorman@...hsingularity.net>,
    John Hubbard <jhubbard@...dia.com>,
    Mark Hairgrove <mhairgrove@...dia.com>,
    Nitin Gupta <nigupta@...dia.com>,
    David Nellans <dnellans@...dia.com>
Subject: Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

On 19 Feb 2019, at 20:38, Anshuman Khandual wrote:

> On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
>> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>>> But the location of this temp page matters as well because you would like to
>>> saturate the inter node interface. It needs to be either of the nodes where
>>> the source or destination page belongs. Any other node would generate two
>>> internode copy process which is not what you intend here I guess.
>> That makes no sense. It should be allocated on the local node of the CPU
>> performing the copy. If the CPU is in node A, the destination is in node B
>> and the source is in node C, then you're doing 4k worth of reads from node C,
>> 4k worth of reads from node B, 4k worth of writes to node C followed by
>> 4k worth of writes to node B. Eventually the 4k of dirty cachelines on
>> node A will be written back from cache to the local memory (... or not,
>> if that page gets reused for some other purpose first).
>>
>> If you allocate the page on node B or node C, that's an extra 4k of writes
>> to be sent across the inter-node link.
>
> Thats right there will be an extra remote write. My assumption was that the CPU
> performing the copy belongs to either node B or node C.

I have some interesting throughput results for exchange per u64 and
exchange per 4KB page. What I discovered is that using a 4KB page as the
temporary storage for exchanging 2MB THPs does not improve the throughput.
On the contrary, once we exchange 2^4 = 16 or more THPs, exchanging per
4KB page has lower throughput than exchanging per u64. Please see the
results below.

The experiments were run on a two-socket machine with two Intel Xeon
E5-2640 v3 CPUs. All exchanges are done via the QPI link across the two
sockets.

Results
===

Throughput (GB/s) of exchanging 2 order-N 2MB pages between two NUMA nodes

| 2mb_page_order | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    |
| u64            | 5.31 | 5.58 | 5.89 | 5.69 | 8.97 | 9.51 | 9.21 | 9.50 | 9.57 | 9.62 |
| per_page       | 5.85 | 6.48 | 6.20 | 5.26 | 7.22 | 7.25 | 7.28 | 7.30 | 7.32 | 7.31 |

Normalized throughput (to per_page)

| 2mb_page_order | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    |
| u64            | 0.90 | 0.86 | 0.94 | 1.08 | 1.24 | 1.31 | 1.26 | 1.30 | 1.30 | 1.31 |

Exchange page code
===

For exchanging per u64, I use the following function:

	static void exchange_page(char *to, char *from)
	{
		u64 tmp;
		int i;

		/* Swap the two pages one 64-bit word at a time. */
		for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
			tmp = *((u64 *)(from + i));
			*((u64 *)(from + i)) = *((u64 *)(to + i));
			*((u64 *)(to + i)) = tmp;
		}
	}

For exchange per 4KB, I use the following function:

	static void exchange_page2(char *to, char *from)
	{
		int cpu = smp_processor_id();

		VM_BUG_ON(!in_atomic());

		/* No temporary page for this CPU yet: allocate one on its node. */
		if (!page_tmp[cpu]) {
			int nid = cpu_to_node(cpu);
			struct page *page_tmp_page = alloc_pages_node(nid, GFP_KERNEL, 0);

			/* Fall back to the per-u64 exchange if allocation fails. */
			if (!page_tmp_page) {
				exchange_page(to, from);
				return;
			}
			page_tmp[cpu] = kmap(page_tmp_page);
		}

		/* Three whole-page copies through the CPU-local temporary page. */
		copy_page(page_tmp[cpu], to);
		copy_page(to, from);
		copy_page(from, page_tmp[cpu]);
	}

where page_tmp is pre-allocated local to each CPU and alloc_pages_node()
above is for hot-added CPUs, which is not used in the tests.
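
For anyone who wants to check the two strategies outside the kernel, here is a minimal userspace sketch, not the code used for the numbers above. It assumes a 4KB page, and every sketch_* name is hypothetical. The scratch page is plain malloc() memory, so it only checks that both variants produce the same swapped contents; it says nothing about NUMA placement or throughput.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KB pages; hypothetical name */

/* Swap two pages one 64-bit word at a time, like exchange_page(). */
static void sketch_exchange_u64(unsigned char *to, unsigned char *from)
{
	size_t i;

	for (i = 0; i < SKETCH_PAGE_SIZE; i += sizeof(uint64_t)) {
		uint64_t a, b;

		/* memcpy avoids alignment and aliasing assumptions in userspace. */
		memcpy(&a, to + i, sizeof(a));
		memcpy(&b, from + i, sizeof(b));
		memcpy(to + i, &b, sizeof(b));
		memcpy(from + i, &a, sizeof(a));
	}
}

/* Swap two pages through a caller-provided scratch page, like exchange_page2(). */
static void sketch_exchange_page(unsigned char *to, unsigned char *from,
				 unsigned char *tmp)
{
	/* The three buffers must be distinct: memcpy() forbids overlap. */
	assert(to != from && tmp != to && tmp != from);

	memcpy(tmp, to, SKETCH_PAGE_SIZE);
	memcpy(to, from, SKETCH_PAGE_SIZE);
	memcpy(from, tmp, SKETCH_PAGE_SIZE);
}

/* Returns 0 if both variants swap the pages correctly, -1 otherwise. */
static int sketch_exchange_selftest(void)
{
	unsigned char *buf = malloc(5 * SKETCH_PAGE_SIZE);
	unsigned char *a1, *b1, *a2, *b2, *tmp;
	size_t i;
	int ok = 1;

	if (!buf)
		return -1;

	a1 = buf;
	b1 = buf + 1 * SKETCH_PAGE_SIZE;
	a2 = buf + 2 * SKETCH_PAGE_SIZE;
	b2 = buf + 3 * SKETCH_PAGE_SIZE;
	tmp = buf + 4 * SKETCH_PAGE_SIZE;

	/* Fill both pairs with the same two distinct byte patterns. */
	for (i = 0; i < SKETCH_PAGE_SIZE; i++) {
		a1[i] = a2[i] = (unsigned char)(i * 7 + 3);
		b1[i] = b2[i] = (unsigned char)(i * 13 + 5);
	}

	sketch_exchange_u64(a1, b1);
	sketch_exchange_page(a2, b2, tmp);

	/* After the swap, each "a" page must hold the old "b" pattern and vice versa. */
	for (i = 0; i < SKETCH_PAGE_SIZE; i++) {
		if (a1[i] != (unsigned char)(i * 13 + 5) ||
		    b1[i] != (unsigned char)(i * 7 + 3) ||
		    a2[i] != a1[i] || b2[i] != b1[i])
			ok = 0;
	}

	free(buf);
	return ok ? 0 : -1;
}
```

Calling sketch_exchange_selftest() from a small main() should return 0. To get anything resembling the measurements above, the buffers would have to be bound to different NUMA nodes and timed, which this sketch deliberately leaves out.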

The kernel is available at: https://gitlab.com/ziy/linux-contig-mem-rfc

To do a comparison, you can clone this repo:
https://gitlab.com/ziy/thp-migration-bench, then make, ./run_test.sh,
and ./get_results.sh using the kernel from above.

Let me know if I missed anything or did something wrong. Thanks.

--
Best Regards,
Yan Zi