Message-ID: <DF4PR8401MB11806B9D2A7FE04B1F5ECBF8AB540@DF4PR8401MB1180.NAMPRD84.PROD.OUTLOOK.COM>
Date: Wed, 25 Jul 2018 05:02:46 +0000
From: "Elliott, Robert (Persistent Memory)" <elliott@....com>
To: Cannon Matthews <cannonmatthews@...gle.com>,
Michal Hocko <mhocko@...nel.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
CC: "linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Andres Lagar-Cavilla <andreslc@...gle.com>,
Salman Qazi <sqazi@...gle.com>, Paul Turner <pjt@...gle.com>,
David Matlack <dmatlack@...gle.com>,
Peter Feiner <pfeiner@...gle.com>,
Alain Trinh <nullptr@...gle.com>
Subject: RE: [PATCH v2] RFC: clear 1G pages with streaming stores on x86
> -----Original Message-----
> From: linux-kernel-owner@...r.kernel.org <linux-kernel-owner@...r.kernel.org> On Behalf Of Cannon Matthews
> Sent: Tuesday, July 24, 2018 9:37 PM
> Subject: Re: [PATCH v2] RFC: clear 1G pages with streaming stores on x86
>
> Reimplement clear_gigantic_page() to clear gigantic pages using the
> non-temporal streaming store instructions that bypass the cache
> (movnti), since an entire 1GiB region will not fit in the cache
> anyway.
>
> Doing an mlock() on a 512GiB 1G-hugetlb region previously would take
> on average 134 seconds, about 260ms/GiB which is quite slow. Using
> `movnti` and optimizing the control flow over the constituent small
> pages, this can be improved roughly by a factor of 3-4x, with the
> 512GiB mlock() taking only 34 seconds on average, or 67ms/GiB.
...
> - Are there any obvious pitfalls or caveats that have not been
> considered?
Note that Kirill attempted something like this in 2012 - see
https://www.spinics.net/lists/linux-mm/msg40575.html
...
> +++ b/arch/x86/lib/clear_gigantic_page.c
> @@ -0,0 +1,29 @@
> +#include <asm/page.h>
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +#define PAGES_BETWEEN_RESCHED 64
> +void clear_gigantic_page(struct page *page,
> + unsigned long addr,
The previous attempt used cacheable stores in the page containing
addr to prevent an otherwise inevitable cache miss after the clearing
completes. This function doesn't use addr at all; see the sketch
after the end of this function for one way to honor it.
> + unsigned int pages_per_huge_page)
> +{
> + int i;
> + void *dest = page_to_virt(page);
> + int resched_count = 0;
> +
> + BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> + BUG_ON(!dest);
Are those really possible conditions? Is there a safer fallback
than crashing the whole kernel?
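If either condition is reachable, a warn-and-fall-back shape would be
kinder than a panic. In place of the two BUG_ON()s, something like
this (illustrative only, untested):

	if (WARN_ON_ONCE(!dest))
		return;
	if (WARN_ON_ONCE(pages_per_huge_page % PAGES_BETWEEN_RESCHED)) {
		/* slow but safe: clear one subpage at a time */
		for (i = 0; i < pages_per_huge_page; i++) {
			clear_page(dest + i * PAGE_SIZE);
			cond_resched();
		}
		return;
	}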
> +
> + for (i = 0; i < pages_per_huge_page; i += PAGES_BETWEEN_RESCHED) {
> + __clear_page_nt(dest + (i * PAGE_SIZE),
> + PAGES_BETWEEN_RESCHED * PAGE_SIZE);
> + resched_count += cond_resched();
> + }
> + /* __clear_page_nt requrires and `sfence` barrier. */
requires an
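Returning to the addr point above: one way to honor it is to clear
the subpage containing addr with ordinary cacheable stores and
everything else with the non-temporal path. A sketch (untested,
invented function name, not from either patch):

static void clear_gigantic_page_keep_hot(struct page *page,
					 unsigned long addr,
					 unsigned int pages_per_huge_page)
{
	void *dest = page_to_virt(page);
	/* index of the subpage the faulting address falls in */
	unsigned int target = (addr & ((unsigned long)pages_per_huge_page *
				       PAGE_SIZE - 1)) >> PAGE_SHIFT;
	unsigned int i;

	for (i = 0; i < pages_per_huge_page; i++) {
		if (i == target)
			clear_page(dest + i * PAGE_SIZE); /* cacheable */
		else
			__clear_page_nt(dest + i * PAGE_SIZE, PAGE_SIZE);
		if ((i % PAGES_BETWEEN_RESCHED) == 0)
			cond_resched();
	}
	wmb();	/* sfence for the non-temporal stores */
}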
...
> diff --git a/arch/x86/lib/clear_page_64.S
...
> +/*
> + * Zero memory using non temporal stores, bypassing the cache.
> + * Requires an `sfence` (wmb()) afterwards.
> + * %rdi - destination.
> + * %rsi - page size. Must be 64 bit aligned.
> +*/
> +ENTRY(__clear_page_nt)
> + leaq (%rdi,%rsi), %rdx
> + xorl %eax, %eax
> + .p2align 4,,10
> + .p2align 3
> +.L2:
> + movnti %rax, (%rdi)
> + addq $8, %rdi
Also consider using the AVX vmovntdq instruction (if available); its
AVX-512 form does 64-byte (cache line) sized stores from zmm
registers. There's a hefty context-switching overhead to save and
restore the wider register state (e.g., 304 clocks), but it might be
worthwhile for 1 GiB (which is 16,777,216 cache lines).
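In the kernel that means bracketing the loop with kernel_fpu_begin()/
kernel_fpu_end() and gating on the CPU feature. A rough sketch
(untested, invented function name):

#include <linux/kernel.h>
#include <asm/fpu/api.h>
#include <asm/cpufeature.h>

/*
 * Zero 'bytes' at 'dest' with 64-byte non-temporal AVX-512 stores.
 * Assumes dest is 64-byte aligned and bytes is a multiple of 64.
 */
static void clear_region_nt_avx512(void *dest, unsigned long bytes)
{
	unsigned long i;

	if (!boot_cpu_has(X86_FEATURE_AVX512F))
		return;		/* caller must use the movnti path */

	kernel_fpu_begin();	/* save user SIMD state */
	asm volatile("vpxorq %%zmm0, %%zmm0, %%zmm0" ::: "memory");
	for (i = 0; i < bytes; i += 64)
		asm volatile("vmovntdq %%zmm0, (%0)"
			     : : "r" (dest + i) : "memory");
	asm volatile("sfence" ::: "memory");	/* order the NT stores */
	kernel_fpu_end();
}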
glibc memcpy() switches to non-temporal stores for transfers larger
than 75% of the L3 cache size divided by the number of cores (for
example, with a 32 MiB L3 shared by 16 cores, anything over 1.5 MiB).
Last I tried, it was still selecting "rep stosb" for large memset()s,
although it has an AVX-512 function available.
Even with that, one CPU core won't saturate the memory bus; multiple
CPU cores (preferably on the same NUMA node as the memory) need to
share the work.
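As a rough illustration only (not part of this patch; untested, names
invented), the work could be fanned out with per-CPU work items on
the memory's node, reusing the patch's __clear_page_nt() per chunk:

#include <linux/workqueue.h>
#include <linux/cpumask.h>
#include <linux/topology.h>

#define NR_CLEAR_WORKERS 4	/* illustrative */

struct clear_chunk {
	struct work_struct work;
	void *dest;
	unsigned long bytes;
};

static void clear_chunk_fn(struct work_struct *work)
{
	struct clear_chunk *c = container_of(work, struct clear_chunk, work);

	__clear_page_nt(c->dest, c->bytes);	/* movnti loop from the patch */
	wmb();					/* sfence for this chunk */
}

/* Split [dest, dest + bytes) across CPUs on the given NUMA node. */
static void clear_gigantic_page_parallel(void *dest, unsigned long bytes,
					 int node)
{
	const struct cpumask *mask = cpumask_of_node(node);
	struct clear_chunk c[NR_CLEAR_WORKERS];
	int nr = min_t(int, NR_CLEAR_WORKERS, cpumask_weight(mask));
	unsigned long chunk;
	int cpu, i = 0;

	if (!nr) {		/* memory-only node: clear it here */
		__clear_page_nt(dest, bytes);
		wmb();
		return;
	}
	chunk = bytes / nr;	/* 1 GiB / 4 = 256 MiB, still page aligned */

	for_each_cpu(cpu, mask) {
		if (i == nr)
			break;
		INIT_WORK(&c[i].work, clear_chunk_fn);
		c[i].dest = dest + i * chunk;
		/* last worker also takes any remainder */
		c[i].bytes = (i == nr - 1) ? bytes - i * chunk : chunk;
		schedule_work_on(cpu, &c[i].work);
		i++;
	}
	while (i--)
		flush_work(&c[i].work);	/* wait for all chunks */
}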
---
Robert Elliott, HPE Persistent Memory