Message-ID: <CEDBC792-DE5A-42CB-AA31-40C039470BD0@nvidia.com>
Date: Fri, 15 Feb 2019 14:20:37 -0800
From: Zi Yan <ziy@...dia.com>
To: <lsf-pc@...ts.linux-foundation.org>
CC: <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
Michal Hocko <mhocko@...e.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Matthew Wilcox <willy@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
"Kirill A. Shutemov" <kirill@...temov.name>,
Hugh Dickins <hughd@...gle.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Anshuman Khandual <anshuman.khandual@....com>,
John Hubbard <jhubbard@...dia.com>,
Mark Hairgrove <mhairgrove@...dia.com>,
Nitin Gupta <nigupta@...dia.com>,
David Nellans <dnellans@...dia.com>
Subject: [LSF/MM TOPIC] Generating physically contiguous memory
The Problem
----
Large pages and physically contiguous memory are important to devices,
such as GPUs, FPGAs, NICs and RDMA controllers, because they can often
reduce address translation overheads and hence achieve better
performance when operating on large pages (2MB and beyond). The same can
be said of CPU performance, of course, but there is an important
difference: GPUs and high-throughput devices often take a more severe
performance hit, in the event of a TLB miss, as compared to a CPU,
because larger volume of in-flight work is stalled due to the TLB miss
and the induced page table walks. The effect is sufficiently large that
such devices *really* want highly reliable ways to allocate large pages
to minimize TLB misses and reduce the duration of page table walks.
Approaches that reserve memory at boot time (such as hugetlbfs) lack
flexibility and are a compromise that would be nice to avoid. THPs, in
general, seem to be the proper way to go, because they are transparent
to userspace and provide large pages, but they are not perfect yet. The
community is still working on them, since 1) THP size is limited by the
page allocation system and 2) THP creation requires a lot of effort
(e.g., memory compaction and page reclamation on the critical path of
page allocations).
Possible solutions
----
1. I recently posted an RFC [1] about actively generating physically
contiguous memory from in-use pages after page allocation. This RFC
moves pages around and makes them physically contiguous when possible.
It is different from existing approaches, since it does not rely on page
allocation. On the other hand, this approach is still hindered by
non-movable pages scattered across memory, a closely related but
orthogonal problem; one possible solution to it was recently proposed by
Mel Gorman [2].
2. THPs could be a solution, as they provide large pages. THP avoids
memory reservation at boot time, but to meet the needs of these
high-throughput accelerators, i.e., a lot of large pages, we need to
make large pages easier to produce, namely by increasing the success
rate of THP allocations and decreasing their overheads. Mel Gorman has
posted a related patchset [3].
It is also possible to generate THPs in the background: like what
khugepaged does right now, by periodically performing memory compaction
to lower the overall fragmentation level, or by maintaining pools of
THPs for future use. But these solutions still face the same problems.
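For reference, the background machinery mentioned above is already
tunable from userspace today. A sketch of the relevant knobs (the sysfs
and procfs paths below are the standard ones on current kernels; the
write requires root):

```shell
# System-wide THP policy: always / madvise / never
cat /sys/kernel/mm/transparent_hugepage/enabled

# khugepaged pacing: how aggressively the background thread collapses
# base pages into THPs
cat /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
cat /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

# Trigger a one-off full memory compaction run across all zones
echo 1 > /proc/sys/vm/compact_memory
```

Tuning these only shifts when the compaction work happens; it does not
remove the underlying fragmentation problem.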
3. A more restricted but more reliable way might be using libhugetlbfs.
It reserves memory dedicated to large page allocations, so less effort
is required to obtain large pages. It also supports page sizes larger
than 2MB, which further reduces address translation overheads. But AFAIK
device drivers are not able to directly grab large pages from
libhugetlbfs, which is something devices want.
4. Recently Matthew Wilcox mentioned that his XArray is going to support
arbitrarily sized pages [4], which would help maintain physically
contiguous ranges once they are created (e.g., by my RFC). Once my RFC
generates physically contiguous memory, XArrays would preserve the page
size and prevent reclaim/compaction from breaking the ranges apart.
Arbitrarily sized pages can still benefit devices when pages larger than
2MB become very difficult to get.
Feel free to provide your comments.
Thanks.
[1] https://lore.kernel.org/lkml/20190215220856.29749-1-zi.yan@sent.com/
[2]
https://lore.kernel.org/lkml/20181123114528.28802-1-mgorman@techsingularity.net/
[3]
https://lore.kernel.org/lkml/20190118175136.31341-1-mgorman@techsingularity.net/
[4]
https://lore.kernel.org/lkml/20190208042448.GB21860@bombadil.infradead.org/
--
Best Regards,
Yan Zi