Message-ID: <abb85115-5790-c292-f27e-3c13b105230d@nvidia.com>
Date: Mon, 13 Nov 2023 09:52:29 -0500
From: John Hubbard <jhubbard@...dia.com>
To: Ryan Roberts <ryan.roberts@....com>,
Matthew Wilcox <willy@...radead.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Yin Fengwei <fengwei.yin@...el.com>,
David Hildenbrand <david@...hat.com>,
Yu Zhao <yuzhao@...gle.com>,
Catalin Marinas <catalin.marinas@....com>,
Anshuman Khandual <anshuman.khandual@....com>,
Yang Shi <shy828301@...il.com>,
"Huang, Ying" <ying.huang@...el.com>, Zi Yan <ziy@...dia.com>,
Luis Chamberlain <mcgrof@...nel.org>,
Itaru Kitayama <itaru.kitayama@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Hugh Dickins <hughd@...gle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
On 11/13/23 2:19 AM, Ryan Roberts wrote:
> On 13/11/2023 05:18, Matthew Wilcox wrote:
>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>> I've done some initial performance testing of this patchset on an arm64
>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>> patches in Ryan's git tree (he has conveniently combined everything
>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>> some memory-intensive workloads. Many test runs, conducted independently
>>> by different engineers and on different machines, have convinced me and
>>> my colleagues that this is an accurate result.
>>>
>>> In order to achieve that result, we used the git tree in [1] with the
>>> following settings:
>>>
>>> echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>> echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>
>>> This was on an aarch64 machine configured to use a 64KB base page size.
>>> That configuration means that the PMD size is 512MB, which is of course
>>> too large for practical use as a pure PMD-THP. However, with these
>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>> coverage, while still getting pages that are small enough to be
>>> effectively usable.
>>
>> That is quite remarkable!
>
> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
>
>>
>> My hope is to abolish the 64kB page size configuration. ie instead of
We've found that a 64KB base page size provides better performance than
a 4KB base size for HPC and AI workloads, at least on these kinds of
servers. In fact, the 4KB config is considered odd here and I'd have to
look around to find one. It's mostly a TLB coverage issue because,
again, the workloads typically have a very large memory footprint.
So even though abolishing the 64KB config would be nice from a software
point of view, there's a real need for it.
>> using the mixture of page sizes that you currently are -- 64k and
>> 1M (right? Order-0, and order-4)
>
> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> intuitively you would expect the order to remain constant, but it doesn't).
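Writing out the arithmetic makes the non-constant order less surprising
(a trivial shell sketch, just restating your sizes above):

    # contpte block size = base page size * 2^order
    echo "$(( 4 * (1 << 4) ))K"    #  4K base, order-4 ->   64K
    echo "$(( 16 * (1 << 7) ))K"   # 16K base, order-7 -> 2048K (2M)
    echo "$(( 64 * (1 << 5) ))K"   # 64K base, order-5 -> 2048K (2M)
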
>
> The "recommend" setting above will actually enable order-3 as well even though
> there is no HW benefit to this. So the full set of available memory sizes here is:
>
> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
Yes, and to provide some further details about the test runs, I went
so far as to test individual anon_orders (for example,
anon_orders=0x20), in order to isolate behavior and see what's really
going on.
On this hardware, anything with a 2MB page size (which corresponds to
anon_orders=0x20, as I recall) or larger gets the 10x boost. It's
an interesting on/off behavior. This particular server design and
workload combination really prefers 2MB pages, even if they are
held together with contpte instead of a real PMD entry.
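For reference, the per-order isolation recipe was roughly the following
(a sketch against the anon_orders interface in the tree at [1]; the
exact interface may of course change in later revisions):

    # Enable only order-5, i.e. 2M folios with a 64K base page
    order=5
    printf '0x%x\n' $(( 1 << order ))   # -> 0x20
    echo always >/sys/kernel/mm/transparent_hugepage/enabled
    printf '0x%x' $(( 1 << order )) >/sys/kernel/mm/transparent_hugepage/anon_orders
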
>
>> , that 4k, 64k and 2MB (order-0,
>> order-4 and order-9) will provide better performance.
>>
>> Have you run any experiments with a 4kB page size?
>
> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> to get to a world where we universally deal in variable sized chunks of memory,
> aligned on 4K boundaries.
>
> In my experience though, there are still some performance benefits to 64K base
> page vs 4K+contpte; the page tables are more cache efficient for the former case
> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> latter. In practice the HW will still only read 8 bytes in the latter but that's
> taking up a full cache line vs the former where a single cache line stores 8x
> 64K entries.
>
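Right, and those numbers are easy to sanity-check (same arithmetic as
yours, assuming 8-byte PTEs and 64-byte cache lines):

    # PTE bytes needed to describe 64K of memory
    echo $(( (64 / 64) * 8 ))   # 64K base:   8 bytes
    echo $(( (64 /  4) * 8 ))   #  4K base: 128 bytes
    # and coverage per 64B cache line (8 PTEs)
    echo "$(( 8 * 64 ))K"       # 64K base: 512K per cache line
    echo "$(( 8 * 4 ))K"        #  4K base:  32K per cache line
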
> Thanks,
> Ryan
>
thanks,
--
John Hubbard
NVIDIA