linux-kernel - Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86

Message-ID: <bb654219-8df6-60a7-3cf5-f886ef5ca565@redhat.com>
Date:   Mon, 5 Oct 2020 19:27:47 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Roman Gushchin <guro@...com>, Zi Yan <ziy@...dia.com>
Cc:     Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Rik van Riel <riel@...riel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Yang Shi <shy828301@...il.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        William Kucharski <william.kucharski@...cle.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        John Hubbard <jhubbard@...dia.com>,
        David Nellans <dnellans@...dia.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

On 05.10.20 19:16, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>
>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>
>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>> have better control of their applications.
>>>>>>
>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>
>>>>>
>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>> to support. I can understand that there are some use cases that might
>>>>> benefit from it, especially:
>>>>
>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>> that can transparently split under memory pressure is a useful
>>>> functionality. I cannot really judge how complex that would be
>>>
>>> Right, but that's then something different than serving (scarce,
>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>> multiple 2MB -> multiple single pages), for example, when having to
>>> migrate such a gigantic page. But that's very different from our
>>> existing gigantic page code as far as I can tell.
>>
>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>> which needs a section size increase. In addition, unmovable pages cannot
>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>> it than from ZONE_NORMAL.
> 
> s/higher chances/non-zero chances

Well, the longer the system runs (and consumes a significant amount of
available main memory), the less likely it is.
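
For reference, the order arithmetic behind the MAX_ORDER remark quoted
above can be sketched as follows. This is only an illustration, assuming
4 KiB base pages and the default, exclusive MAX_ORDER of 11 used by
kernels of this era; the helper names are made up for the example.

```python
# Illustrative arithmetic only; the constants are assumptions for x86_64.
PAGE_SHIFT = 12         # 4 KiB base pages
PMD_SHIFT = 21          # 2 MiB (PMD-level huge page)
PUD_SHIFT = 30          # 1 GiB (PUD-level huge page)
LEGACY_MAX_ORDER = 11   # assumed default; exclusive bound, so max order is 10


def order_for(shift, page_shift=PAGE_SHIFT):
    """Buddy order of a block of 2**shift bytes."""
    return shift - page_shift


def largest_buddy_block_bytes(max_order=LEGACY_MAX_ORDER, page_shift=PAGE_SHIFT):
    """Largest block the buddy allocator hands out for an exclusive max_order."""
    return (1 << (max_order - 1)) << page_shift


print("2 MiB page order:", order_for(PMD_SHIFT))
print("1 GiB page order:", order_for(PUD_SHIFT))
print("largest buddy block (MiB):", largest_buddy_block_bytes() >> 20)
```

Under these assumptions a 1 GiB page is an order-18 allocation, far above
the order-10 (4 MiB) blocks the buddy allocator tracks by default.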

> 
> Currently we have nothing that prevents the fragmentation of the memory
> with unmovable pages on the 1GB scale. It means that in a common case
> it's highly unlikely to find a contiguous GB without any unmovable page.
> As of now, CMA seems to be the only working option.
> 

And I completely dislike the use of CMA in this context (for example,
allocating via CMA and freeing via the buddy by patching CMA when
splitting up PUDs ...).

> However it seems there are other use cases for the allocation of contiguous
> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> 1GB pages can reduce the fragmentation of the direct mapping.

Yes, see RFC v1 where I already cced Mike.

> 
> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
> E.g. something like a second level of pageblocks. That would allow grouping
> all unmovable memory in a few 1GB blocks and have more 1GB regions available for
> gigantic THPs and other use cases. I'm looking now into how it can be done.

Anything bigger than sections is somewhat problematic: you have to track
that data somewhere. It cannot be the section (in contrast to pageblocks).
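
To make the size relationship concrete, here is a rough sketch. It
assumes the usual x86_64 values as I understand them (128 MiB sparsemem
sections, 2 MiB pageblocks); the function names are hypothetical and
exist only for this example.

```python
# Illustrative only: assumed x86_64 values, hypothetical helper names.
PAGE_SHIFT = 12          # 4 KiB base pages
PAGEBLOCK_ORDER = 9      # assumed: 2 MiB pageblocks
SECTION_SIZE_BITS = 27   # assumed: 128 MiB sparsemem sections
PUD_SHIFT = 30           # 1 GiB


def units_per_pud(unit_shift):
    """How many units of 2**unit_shift bytes make up one 1 GiB block."""
    return 1 << (PUD_SHIFT - unit_shift)


def section_nr(phys_addr):
    """Index of the memory section containing phys_addr."""
    return phys_addr >> SECTION_SIZE_BITS


def sections_spanned_by_pud_block(block_nr):
    """Section indexes covered by the block_nr-th aligned 1 GiB block."""
    start = block_nr << PUD_SHIFT
    end = ((block_nr + 1) << PUD_SHIFT) - 1
    return list(range(section_nr(start), section_nr(end) + 1))


print("pageblocks per 1 GiB:", units_per_pud(PAGE_SHIFT + PAGEBLOCK_ORDER))
print("sections per 1 GiB:", units_per_pud(SECTION_SIZE_BITS))
print("sections under 1 GiB block #1:", sections_spanned_by_pud_block(1))
```

A pageblock is smaller than a section, so its state can sit in
per-section metadata. A 1 GiB block covers several sections under these
assumptions, so any per-block state would need a home outside any single
section.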

> If anybody has any ideas here, I'd appreciate it a lot.

I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
somewhat mimics what CMA does (when sized reasonably), works well with
memory hot(un)plug, and is immune to misconfiguration. Within such a
zone, we can try to optimize the placement of larger blocks.

-- 
Thanks,

David / dhildenb