Message-ID: <5dd4e07b-d2cf-63f2-fc0a-9b371b469a44@oracle.com>
Date: Mon, 16 Aug 2021 17:17:50 -0700
From: Mike Kravetz <mike.kravetz@...cle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
David Hildenbrand <david@...hat.com>,
Michal Hocko <mhocko@...e.com>,
Oscar Salvador <osalvador@...e.de>, Zi Yan <ziy@...dia.com>,
Muchun Song <songmuchun@...edance.com>,
Naoya Horiguchi <naoya.horiguchi@...ux.dev>,
David Rientjes <rientjes@...gle.com>
Subject: Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
On 8/16/21 4:23 PM, Andrew Morton wrote:
> On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@...cle.com> wrote:
>
>> This is a resend of PATCHes sent here [4]. There was some discussion
>> and interest when the RFC [5] was sent, but little after that. The
>> resend is just a rebase of [4] to next-20210816 with a few typos in
>> commit messages fixed.
>>
>> Original Cover Letter
>> ---------------------
>> The concurrent use of multiple hugetlb page sizes on a single system
>> is becoming more common. One of the reasons is better TLB support for
>> gigantic page sizes on x86 hardware. In addition, hugetlb pages are
>> being used to back VMs in hosting environments.
>>
>> When using hugetlb pages to back VMs in such environments, it is
>> sometimes desirable to preallocate hugetlb pools. This avoids the delay
>> and uncertainty of allocating hugetlb pages at VM startup. In addition,
>> preallocating huge pages minimizes the issue of memory fragmentation that
>> increases the longer the system is up and running.
>>
>> In such environments, a combination of larger and smaller hugetlb pages
>> is preallocated in anticipation of backing VMs of various sizes. Over
>> time, the preallocated pool of smaller hugetlb pages may become
>> depleted while larger hugetlb pages still remain. In such situations,
>> it may be desirable to convert larger hugetlb pages to smaller hugetlb
>> pages.
>>
>> Converting larger to smaller hugetlb pages can be accomplished today by
>> first freeing the larger page to the buddy allocator and then allocating
>> the smaller pages. However, there are two issues with this approach:
>> 1) This process can take quite some time, especially if allocation of
>> the smaller pages is not immediate and requires migration/compaction.
>> 2) There is no guarantee that the total size of smaller pages allocated
>> will match the size of the larger page which was freed. This is
>> because the area freed by the larger page could quickly be
>> fragmented.
>>
>> To address these issues, introduce the concept of hugetlb page demotion.
>> Demotion provides a means of 'in place' splitting a hugetlb page to
>> pages of a smaller size. For example, on x86 one 1G page can be
>> demoted to 512 2M pages. Page demotion is controlled via sysfs files.
>> - demote_size Read only target page size for demotion
>
> Should this be "write only"? If not, I'm confused.
>
> If "yes" then "write only" would be a misnomer - clearly this file is
> readable (looks at demote_size_show()).
>
It is read only and is there mostly as information for the user. When
they demote a page, this is the size to which the page will be demoted.
For example,
    # pwd
    /sys/kernel/mm/hugepages/hugepages-1048576kB
    # cat demote_size
    2048kB
    # pwd
    /sys/kernel/mm/hugepages/hugepages-2048kB
    # cat demote_size
    4kB
The "demote size" is not user configurable, although Oscar did bring
that up previously. I did not directly address this in the RFC. My bad.
However, I do not like the idea of making demote_size
writable/selectable. My concern would be someone changing the value and
not resetting it. It certainly is something that can be done with minor
code changes.
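
To see the demotion target of every pool at once, the demote_size files can be read in a loop. This is only a minimal sketch: show_demote_sizes is a hypothetical helper name, and it assumes each demote_size file holds a value of the form "<N>kB", as in the output shown above.

```shell
# Hypothetical helper: for each hugetlb pool, print its demote_size and how
# many smaller pages one demoted page would become.
# $1: hugepages sysfs root (defaults to the directory used in the example above)
show_demote_sizes() {
    root="${1:-/sys/kernel/mm/hugepages}"
    for dir in "$root"/hugepages-*kB; do
        # Skip pools without a demote_size file (e.g. a kernel without this series).
        [ -r "$dir/demote_size" ] || continue
        name="${dir##*/}"            # e.g. hugepages-1048576kB
        size="${name#hugepages-}"    # e.g. 1048576kB
        size="${size%kB}"            # e.g. 1048576
        target="$(cat "$dir/demote_size")"   # assumed to look like 2048kB
        printf '%s -> %s (%s pages each)\n' \
            "$name" "$target" "$((size / ${target%kB}))"
    done
}
```

With the values shown above, the 1GB pool would be reported as demoting to 2048kB, 512 pages each.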
>> - demote Writable number of hugetlb pages to be demoted
>
> So how does this interface work? Write the target size to
> `demote_size', write the number of to-be-demoted larger pages to
> `demote' and then the operation happens?
>
> If so, how does one select which size pages should be selected for
> the demotion?
The location in the sysfs directory tells you what size pages will be
demoted. For example,
    echo 5 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
says to demote 5 1GB pages.
demote files are also in node specific directories so you can even pick
huge pages from a specific node.
    echo 5 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/demote
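
The two path layouts above can be wrapped in a small helper so that the pool size and the optional node are plain arguments. This is a sketch only: demote_path and demote_pages are hypothetical names, and the write needs root on a kernel that carries this series.

```shell
# Hypothetical helper: print the path of the demote file for a pool.
# $1: pool page size in kB (e.g. 1048576 for 1GB pages)
# $2: optional NUMA node number; empty selects the system-wide pool
demote_path() {
    if [ -n "${2:-}" ]; then
        printf '/sys/devices/system/node/node%s/hugepages/hugepages-%skB/demote\n' "$2" "$1"
    else
        printf '/sys/kernel/mm/hugepages/hugepages-%skB/demote\n' "$1"
    fi
}

# Hypothetical helper: ask the kernel to demote $1 pages of size $2 kB,
# optionally taking them only from node $3.
demote_pages() {
    echo "$1" > "$(demote_path "$2" "${3:-}")"
}
```

For example, `demote_pages 5 1048576 1` would issue the same write as the node1 command above.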
>
> And how does one know the operation has completed so the sysfs files
> can be reloaded for another operation?
>
When the write to the file is complete, the operation has completed.
Not exactly sure what you mean by reloading the sysfs files for
another operation?
>> Only hugetlb pages which are free at the time of the request can be demoted.
>> Demotion does not add to the complexity of surplus pages. Demotion also honors
>> reserved huge pages. Therefore, when a value is written to the sysfs demote
>> file, that value is only the maximum number of pages which will be demoted.
>> It is possible fewer will actually be demoted.
>>
>> If demote_size is PAGESIZE, demote will simply free pages to the buddy
>> allocator.
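
Since the value written to demote is only an upper bound, one way to see how many pages were really demoted is to compare the pool's nr_hugepages before and after the write. A minimal sketch, with two assumptions that are not stated in this thread: that nr_hugepages of the source pool drops by one for each demoted page, and that nothing else resizes the pool in between. demote_and_report is a hypothetical helper name.

```shell
# Hypothetical helper: write $2 to the demote file in pool directory $1 and
# print how far nr_hugepages dropped across the write.
# $1: pool directory, e.g. /sys/kernel/mm/hugepages/hugepages-1048576kB
# $2: maximum number of pages to demote
demote_and_report() {
    pool="$1"
    before=$(cat "$pool/nr_hugepages")
    # The write returns only once the demote operation has finished.
    echo "$2" > "$pool/demote"
    after=$(cat "$pool/nr_hugepages")
    echo $((before - after))
}
```

If the printed number is smaller than the number requested, fewer free, unreserved pages were available than were asked for.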
>>
>> Real world use cases
>> --------------------
>> There are groups today using hugetlb pages to back VMs on x86. Their
>> use case is as described above. They have experienced the issues with
>> performance and not necessarily getting the excepted number of smaller huge
>
> ("expected")
yes, will fix typo
>
>> pages after free/allocate cycle.
>>
>
> It seems odd to add the interfaces in patch 1 then document them in
> patch 5. Why not add-and-document in a single patch?
>
Yes, makes sense. Will combine these.
--
Mike Kravetz