[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <BFE00D00-1DFF-4AF8-BB10-5B76BF06E012@linux.dev>
Date: Mon, 11 Sep 2023 11:10:18 +0800
From: Muchun Song <muchun.song@...ux.dev>
To: Mike Kravetz <mike.kravetz@...cle.com>
Cc: Linux-MM <linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>,
Muchun Song <songmuchun@...edance.com>,
Joao Martins <joao.m.martins@...cle.com>,
Oscar Salvador <osalvador@...e.de>,
David Hildenbrand <david@...hat.com>,
Miaohe Lin <linmiaohe@...wei.com>,
David Rientjes <rientjes@...gle.com>,
Anshuman Khandual <anshuman.khandual@....com>,
Naoya Horiguchi <naoya.horiguchi@...ux.dev>,
Barry Song <song.bao.hua@...ilicon.com>,
Michal Hocko <mhocko@...e.com>,
Matthew Wilcox <willy@...radead.org>,
Xiongchun Duan <duanxiongchun@...edance.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list
of pages
> On Sep 9, 2023, at 04:53, Mike Kravetz <mike.kravetz@...cle.com> wrote:
>
> On 09/07/23 11:54, Mike Kravetz wrote:
>> On 09/07/23 11:33, Muchun Song wrote:
>>>
>>>
>>>> On Sep 7, 2023, at 05:12, Mike Kravetz <mike.kravetz@...cle.com> wrote:
>>>>
>>>> On 09/06/23 16:07, Muchun Song wrote:
>>>>>> On Sep 6, 2023, at 15:33, Muchun Song <muchun.song@...ux.dev> wrote:
>>>>>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>>>>>> When removing hugetlb pages from the pool, we first create a list
>>>>>>> of removed pages and then free those pages back to low level allocators.
>>>>>>> Part of the 'freeing process' is to restore vmemmap for all base pages
>>>>>>> if necessary. Pass this list of pages to a new routine
>>>>>>> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
>>>>>>> performed in bulk.
>>>>>>>
>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@...cle.com>
>>>>>>> ---
>>>>>>> mm/hugetlb.c | 3 +++
>>>>>>> mm/hugetlb_vmemmap.c | 13 +++++++++++++
>>>>>>> mm/hugetlb_vmemmap.h | 5 +++++
>>>>>>> 3 files changed, 21 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index 554be94b07bd..dd2dbc256172 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
>>>>>>> {
>>>>>>> struct folio *folio, *t_folio;
>>>>>>> + /* First restore vmemmap for all pages on list. */
>>>>>>> + hugetlb_vmemmap_restore_folios(h, list);
>>>>>>> +
>>>>>>> list_for_each_entry_safe(folio, t_folio, list, lru) {
>>>>>>> update_and_free_hugetlb_folio(h, folio, false);
>>>>>>> cond_resched();
>>>>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>>>>> index ac5577d372fe..79de984919ef 100644
>>>>>>> --- a/mm/hugetlb_vmemmap.c
>>>>>>> +++ b/mm/hugetlb_vmemmap.c
>>>>>>> @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>>>>>>> return ret;
>>>>>>> }
>>>>>>> +/*
>>>>>>> + * This function will attempt to resore vmemmap for a list of folios. There
>>>>>>> + * is no guarantee that restoration will be successful for all or any folios.
>>>>>>> + * This is used in bulk operations, and no feedback is given to the caller.
>>>>>>> + */
>>>>>>> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
>>>>>>> +{
>>>>>>> + struct folio *folio;
>>>>>>> +
>>>>>>> + list_for_each_entry(folio, folio_list, lru)
>>>>>>> + (void)hugetlb_vmemmap_restore(h, &folio->page);
>>>>>>
>>>>>> I am curious about the purpose of "void" here, seems it it not necessnary,
>>>>>> ritgh? We cound see so many palces where we do not add the void if the caller
>>>>>> does not care about the return value of the callee.
>>>>>
>>>>> Another question: should we stop restoring vmemmap pages when
>>>>> hugetlb_vmemmap_restore() fails? In which case, I suspect there
>>>>> is no memory probably, there is no need to continue, right?
>>>>
>>>> Recall that the list of hugetlb pages may be from multiple nodes. My first
>>>> thought was that we should continue because memory allocation may fail on one
>>>> node but succeed on another. However, with
>>>> https://lore.kernel.org/linux-mm/20230905031312.91929-1-yuancan@huawei.com/
>>>> memory allocation should fall back to other nodes. So, yes I do believe it
>>>> would make sense to stop when hugetlb_vmemmap_restore returns ENOMEM as
>>>> we are unlikely to make forward progress.
>>>
>>> Agree.
>>>
>>>>
>>>> Today's behavior will try to restore vmemmap for all pages. No stopping
>>>> on error.
>>>>
>>>> I have mixed thoughts on this. Quitting on error 'seems reasonable'.
>>>> However, if we continue we 'might' be able to allocate vmemmap for one
>>>> hugetlb page. And, if we free one hugetlb page that should provide
>>>> vmemmap for several more and we may be able to free most pages on the
>>>> list.
>>>
>>> Yes. A good point. But there should be a non-optimized huge page been
>>> freed somewhere in parallel, otherwise we still cannot allocate memory.
>>
>> It does not have to be another huge page being freed in parallel. It
>> could be that when allocating vmemmap for a 1G hugetlb page we were one
>> (4K) page short of what was required. If someone else frees a 4K page,
>> freeing the next 1G page may succeed.
Right. I missed that.
>> --
>> Mike Kravetz
>>
>>> However, the freeing operation happens after hugetlb_vmemmap_restore_folios.
>>> If we want to handle this, we should rework update_and_free_pages_bulk()
>>> to do a try when at least a huge pages is freed.
>
> This seemed familiar. Recall this patch which Muchun Reviewed and James Acked,
> https://lore.kernel.org/linux-mm/20230718004942.113174-3-mike.kravetz@oracle.com/
>
> If we can not restore vmemmap for a page, then it must be turned into a
> surplus huge page. In this patch (not the previous one referenced), we
> will try to restore vmemmap one more time in a later call to
> update_and_free_hugetlb_folio. Certainly, we do not want to try twice!
>
> My 'plan' is to include the previous patch as part of this series. With
> that patch in place, the list_for_each_entry calls to hugetlb_vmemmap_restore
> can be replaced with a call to hugetlb_vmemmap_restore_folios. We would
> change the behavior of hugetlb_vmemmap_restore_folios to return an error
> instead of being of type void. If an error is returned, then we will
> make another pass through the list looking for unoptimized pages and add
> them as surplus.
>
> I think it best if we try to restore vmemmap at least once before
> converting to a surplus page.
Make sense.
Muchun
> --
> Mike Kravetz
Powered by blists - more mailing lists