Date: Mon, 5 Feb 2024 21:06:17 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: Michal Hocko <mhocko@...e.com>
Cc: akpm@...ux-foundation.org, muchun.song@...ux.dev, osalvador@...e.de,
 david@...hat.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when
 dissolving the old hugetlb



On 2/5/2024 5:15 PM, Michal Hocko wrote:
> On Mon 05-02-24 10:50:32, Baolin Wang wrote:
>>
>>
>> On 2/2/2024 5:55 PM, Michal Hocko wrote:
>>> On Fri 02-02-24 17:29:02, Baolin Wang wrote:
>>>> On 2/2/2024 4:17 PM, Michal Hocko wrote:
>>> [...]
>>>>>> Agree. So how about the following change?
>>>>>> (1) disallow falling back to other nodes when handling in-use hugetlb, which
>>>>>> can ensure consistent behavior in handling hugetlb.
>>>>>
>>>>> I can see two cases here: alloc_contig_range, which is an internal kernel
>>>>> user, and then we have memory offlining. The former shouldn't break the
>>>>> per-node hugetlb pool reservations, while the latter might not have any other
>>>>> choice (the whole node could go offline, and that resembles breaking cpu
>>>>> affinity if the cpu is gone).
>>>>
>>>> IMO, this is not always true for memory offlining: when handling a free
>>>> hugetlb, it disallows falling back, which is inconsistent.
>>>
>>> It's been some time since I've looked into that code, so I am not 100% sure how
>>> the free pool is currently handled. The above is the way I _think_ it
>>> should work from the usability POV.
>>
>> Please see alloc_and_dissolve_hugetlb_folio().
> 
> This is the alloc_contig_range path rather than the offlining path. Page
> offlining migrates in-use pages to a _different_ node (as long as there is one
> available) via do_migrate_range and it dissolves free hugetlb pages via
> dissolve_free_huge_pages. So the node's pool is altered, but as this is
> an explicit offlining operation I think there is no other choice.
>   
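(Just to make sure we are on the same page, the offlining flow you describe
is roughly the following, heavily simplified from my reading of
mm/memory_hotplug.c; the exact loop and signatures vary across kernel
versions:)

	/* Simplified sketch of offline_pages(), not the literal code. */
	do {
		/*
		 * In-use pages (including in-use hugetlb) are migrated
		 * away; the migration target is not restricted to the
		 * node being offlined.
		 */
		ret = scan_movable_pages(pfn, end_pfn, &pfn);
		if (!ret)
			do_migrate_range(pfn, end_pfn);
	} while (!ret);

	/*
	 * Free hugetlb pages left in the range are dissolved, which
	 * shrinks the pool on this node.
	 */
	ret = dissolve_free_huge_pages(start_pfn, end_pfn);
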
>>>> Not only memory offlining, but also longterm pinning (in
>>>> migrate_longterm_unpinnable_pages()) and memory failure (in
>>>> soft_offline_in_use_page()) can break the per-node hugetlb pool
>>>> reservations.
>>>
>>> Bad
>>>
>>>>> Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
>>>>> users' expectations, but hugetlb migration already tries hard to allocate
>>>>> a replacement hugetlb, so the system must be under heavy memory
>>>>> pressure if that fails, right? Is it possible that the hugetlb
>>>>> reservation is just overshot here? Maybe the memory is just terribly
>>>>> fragmented, though?
>>>>>
>>>>> Could you be more specific about numbers in your failure case?
>>>>
>>>> Sure. Our customer's machine contains several NUMA nodes, and the system
>>>> reserves a large amount of CMA memory, occupying 50% of the total memory,
>>>> which is used for the virtual machine; meanwhile it also reserves lots of
>>>> hugetlb, which can occupy 50% of the CMA. So before the virtual machine
>>>> starts, hugetlb can use 50% of the CMA, but when the virtual machine
>>>> starts, the CMA will be used by the virtual machine and the hugetlb must
>>>> be migrated out of the CMA.
>>>
>>> Would it make more sense for hugetlb pages to _not_ use CMA in this
>>> case? I mean, would it be better overall if the hugetlb pool were
>>> preallocated before the CMA is reserved? I do realize this is just
>>> working around the current limitations, but it could be better than
>>> nothing.
>>
>> In this case, the CMA area is large and occupies 50% of the total memory.
>> The purpose is that, if no virtual machines are launched, CMA memory
>> can be used by hugetlb as much as possible. Once virtual machines need
>> to be launched, as much CMA memory as possible must be made available to
>> them, for example by migrating hugetlb out of CMA memory.
> 
> I am afraid that your assumption doesn't correspond to the existing
> implementation. hugetlb allocations are movable, but they are certainly
> not as movable as regular pages. So you have to budget a bigger
> margin of spare memory to achieve more reliable movability.
> 
> Have you tried to handle this from userspace? It seems that you know
> when there is CMA demand, so you could rebalance hugetlb pools at
> that moment, no?

Maybe this can help, but this just mitigates the issue ...
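
(To be concrete, the userspace rebalancing you describe would boil down to
writing the per-node pool sizes via sysfs before the VM starts. A minimal
sketch; the 2MB hugepage size and the node/count values below are just
made-up examples:)

	/* rebalance.c - resize the hugetlb pool on one node via sysfs.
	 * Build: cc -o rebalance rebalance.c
	 * Run as root, e.g.: ./rebalance 0 128
	 */
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		char path[256];
		FILE *f;
		int node, count;

		if (argc != 3) {
			fprintf(stderr, "usage: %s <node> <nr_hugepages>\n", argv[0]);
			return 1;
		}
		node = atoi(argv[1]);
		count = atoi(argv[2]);

		/* Per-node pool size for 2MB hugetlb pages (size is an example). */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/hugepages/hugepages-2048kB/nr_hugepages",
			 node);

		f = fopen(path, "w");
		if (!f) {
			perror(path);
			return 1;
		}
		fprintf(f, "%d\n", count);
		fclose(f);
		return 0;
	}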

>> After more thinking, I still believe we should drop the __GFP_THISNODE flag in
>> alloc_and_dissolve_hugetlb_folio(). Firstly, not only can it potentially cause
>> CMA allocation to fail, but it might also cause memory offlining to fail, as
>> I said in the commit message. Secondly, there have been no user reports
>> complaining about breaking the per-node hugetlb pool, although longterm
>> pinning, memory failure, and memory offlining can potentially break the
>> per-node hugetlb pool.
> 
> It is quite possible that traditional users (like large DBs) do not use
> CMA heavily, so such a problem has not been observed so far. That doesn't mean
> those problems do not matter.

CMA is just one case; as I mentioned before, other situations can also
break the per-node hugetlb pool now.
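
(That is because those paths allocate the replacement hugetlb through the
generic migration target, which does not pin the allocation to the source
node. Roughly, simplified from my reading of alloc_migration_target() in
mm/migrate.c; details vary by kernel version:)

	/* Simplified sketch, not the literal code. */
	if (folio_test_hugetlb(src)) {
		struct hstate *h = folio_hstate(src);

		/*
		 * No __GFP_THISNODE here: the mask allows falling back to
		 * other nodes, so the per-node pool can end up changed.
		 */
		gfp_mask = htlb_modify_alloc_mask(h, gfp_mask);
		return alloc_hugetlb_folio_nodemask(h, nid, mtc->nmask,
						    gfp_mask);
	}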

Let's focus on the main point: why should we keep inconsistent behavior
for handling free and in-use hugetlb in alloc_contig_range()?
That's really confusing.
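
To make the proposal concrete, the change I have in mind is essentially the
following, at the top of alloc_and_dissolve_hugetlb_folio() (the "current"
line is from my reading of mm/hugetlb.c; the exact context may differ):

	/* current: the replacement must come from the same node */
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

	/*
	 * proposed: allow falling back to other nodes, matching how in-use
	 * hugetlb is migrated in the alloc_contig_range() path.
	 */
	gfp_t gfp_mask = htlb_alloc_mask(h);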
