Date: Fri, 2 Feb 2024 10:55:05 +0100
From: Michal Hocko <mhocko@...e.com>
To: Baolin Wang <baolin.wang@...ux.alibaba.com>
Cc: akpm@...ux-foundation.org, muchun.song@...ux.dev, osalvador@...e.de,
	david@...hat.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when
 dissolving the old hugetlb

On Fri 02-02-24 17:29:02, Baolin Wang wrote:
> On 2/2/2024 4:17 PM, Michal Hocko wrote:
[...]
> > > Agree. So how about the change below?
> > > (1) disallow falling back to other nodes when handling in-use hugetlb pages,
> > > which can ensure consistent behavior in handling hugetlb.
> > 
> > I can see two cases here. alloc_contig_range which is an internal kernel
> > user and then we have memory offlining. The former shouldn't break the
> > per-node hugetlb pool reservations, while the latter might not have any
> > other choice (the whole node could go offline, and that resembles breaking
> > cpu affinity if the cpu is gone).
> 
> IMO, that is not always true for memory offlining: when handling a free
> hugetlb page, it disallows falling back to other nodes, which is inconsistent.

It's been some time since I've looked into that code, so I am not 100% sure
how the free pool is currently handled. The above is the way I _think_ it
should work from the usability POV.
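
To make the distinction above concrete, here is a toy, userspace-compilable
sketch of the policy I have in mind. It is not the actual mm/hugetlb.c code;
the flag values, the migration_reason enum and the dissolve_gfp_for() helper
are all made up for illustration. The idea is to keep the node constraint
for alloc_contig_range and to drop it only when the node itself is going
away:

#include <stdio.h>

/* Toy stand-ins for the real gfp flags; the values are illustrative only. */
typedef unsigned int gfp_t;
#define GFP_HIGHUSER_MOVABLE	0x1u
#define __GFP_THISNODE		0x2u

/* Hypothetical reasons a hugetlb folio gets dissolved/replaced. */
enum migration_reason {
	REASON_CONTIG_RANGE,	/* alloc_contig_range(): internal kernel user */
	REASON_MEMORY_OFFLINE,	/* memory hot-remove: the node may disappear  */
};

/*
 * Keep the replacement allocation on the same node for alloc_contig_range()
 * so per-node pool reservations are preserved, but allow falling back to
 * other nodes when the whole node is going away.
 */
static gfp_t dissolve_gfp_for(enum migration_reason reason)
{
	gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE;

	if (reason == REASON_MEMORY_OFFLINE)
		gfp &= ~__GFP_THISNODE;

	return gfp;
}

int main(void)
{
	printf("contig range:   thisnode=%u\n",
	       (unsigned)!!(dissolve_gfp_for(REASON_CONTIG_RANGE) & __GFP_THISNODE));
	printf("memory offline: thisnode=%u\n",
	       (unsigned)!!(dissolve_gfp_for(REASON_MEMORY_OFFLINE) & __GFP_THISNODE));
	return 0;
}

Whether the real code should thread such a reason down to the allocation
site is of course a separate question.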

> Not only memory offlining, but also longterm pinning (in
> migrate_longterm_unpinnable_pages()) and memory failure (in
> soft_offline_in_use_page()) can break the per-node hugetlb pool
> reservations.

That is bad as well.

> > Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
> > users' expectations, but hugetlb migration already tries hard to allocate
> > a replacement hugetlb page, so the system must be under heavy memory
> > pressure if that fails, right? Is it possible that the hugetlb
> > reservation is just overshot here? Maybe the memory is just terribly
> > fragmented though?
> > 
> > Could you be more specific about numbers in your failure case?
> 
> Sure. Our customer's machine contains several NUMA nodes, and the system
> reserves a large amount of CMA memory, occupying 50% of the total memory,
> which is used for the virtual machine; it also reserves lots of hugetlb
> pages, which can occupy 50% of the CMA. So before the virtual machine is
> started, hugetlb can use 50% of the CMA, but when the virtual machine
> starts, the CMA will be used by the virtual machine and the hugetlb pages
> have to be migrated out of the CMA.

Would it make more sense for hugetlb pages to _not_ use CMA in this
case? I mean, would it be better overall if the hugetlb pool was
preallocated before the CMA is reserved? I do realize this is just
working around the current limitations, but it could be better than
nothing.
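
For example (assuming the pool is currently placed in CMA via something like
the hugetlb_cma= boot option, and that boot-time preallocation fits your
setup), booting with something along the lines of

	hugepagesz=1G hugepages=<number of pages>

instead of sizing the pool through hugetlb_cma= would keep the hugetlb pages
outside the CMA area entirely, so nothing would have to be migrated when the
virtual machine claims the CMA.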

> Because there are several nodes in the system, one node's memory can be
> exhausted, which will make the hugetlb migration fail with the
> __GFP_THISNODE flag.

Is the workload NUMA aware? I.e. do you bind virtual machines to
specific nodes?

-- 
Michal Hocko
SUSE Labs
