Message-ID: <d3b98e2b-2148-172a-358c-e7ab1e444c3b@oracle.com>
Date: Mon, 4 Apr 2022 16:48:35 -0700
From: Mike Kravetz <mike.kravetz@...cle.com>
To: David Hildenbrand <david@...hat.com>,
Peng Liu <liupeng256@...wei.com>, akpm@...ux-foundation.org,
yaozhenguo1@...il.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH v2 1/2] hugetlb: Fix hugepages_setup when deal with
pernode
On 4/4/22 03:41, David Hildenbrand wrote:
> On 01.04.22 19:23, Mike Kravetz wrote:
>> On 4/1/22 03:43, David Hildenbrand wrote:
>>> On 01.04.22 12:12, Peng Liu wrote:
>>>> Hugepages can be specified per node since "hugetlbfs: extend
>>>> the definition of hugepages parameter to support node allocation",
>>>> but the following problem is observed.
>>>>
>>>> Confusing behavior is observed when both 1G and 2M hugepages are set
>>>> after "numa=off".
>>>> cmdline hugepage settings:
>>>> hugepagesz=1G hugepages=0:3,1:3
>>>> hugepagesz=2M hugepages=0:1024,1:1024
>>>> results:
>>>> HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
>>>> HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages
>>>>
>>>> Furthermore, confusing behavior can also be observed when an invalid
>>>> node follows a valid node.
>>>>
>>>> To fix this, hugetlb_hstate_alloc_pages should be called even when
>>>> hugepages_setup jumps to the invalid label.
>>>
>>> Shouldn't we bail out if someone requests node-specific allocations but
>>> we are not running with NUMA?
>>
>> I thought about this as well, and could not come up with a good answer.
>> Certainly, nobody SHOULD specify both 'numa=off' and ask for node specific
>> allocations on the same command line. I would have no problem bailing out
>> in such situations. But, I think that would also require the hugetlb command
>> line processing to look for such situations.
>
> Yes. Right now I see
>
> 	if (tmp >= nr_online_nodes)
> 		goto invalid;
>
> Which seems a little strange, because IIUC, it's the number of online
> nodes, which is completely wrong with a sparse online bitmap. Just
> imagine node 0 and node 2 are online, and node 1 is offline. Assuming
> that "node < 2" is valid is wrong.
>
> Why don't we check for node_online() and bail out if that is not the
> case? Is it too early for that check? But why does comparing against
> nr_online_nodes work, then?
>
>
> Having said that, I'm not sure if all usage of nr_online_nodes in
> mm/hugetlb.c is wrong, with a sparse online bitmap. Outside of that,
> it's really just used for "nr_online_nodes > 1". I might be wrong, though.
I think you are correct. My bad for not being more thorough in reviewing
the original patch that added this code. My incorrect assumption was that
a sparse node map was only possible via offline operations which could not
happen this early in boot. I now see that a sparse map can be presented
by fw/bios/etc. So, yes I do believe we need to check for online nodes.
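
To make the sparse-map case concrete, here is a minimal userspace sketch
(not kernel code) of David's example with nodes 0 and 2 online and node 1
offline. Every model_* name is hypothetical and exists only for this
illustration; the real kernel helpers (node_online(), nr_online_nodes) are
only imitated here. It shows that a count-based check accepts offline node 1
and rejects online node 2, while a bitmap-based check gets both right.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is NOT the kernel's nodemask API. */
#define MODEL_MAX_NODES 8

/* Bit n set means node n is online: nodes 0 and 2 online, node 1 offline. */
static const unsigned int model_online_mask = (1u << 0) | (1u << 2);

/* Number of online nodes in the model (the kind of value a count holds). */
static unsigned int model_nr_online(void)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < MODEL_MAX_NODES; i++)
		if (model_online_mask & (1u << i))
			n++;
	return n;
}

/* Count-based check, mirroring "if (tmp >= nr_online_nodes) goto invalid;". */
static bool model_valid_by_count(unsigned int node)
{
	return node < model_nr_online();
}

/* Bitmap-based check, in the spirit of a node_online() test. */
static bool model_valid_by_mask(unsigned int node)
{
	return node < MODEL_MAX_NODES &&
	       (model_online_mask & (1u << node)) != 0;
}

/* Asserts the outcomes for the sparse map described above. */
static void model_self_check(void)
{
	assert(model_nr_online() == 2);
	assert(model_valid_by_count(1));   /* offline node 1 wrongly accepted */
	assert(!model_valid_by_count(2));  /* online node 2 wrongly rejected */
	assert(!model_valid_by_mask(1));   /* offline node 1 rejected */
	assert(model_valid_by_mask(2));    /* online node 2 accepted */
}
```

Calling model_self_check() from a test program passes silently. With a
dense map (nodes 0 and 1 online) both checks would agree, which is why the
count-based comparison looks fine on most machines.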
--
Mike Kravetz
>
>>
>> One could also argue that if there is only a single node (not numa=off on
>> command line) and someone specifies node local allocations we should bail.
>
> I assume "numa=off" is always parsed before hugepages_setup() is called,
> right? So we can just rely on the actual numa information.
>
>
>>
>> I was 'thinking' about a situation where we had multiple nodes and node
>> local allocations were 'hard coded' via grub or something. Then, for some
>> reason one node fails to come up on a reboot. Should we bail on all the
>> hugetlb allocations, or should we try to allocate on the still available
>> nodes?
>
> Depends on what "bail" means. Printing a warning and not allocating
> any further is certainly good enough for my taste :)
>
>>
>> When I went back and reread the reason for this change, I see that it is
>> primarily for 'some debugging and test cases'.
>>
>>>
>>> What's the result after your change?
>>>
>>>>
>>>> Cc: <stable@...r.kernel.org>
>>>
>>> I am not sure if this is really stable material.
>>
>> Right now, we partially and inconsistently process node specific allocations
>> if there are missing nodes. We allocate 'regular' hugetlb pages on existing
>> nodes. But, we do not allocate gigantic hugetlb pages on existing nodes.
>>
>> I believe this is worth fixing in stable.
>
> I am skeptical.
>
> https://www.kernel.org/doc/Documentation/process/stable-kernel-rules.rst
>
> " - It must fix a real bug that bothers people (not a, "This could be a
> problem..." type thing)."
>
> While the current behavior is suboptimal, it's certainly not an urgent
> bug (?) and the kernel will boot and work just fine. As you mentioned
> "nobody SHOULD specify both 'numa=off' and ask for node specific
> allocations on the same command line.", this is just a corner case.
>
> Adjusting it upstream -- okay. Backporting to stable? I don't think so.
>