[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ccf05348-e1b6-58a7-2626-701e60b662e6@redhat.com>
Date: Tue, 2 Nov 2021 14:41:25 +0100
From: David Hildenbrand <david@...hat.com>
To: Michal Hocko <mhocko@...e.com>
Cc: Alexey Makhalov <amakhalov@...are.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"stable@...r.kernel.org" <stable@...r.kernel.org>,
Oscar Salvador <OSalvador@...e.com>
Subject: Re: [PATCH] mm: fix panic in __alloc_pages
On 02.11.21 14:25, Michal Hocko wrote:
> On Tue 02-11-21 13:39:06, David Hildenbrand wrote:
>>>> Yes, but a zonelist cannot be correct for an offline node, where we might
>>>> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
>>>> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().
>>>
>>> Yes, that is what I had in mind. We are talking about two things here.
>>> Memoryless nodes and offline nodes. The later sounds like a bug to me.
>>
>> Agreed. memoryless nodes should just have proper zonelists -- which
>> seems to be the case.
>>
>>>> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
>>>> (VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
>>>> preferred nid.
>>>
>>> Historically, those allocation interfaces were not trying to be robust
>>> against wrong inputs because that adds cpu cycles for everybody for
>>> "what if buggy" code. This has worked (surprisingly) well. Memory less
>>> nodes have brought in some confusion but this is still something that we
>>> can address on a higher level. Nobody give arbitrary nodes as an input.
>>> cpu_to_node might be tricky because it can point to a memory less node
>>> which along with __GFP_THISNODE is very likely not something anybody
>>> wants. Hence cpu_to_mem should be used for allocations. I hate we have
>>> two very similar APIs...
>>
>> To be precise, I'm wondering if we should do:
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 55b2ec1f965a..8c49b88336ee 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -565,7 +565,7 @@ static inline struct page *
>> __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> {
>> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> - VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>> + VM_WARN_ON(!node_online(nid));
>>
>> return __alloc_pages(gfp_mask, order, nid, NULL);
>> }
>>
>> (Or maybe VM_BUG_ON)
>>
>> Because it cannot possibly work and we'll dereference NULL later.
>
> VM_BUG_ON would be silent for most configurations and crash would happen
> even without it so I am not sure about the additional value. VM_WARN_ON
> doesn't really add much on top - except it would crash in some
> configurations. If we really care to catch this case then we would have
> to do a reasonable fallback with a printk note and a dumps stack.
As I learned, VM_BUG_ON and friends are active for e.g., Fedora, which
can catch quite some issues early, before they end up in enterprise
distro kernels. I think it has value.
> Something like
> if (unlikely(!node_online(nid))) {
> pr_err("%d is an offline numa node and using it is a bug in a caller. Please report...\n");
> dump_stack();
> nid = numa_mem_id();
> }
>
> But again this is adding quite some cycles to a hotpath of the page
> allocator. Is this worth it?
Don't think a fallback makes sense.
>
>>> But something seems wrong in this case. cpu_to_node shouldn't return
>>> offline nodes. That is just a land mine. It is not clear to me how the
>>> cpu has been brought up so that the numa node allocation was left
>>> behind. As pointed in other email add_cpu resp. cpu_up is not it.
>>> Is it possible that the cpu bring up was only half way?
>>
>> I tried to follow the code (what sets a CPU present, what sets a CPU
>> online, when do we update cpu_to_node() mapping) and IMHO it's all a big
>> mess. Maybe it's clearer to people familiar with that code, but CPU
>> hotplug in general seems to be a confusing piece of (arch-specific) code.
>
> Yes there are different arch specific parts that make this quite hard to
> follow.
>
> I think we want to learn how exactly Alexey brought that cpu up. Because
> his initial thought on add_cpu resp cpu_up doesn't seem to be correct.
> Or I am just not following the code properly. Once we know all those
> details we can get in touch with cpu hotplug maintainers and see what
> can we do.
Yes.
>
> Btw. do you plan to send a patch for pcp allocator to use cpu_to_mem?
You mean s/cpu_to_node/cpu_to_mem/ or also handling offline nids?
cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
won't help for this very report.
> One last thing, there were some mentions of __GFP_THISNODE but I fail to
> see connection with the pcp allocator...
Me to. If pcpu would be using __GFP_THISNODE, we'd be hitting the
VM_WARN_ON but still crash.
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists