[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com>
Date: Mon, 6 Dec 2021 13:43:27 +0100
From: David Hildenbrand <david@...hat.com>
To: Michal Hocko <mhocko@...e.com>
Cc: Nico Pache <npache@...hat.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, akpm@...ux-foundation.org, shakeelb@...gle.com,
ktkhai@...tuozzo.com, shy828301@...il.com, guro@...com,
vbabka@...e.cz, vdavydov.dev@...il.com, raquini@...hat.com
Subject: Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on
offlined nodes
>> I'm not sure if it's rally whack-a-mole really applies. It's just the
>> for_each_node_* calls that need love. In other cases, we shouldn't
>> really stumble over an offline node.
>>
>> Someone deliberately decided to use "for_each_node()" instead of
>> for_each_online_node() without taking care of online vs. offline
>> semantics. That's just a BUG and needs fixing IMHO.
>
> Well, I would argue this is too much to ask for the API users. It is
> also a trap that is just too easy to fall into. Turning for_each_node
> into for_each_online_node will not solve the problem just by itself.
> As you have pointed out an offline node can become online and without a
> hotplug notifier there is no way for_each_online_node would be a proper
> fit for anything that allocates memory only for online nodes. See the
> trap? People think they are doing the right thing but they are not in
> fact.
I agree that it requires care in the caller. And some callers don't seem
to know that the nid they hold in their hand is garbage and should be
used with care. And this is pretty much undocumented.
>
> Now practically speaking !node_online should not apear node_online (note
> I am attentionally avoiding to say offline and online as that has a
> completely different semantic) shouldn't really happen for some
> architectures. x86 should allocate pgdat for each possible node. I do
> not know what was the architecture in this case but we already have
> another report for x86 that remains unexplained.
So we'd allocate the pgdat although all we want is just a zonelist. The
obvious alternative is to implement the fallback where reasonable -- for
example, in the page allocator. It knows the fallback order:
build_zonelists(). That's pretty much all we need the preferred_nid for.
So just making prepare_alloc_pages()/node_zonelist() deal with a missing
pgdat could make sense as well. Something like:
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b976c4177299..2d2649e78766 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -508,9 +508,14 @@ static inline int gfp_zonelist(gfp_t flags)
*
* For the case of non-NUMA systems the NODE_DATA() gets optimized to
* &contig_page_data at compile-time.
+ *
+ * If the node does not have a pgdat yet, returns the zonelist of the
+ * first online node.
*/
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
+ if (unlikely(!NODE_DATA(nid)))
+ nid = first_online_node;
return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}
But of course, there might be value in a proper node-aware fallback list
as we have in build_zonelists() -- but it remains questionable if the
difference for these corner cases would be relevant in practice.
Further, if we could have thousands of nodes, we'd have to update each
and every one when building zone lists ...
Removing the hotadd_new_pgdat() stuff does sound appealing, though.
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists