Message-ID: <CAHbLzkouJkixT0X_uGTrFj_qCyYikpr2j3LOo50rsY_P9OS8Xw@mail.gmail.com>
Date: Wed, 2 Nov 2022 13:08:08 -0700
From: Yang Shi <shy828301@...il.com>
To: "Zach O'Keefe" <zokeefe@...gle.com>
Cc: Michal Hocko <mhocko@...e.com>, akpm@...ux-foundation.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Andrew Davidoff <davidoff@...mf.net>,
Bob Liu <lliubbo@...il.com>
Subject: Re: [PATCH] mm: don't warn if the node is offlined
On Wed, Nov 2, 2022 at 11:59 AM Zach O'Keefe <zokeefe@...gle.com> wrote:
>
> On Wed, Nov 2, 2022 at 11:18 AM Yang Shi <shy828301@...il.com> wrote:
> >
> > On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <mhocko@...e.com> wrote:
> > >
> > > On Wed 02-11-22 10:36:07, Yang Shi wrote:
> > > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <mhocko@...e.com> wrote:
> > > > >
> > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote:
> > > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@...e.com> wrote:
> > > > > > >
> > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> > > > > > > [...]
> > > > > > > > This is slightly tangential - but I don't want to send a new mail
> > > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > > > > > > > explicit node vs having hpage_collapse_find_target_node() set a
> > > > > > > > nodemask. We could then provide fallback nodes for ties, or if some
> > > > > > > > node contained > some threshold number of pages.
> > > > > > >
> > > > > > > I would simply go with something like this (not even compile tested):
> > > > > >
> > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
> > > > > > not sure whether it is worth making the code more complicated for such
> > > > > > micro optimization or not. Removing __GFP_THISNODE or even removing
> > > > > > the node balance code should be fine too IMHO. TBH I doubt there would
> > > > > > be any noticeable difference.
> > > > >
> > > > > I do agree that an explicit node (quasi-)round robin sounds
> > > > > over-engineered. It makes some sense to try to target the
> > > > > prevalent node though, because this code can be executed from
> > > > > khugepaged and can therefore allocate with a completely
> > > > > different affinity than the original fault.
> > > >
> > > > Yeah, the corner case comes from the node balance code; it just
> > > > tries to balance between multiple prevalent nodes. So you agree
> > > > to remove it, IIRC?
> > >
> > > Yeah, let's just collect all good nodes into a nodemask and keep
> > > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control
> > > so that you allocate it only once in the struct lifetime.
> >
> > Actually my intention is more aggressive, just remove that node balance code.
> >
>
> The balancing code dates back to the 2013 commit 9f1b868a13ac ("mm:
> thp: khugepaged: add policy for finding target node"), where it was
> added to satisfy "numactl --interleave=all". I don't know why any
> real workload would want this -- but there very well could be a valid
> use case. If not, I think it could be removed independently of what
> we do with __GFP_THISNODE and nodemask.
Hmm... if the code is used for interleave, I don't think a nodemask
could preserve the behavior, IIUC. The nodemask approach still tries
to allocate memory from the preferred node first and only falls back
to the other allowed nodes in the nodemask when the allocation on the
preferred node fails. But the round-robin style node balance tries to
distribute the THPs across the nodes evenly.
And I just realized that __GFP_THISNODE + nodemask should not be the
right combination, IIUC, right? __GFP_THISNODE disallows any fallback,
so the nodemask is actually useless.
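
To make that concrete, a minimal sketch of the combination in question
(hypothetical call site, not from the patch; collapse_alloc_sketch()
and the "allowed" mask are made up for illustration):

/*
 * With __GFP_THISNODE in the mask, the page allocator only walks the
 * zones of the preferred node, so any wider nodemask passed alongside
 * it is dead weight.
 */
static struct page *collapse_alloc_sketch(int target_node,
					  nodemask_t *allowed)
{
	gfp_t gfp = GFP_TRANSHUGE | __GFP_THISNODE;

	/* nodes in *allowed other than target_node are never tried */
	return __alloc_pages(gfp, HPAGE_PMD_ORDER, target_node, allowed);
}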
So I think we have narrowed it down to two options (the balance code
in question is sketched below):
1. Preserve the interleave behavior but bail out if the target node is
not online (it is also racy, but doesn't hurt)
2. Remove the node balance code entirely
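
For reference, the balancing logic under discussion looks roughly like
this (paraphrased from mm/khugepaged.c of that era, not copied
verbatim, so details may differ):

static int hpage_collapse_find_target_node(struct collapse_control *cc)
{
	int nid, target_node = 0, max_value = 0;

	/* pick the first node with the most pages hit during the scan */
	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (cc->node_load[nid] > max_value) {
			max_value = cc->node_load[nid];
			target_node = nid;
		}

	/*
	 * The round-robin "balance": on a tie, rotate past the node
	 * picked last time so equally loaded nodes take turns. This is
	 * the block option 2 would delete.
	 */
	if (target_node <= cc->last_target_node)
		for (nid = cc->last_target_node + 1;
		     nid < MAX_NUMNODES; nid++)
			if (max_value == cc->node_load[nid]) {
				target_node = nid;
				break;
			}

	cc->last_target_node = target_node;
	return target_node;
}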
>
> Balancing aside -- I haven't fully thought through what an ideal (and
> further overengineered) solution would be for NUMA, but one
> (perceived - not measured) issue that khugepaged might have
> (MADV_COLLAPSE doesn't have the choice) is systems with many, many
> nodes where the source pages are sprinkled across all of them. Should
> we collapse these pages into a single THP on the node with the most
> (but possibly still a small %) pages? Probably there are better
> candidates. So, maybe a khugepaged-only check for max_value >
> (HPAGE_PMD_NR >> 1) or something makes sense.
You have to allocate the THP on one node anyway, and I can't think of
a better way to make the node selection fairer. But I'd prefer to wait
until a real-life use case surfaces.
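
For what it's worth, the khugepaged-only threshold you describe could
be as small as the following (hypothetical and untested; it assumes
the collapse_control from your MADV_COLLAPSE series carries an
is_khugepaged flag):

	/*
	 * Threshold idea from above: for khugepaged only, give up when
	 * even the busiest node holds no more than half of the
	 * HPAGE_PMD_NR pages, instead of pulling a THP onto a node
	 * that owns only a small share of them.
	 */
	if (cc->is_khugepaged && max_value <= (HPAGE_PMD_NR >> 1))
		return NUMA_NO_NODE;	/* caller skips this collapse */
	return target_node;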
>
> > >
> > > And as mentioned in other reply it would be really nice to hide this
> > > under CONFIG_NUMA (in a standalong follow up of course).
> >
> > The hpage_collapse_find_target_node() function itself is defined under
> > CONFIG_NUMA.
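
(For reference, the shape is roughly the following, so the !NUMA build
already gets a trivial stub; sketched from memory, not verbatim:)

#ifdef CONFIG_NUMA
/* full version: the node_load scan plus the balancing shown earlier */
static int hpage_collapse_find_target_node(struct collapse_control *cc);
#else
static int hpage_collapse_find_target_node(struct collapse_control *cc)
{
	/* !NUMA stub: everything lives on node 0 */
	return 0;
}
#endif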
> >
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs