Message-ID: <87sg1an1je.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 22 Jun 2021 09:14:29 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Zi Yan <ziy@...dia.com>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, Yang Shi <shy828301@...il.com>,
Michal Hocko <mhocko@...e.com>, Wei Xu <weixugc@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Dan Williams <dan.j.williams@...el.com>,
"David Hildenbrand" <david@...hat.com>,
osalvador <osalvador@...e.de>
Subject: Re: [PATCH -V8 02/10] mm/numa: automatically generate node
migration order

Zi Yan <ziy@...dia.com> writes:

> On 19 Jun 2021, at 4:18, Huang, Ying wrote:
>
>> Zi Yan <ziy@...dia.com> writes:
>>
>>> On 18 Jun 2021, at 2:15, Huang Ying wrote:
[snip]
>>>> +/*
>>>> + * When memory fills up on a node, memory contents can be
>>>> + * automatically migrated to another node instead of
>>>> + * discarded at reclaim.
>>>> + *
>>>> + * Establish a "migration path" which will start at nodes
>>>> + * with CPUs and will follow the priorities used to build the
>>>> + * page allocator zonelists.
>>>> + *
>>>> + * The difference here is that cycles must be avoided. If
>>>> + * node0 migrates to node1, then neither node1, nor anything
>>>> + * node1 migrates to can migrate to node0.
>>>> + *
>>>> + * This function can run simultaneously with readers of
>>>> + * node_demotion[]. However, it can not run simultaneously
>>>> + * with itself. Exclusion is provided by memory hotplug events
>>>> + * being single-threaded.
>>>> + */
>>>> +static void __set_migration_target_nodes(void)
>>>> +{
>>>> + nodemask_t next_pass = NODE_MASK_NONE;
>>>> + nodemask_t this_pass = NODE_MASK_NONE;
>>>> + nodemask_t used_targets = NODE_MASK_NONE;
>>>> + int node;
>>>> +
>>>> + /*
>>>> + * Avoid any oddities like cycles that could occur
>>>> + * from changes in the topology. This will leave
>>>> + * a momentary gap when migration is disabled.
>>>> + */
>>>> + disable_all_migrate_targets();
>>>> +
>>>> + /*
>>>> + * Ensure that the "disable" is visible across the system.
>>>> + * Readers will see either a combination of before+disable
>>>> + * state or disable+after. They will never see before and
>>>> + * after state together.
>>>> + *
>>>> + * The before+after state together might have cycles and
>>>> + * could cause readers to do things like loop until this
>>>> + * function finishes. This ensures they can only see a
>>>> + * single "bad" read and would, for instance, only loop
>>>> + * once.
>>>> + */
>>>> + smp_wmb();
>>>> +
>>>> + /*
>>>> + * Allocations go close to CPUs, first. Assume that
>>>> + * the migration path starts at the nodes with CPUs.
>>>> + */
>>>> + next_pass = node_states[N_CPU];
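
(For context: the smp_wmb() above pairs with the readers of
node_demotion[].  A minimal sketch of such a reader, assuming a helper
like next_demotion_node() as introduced elsewhere in this series; take
this as illustration only, not the exact reader-side code:

        int next_demotion_node(int node)
        {
                /*
                 * Pairs with the smp_wmb() in
                 * __set_migration_target_nodes().  A racing reader sees
                 * either the old path or the "disabled" NUMA_NO_NODE
                 * state, never a before/after mix that could form a
                 * cycle.
                 */
                return READ_ONCE(node_demotion[node]);
        }

The READ_ONCE() is illustrative of the single-read requirement the
comment above describes.)
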
>>>
>>> Is there a plan to allow users to change where the migration
>>> path starts? Or, one step further, to provide an interface that
>>> lets users specify the demotion path themselves? Something like
>>> /sys/devices/system/node/node*/node_demotion.
>>
>> I don't think that's necessary, at least for now. Do you know of any
>> real-world use case for this?
>
> In our P9+Volta system, GPU memory is exposed as a NUMA node.
> For GPU workloads with a data size greater than the GPU memory size,
> it would be very helpful to allow pages in GPU memory to be
> migrated/demoted to CPU memory. Under your current assumption, GPU
> memory -> CPU memory demotion does not seem possible, right? The same
> applies to any system where device memory is exposed as a NUMA node
> and workloads run on the device while using CPU memory as a lower
> memory tier than the device memory.

Thanks a lot for your use case! A user-specified demotion path does
appear to be one possible way to satisfy your requirement, and I think
it could be enabled on top of this patchset. But we have no specific
plan to work on it, at least for now.
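
Just as an illustration of what "specified by users" could mean, a
per-node knob along the lines of your node_demotion suggestion might
look roughly like the sketch below. To be clear, this is hypothetical:
nothing like it exists in this series, the attribute name
demotion_target is made up here, and the store side would need the same
single-threaded exclusion as the hotplug-driven updates:

        static ssize_t demotion_target_show(struct device *dev,
                                            struct device_attribute *attr,
                                            char *buf)
        {
                return sysfs_emit(buf, "%d\n",
                                  READ_ONCE(node_demotion[dev->id]));
        }

        static ssize_t demotion_target_store(struct device *dev,
                                             struct device_attribute *attr,
                                             const char *buf, size_t count)
        {
                int target;

                if (kstrtoint(buf, 0, &target))
                        return -EINVAL;
                if (target != NUMA_NO_NODE &&
                    (target < 0 || target >= MAX_NUMNODES ||
                     !node_online(target)))
                        return -EINVAL;

                /* Would need the same exclusion as the hotplug updates. */
                WRITE_ONCE(node_demotion[dev->id], target);
                return count;
        }
        static DEVICE_ATTR_RW(demotion_target);

With that, "echo 1 > /sys/devices/system/node/node2/demotion_target"
would direct demotion from node 2 to node 1.  Again, just a sketch, not
a plan.
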
Best Regards,
Huang, Ying