Message-ID: <c06b4453-c533-a9ba-939a-8877fb301ad6@intel.com>
Date: Wed, 1 Jul 2020 09:48:22 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: David Rientjes <rientjes@...gle.com>,
Dave Hansen <dave.hansen@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
kbusch@...nel.org, yang.shi@...ux.alibaba.com,
ying.huang@...el.com, dan.j.williams@...el.com
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of
discard
On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@...ux.intel.com>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to
> circumvent their cpuset.mems?
In its current form, yes.
My current rationale is that, while this isn't as deferential to the
user/kernel ABI contract as it could be, it is good *overall* behavior.
The auto-migration only kicks in when the data is about to go away. So
while the user's data might be slower than they would like, it is *WAY*
faster than they deserve, because otherwise it would be off on the disk.
> Because we don't have a mapping of the page back to its allocation
> context (or the process context in which it was allocated), it seems like
> both are possible.
>
> So let's assume that migration nodes cannot be other DRAM nodes.
> Otherwise, memory pressure could be intentionally or unintentionally
> induced to migrate these pages to another node. Do we have such a
> restriction on migration nodes?
There's nothing explicit. On a normal, balanced system with a 1:1:1
relationship between CPU sockets, DRAM nodes, and PMEM nodes, the
restriction is implicit, since the migration path is one level deep and
goes from DRAM->PMEM. On some oddball system with a memory-only DRAM
node, though, that node might very well end up being a migration target.
>> Some places we would like to see this used:
>>
>> 1. Persistent memory being used as a slower, cheaper DRAM replacement
>> 2. Remote memory-only "expansion" NUMA nodes
>> 3. Resolving memory imbalances where one NUMA node is seeing more
>> allocation activity than another. This helps keep more recent
>> allocations closer to the CPUs on the node doing the allocating.
>
> (3) is the concerning one given the above if we are to use
> migrate_demote_mapping() for DRAM node balancing.
Yeah, agreed. That's the sketchiest of the three. :)
>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> + /*
>> + * 'mask' targets allocation only to the desired node in the
>> + * migration path, and fails fast if the allocation can not be
>> + * immediately satisfied. Reclaim is already active and heroic
>> + * allocation efforts are unwanted.
>> + */
>> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> + __GFP_MOVABLE;
>
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
> actually want to kick kswapd on the pmem node?
In my mental model, cold data flows from:
DRAM -> PMEM -> swap
Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
for kinda cold data, kswapd can be working on doing the PMEM->swap part
on really cold data.
...
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>> ; /* try to reclaim the page below */
>> }
>>
>> + rc = migrate_demote_mapping(page);
>> + /*
>> + * -ENOMEM on a THP may indicate either migration is
>> + * unsupported or there was not enough contiguous
>> + * space. Split the THP into base pages and retry the
>> + * head immediately. The tail pages will be considered
>> + * individually within the current loop's page list.
>> + */
>> + if (rc == -ENOMEM && PageTransHuge(page) &&
>> + !split_huge_page_to_list(page, page_list))
>> + rc = migrate_demote_mapping(page);
>> +
>> + if (rc == MIGRATEPAGE_SUCCESS) {
>> + unlock_page(page);
>> + if (likely(put_page_testzero(page)))
>> + goto free_it;
>> + /*
>> + * Speculative reference will free this page,
>> + * so leave it off the LRU.
>> + */
>> + nr_reclaimed++;
>
> nr_reclaimed += nr_pages instead?
Oh, good catch. I also need to go double-check that 'nr_pages' isn't
wrong elsewhere because of the split.