linux-kernel - Re: [RFC 2/3] mm, compaction: introduce kcompactd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <56A8BEE3.3010708@suse.cz>
Date:	Wed, 27 Jan 2016 13:58:11 +0100
From:	Vlastimil Babka <vbabka@...e.cz>
To:	Mel Gorman <mgorman@...hsingularity.net>
Cc:	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Rik van Riel <riel@...hat.com>,
	Joonsoo Kim <iamjoonsoo.kim@....com>,
	David Rientjes <rientjes@...gle.com>,
	Michal Hocko <mhocko@...e.com>,
	Johannes Weiner <hannes@...xchg.org>
Subject: Re: [RFC 2/3] mm, compaction: introduce kcompactd

On 01/27/2016 10:10 AM, Mel Gorman wrote:
> On Tue, Jan 26, 2016 at 04:36:14PM +0100, Vlastimil Babka wrote:
>> Memory compaction can be currently performed in several contexts:
>> 
>> - kswapd balancing a zone after a high-order allocation failure
>> - direct compaction to satisfy a high-order allocation, including THP page
>>   fault attemps
>> - khugepaged trying to collapse a hugepage
>> - manually from /proc
>> 
>> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
>> (pending or future) high-order allocation, and is easy to evaluate. The other
>> purpose is to keep overal memory fragmentation low and help the
>> anti-fragmentation mechanism. The success wrt the latter purpose is more
>> difficult to evaluate though.
>> 
>> The current situation wrt the purposes has a few drawbacks:
>> 
>> - compaction is invoked only when a high-order page or hugepage is not
>>   available (or manually). This might be too late for the purposes of keeping
>>   memory fragmentation low.
>> - direct compaction increases latency of allocations. Again, it would be
>>   better if compaction was performed asynchronously to keep fragmentation low,
>>   before the allocation itself comes.
>> - (a special case of the previous) the cost of compaction during THP page
>>   faults can easily offset the benefits of THP.
>> - kswapd compaction appears to be complex, fragile and not working in some
>>   scenarios
>> 
> 
> An addendum to that is that kswapd can be compacting for a high-order
> allocation request when it should be reclaiming memory for an order-0
> request.

Right, thanks.

> My recollection is that kswapd compacting was meant to help atomic
> high-order allocations but I wonder if the same problem even exists with
> the revised watermark handling.

Well, certainly nobody noticed kswapd compaction being dysfunctional.

> 
>> - the target order used for kswapd is passed to kcompactd
>> 
>> The kswapd compact/reclaim loop for high-order pages will be removed in the
>> next patch with the description of what's wrong with it.
>> 
>> In this patch, kcompactd uses the standard compaction_suitable() and
>> compact_finished() criteria, which means it will most likely have nothing left
>> to do after kswapd finishes, until the next patch. Kcompactd also mimics
>> direct compaction somewhat by trying async compaction first and sync compaction
>> afterwards, and uses the deferred compaction functionality.
>> 
> 
> Why should it try async compaction first? The deferred compaction makes
> sense as kcompact will need some sort of limitation on the amount of
> CPU it can use.

I was just being conservative, but good point. Unlike direct compaction, latency
doesn't bother kcompactd.

> 
>> @@ -1759,4 +1763,227 @@ void compaction_unregister_node(struct node *node)
>>  }
>>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>>  
>> +static bool kcompactd_work_requested(pg_data_t *pgdat)
>> +{
>> +	return pgdat->kcompactd_max_order > 0;
>> +}
>> +
> 
> inline
> 
>> +static bool kcompactd_node_suitable(pg_data_t *pgdat)
>> +{
>> +	int zoneid;
>> +	struct zone *zone;
>> +
>> +	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
>> +		zone = &pgdat->node_zones[zoneid];
>> +
>> +		if (!populated_zone(zone))
>> +			continue;
>> +
>> +		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
>> +					pgdat->kcompactd_classzone_idx)
>> +							== COMPACT_CONTINUE)
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
> 
> Why does this traverse all zones and not just the ones within the
> classzone_idx?

Hmm, guess I didn't revisit it after previous submission where kswapd compaction
wasn't being replaced by kcompactd. But kswapd also tries to balance higher
zones than those given by classzone_idx, if they needed. I'll rethink this.

>> +static void kcompactd_do_work(pg_data_t *pgdat)
>> +{
>> +	/*
>> +	 * With no special task, compact all zones so that a page of requested
>> +	 * order is allocatable.
>> +	 */
>> +	int zoneid;
>> +	struct zone *zone;
>> +	struct compact_control cc = {
>> +		.order = pgdat->kcompactd_max_order,
>> +		.classzone_idx = pgdat->kcompactd_classzone_idx,
>> +		.mode = MIGRATE_ASYNC,
>> +		.ignore_skip_hint = true,
>> +
>> +	};
>> +	bool success = false;
>> +
>> +	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
>> +							cc.classzone_idx);
>> +	count_vm_event(KCOMPACTD_WAKE);
>> +
>> +retry:
>> +	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> 
> Again, why is classzone_idx not taken into account?

Might be worth to just do everything once we've woken up, like kswapd.
Deferred+suitable checks should prevent wasted attempts in either case.

[...]

>> +
>> +	if (!success && cc.mode == MIGRATE_ASYNC) {
>> +		cc.mode = MIGRATE_SYNC_LIGHT;
>> +		goto retry;
>> +	}
>> +
> 
> Still not getting why kcompactd should concern itself with async
> compaction. It's really direct compaction that cared and was trying to
> avoid stalls.

Right

>> +	 * Regardless of success, we are done until woken up next. But remember
>> +	 * the requested order/classzone_idx in case it was higher/tighter than
>> +	 * our current ones
>> +	 */
>> +	if (pgdat->kcompactd_max_order <= cc.order)
>> +		pgdat->kcompactd_max_order = 0;
>> +	if (pgdat->classzone_idx >= cc.classzone_idx)
>> +		pgdat->classzone_idx = pgdat->nr_zones - 1;
>> +}
>> +
>>
>> <SNIP>
>>
>> @@ -1042,7 +1043,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>>  	arg.nr_pages = nr_pages;
>>  	node_states_check_changes_online(nr_pages, zone, &arg);
>>  
>> -	nid = pfn_to_nid(pfn);
>> +	nid = zone_to_nid(zone);
>>  
>>  	ret = memory_notify(MEM_GOING_ONLINE, &arg);
>>  	ret = notifier_to_errno(ret);
>> @@ -1082,7 +1083,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>>  
>>  	if (onlined_pages) {
>> -		node_states_set_node(zone_to_nid(zone), &arg);
>> +		node_states_set_node(nid, &arg);
>>  		if (need_zonelists_rebuild)
>>  			build_all_zonelists(NULL, NULL);
>>  		else
> 
> Why are these two hunks necessary?

Just a drive-by cleanup/optimization that didn't seem worth separate patch. But
I probably should?

> 
>> @@ -1093,8 +1094,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>>  
>>  	init_per_zone_wmark_min();
>>  
>> -	if (onlined_pages)
>> -		kswapd_run(zone_to_nid(zone));
>> +	if (onlined_pages) {
>> +		kswapd_run(nid);
>> +		kcompactd_run(nid);
>> +	}
>>  
>>  	vm_total_pages = nr_free_pagecache_pages();
>>  
>> @@ -1858,8 +1861,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
>>  		zone_pcp_update(zone);
>>  
>>  	node_states_clear_node(node, &arg);
>> -	if (arg.status_change_nid >= 0)
>> +	if (arg.status_change_nid >= 0) {
>>  		kswapd_stop(node);
>> +		kcompactd_stop(node);
>> +	}
>>  
>>  	vm_total_pages = nr_free_pagecache_pages();
>>  	writeback_set_ratelimit();
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 63358d9f9aa9..7747eb36e789 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5212,6 +5212,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>>  #endif
>>  	init_waitqueue_head(&pgdat->kswapd_wait);
>>  	init_waitqueue_head(&pgdat->pfmemalloc_wait);
>> +#ifdef CONFIG_COMPACTION
>> +	init_waitqueue_head(&pgdat->kcompactd_wait);
>> +#endif
>>  	pgdat_page_ext_init(pgdat);
>>  
>>  	for (j = 0; j < MAX_NR_ZONES; j++) {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 72d52d3aef74..1449e21c55cc 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3408,6 +3408,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>>  		 */
>>  		reset_isolation_suitable(pgdat);
>>  
>> +		/*
>> +		 * We have freed the memory, now we should compact it to make
>> +		 * allocation of the requested order possible.
>> +		 */
>> +		wakeup_kcompactd(pgdat, order, classzone_idx);
>> +
>>  		if (!kthread_should_stop())
>>  			schedule();
>>  
> 
> This initially confused me but it's due to patch ordering. It's silly
> but when this patch is applied then both kswapd and kcompactd are
> compacting memory. I would prefer if the patches were in reverse order
> but that is purely taste.

In reverse order there would be a case where neither is compacting. Guess I'll
just move the wakeup to the next patch. The separation is mainly for making
review more tractable.

> While this was not a comprehensive review, I think the patch is ok in
> principal. While deferred compaction will keep the CPU usage under control,
> the main concern is that kcompactd consumes too much CPU but I do not
> see a case where that would trigger that kswapd would not have
> encountered already.

Thanks! On the opposite, kswapd didn't consider deferred compaction, so it could
consume too much CPU if it wasn't otherwise broken.