Message-ID: <20170502040129.GA27335@js1304-desktop>
Date:   Tue, 2 May 2017 13:01:32 +0900
From:   Joonsoo Kim <js1304@...il.com>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Rik van Riel <riel@...hat.com>,
        Johannes Weiner <hannes@...xchg.org>,
        mgorman@...hsingularity.net, Laura Abbott <lauraa@...eaurora.org>,
        Minchan Kim <minchan@...nel.org>,
        Marek Szyprowski <m.szyprowski@...sung.com>,
        Michal Nazarewicz <mina86@...a86.com>,
        "Aneesh Kumar K . V" <aneesh.kumar@...ux.vnet.ibm.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Russell King <linux@...linux.org.uk>,
        Will Deacon <will.deacon@....com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, kernel-team@....com
Subject: Re: [PATCH v7 0/7] Introduce ZONE_CMA

On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
> On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:
> > On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
> > > On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> > > > On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > > > > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> [...]
> > > not for free. For most common configurations where we have ZONE_DMA,
> > > ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
> > > consumed so a new zone will need a new one AFAICS.
> > 
> > Yes, it requires one more bit for a new zone and it's handled by the patch.
> 
> I am pretty sure that you are aware that consuming new page flag bits
> is usually a no-go and something we try to avoid as much as possible
> because we are in a great shortage there. So there really have to be a
> _strong_ reason if we go that way. My current understanding that the
> whole zone concept is more about a more convenient implementation rather
> than a fundamental change which will solve unsolvable problems with the
> current approach. More on that below.

If there is a consensus that adding a new zone and one more page flag
bit is unreasonable, I will try to find a way to use ZONE_MOVABLE
instead. As mentioned before, that's not a fundamental issue for me.
However, it has many potential problems, as I mentioned, so I *really*
don't prefer that way.

> 
> [...]
> > MOVABLE allocation will fallback as following sequence.
> > 
> > ZONE_CMA -> ZONE_MOVABLE -> ZONE_HIGHMEM -> ZONE_NORMAL -> ...
> > 
> > I don't understand what you mean CMA allocation. In MM's context,
> > there is no CMA allocation. That is just MOVABLE allocation.
> > 
> > For device's context, there is CMA allocation. It is range specific
> > allocation so it should be succeed for requested range. No fallback is
> > allowed in this case.
> 
> OK. that answers my question. I guess... My main confusion comes from
> __alloc_gigantic_page which shares alloc_contig_range with the cma
> allocation. But from what you wrote above and my quick glance over the
> code __alloc_gigantic_page simply changes the migrate type of the pfn
> range and it doesn't move it to the zone CMA. Right?

Yes, correct. It only changes the migrate type of the pfn range; it
doesn't move it to the zone CMA.
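
For reference, a rough sketch of that path (from memory of the current
tree, so the exact signature and flags may differ): the gigantic page
path only isolates and migrates a pfn range in place, and the pages
never change zone.

static int __alloc_gigantic_page(unsigned long start_pfn,
				 unsigned long nr_pages)
{
	unsigned long end_pfn = start_pfn + nr_pages;

	/*
	 * Temporarily isolate the pageblocks in the range, migrate
	 * everything out and hand the now-free range back to the
	 * caller.  The zone of the pages is never changed.
	 */
	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
				  GFP_KERNEL);
}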

> 
> [...]
> > > > At a glance, special migratetype sound natural. I also did. However,
> > > > it's not natural in implementation POV. Zone consists of the same type
> > > > of memory (by definition ?) and MM subsystem is implemented with that
> > > > assumption. If difference type of memory shares the same zone, it easily
> > > > causes the problem and CMA problems are the such case.
> > > 
> > > But this is not any different from the highmem vs. lowmem problems we
> > > already have, no? I have looked at your example in the cover where you
> > > mention utilization and the reclaim problems. With the node reclaim we
> > > will have pages from all zones on the same LRU(s). isolate_lru_pages
> > > will skip those from ZONE_CMA because their zone_idx is higher than
> > > gfp_idx(GFP_KERNEL). The same could be achieved by an explicit check for
> > > the pageblock migrate type. So the zone doesn't really help much. Or is
> > > there some aspect that I am missing?
> > 
> > Your understanding is correct. It can archieved by an explict check
> > for migratetype. And, this is the main reason that we should avoid
> > such approach.
> > 
> > With ZONE approach, all these things are done naturally. We don't need
> > any explicit check to anywhere. We already have a code to skip to
> > reclaim such pages by checking zone_idx.
> 
> Yes, and as we have to filter pages anyway doing so for cma blocks
> doesn't sound overly burdensome from the maintenance point of view.
>  
> > However, with MIGRATETYPE approach, all these things *cannot* be done
> > naturally. We need extra checks to all the places (allocator fast
> > path, reclaim path, compaction, etc...). It is really error-prone and
> > it already causes many problems due to this aspect. For the
> > performance wise, this approach is also bad since it requires to check
> > migratetype for each pages.
> > 
> > Moreover, even if we adds extra checks, things cannot be easily
> > perfect.
> 
> I see this point and I agree that using a specific zone might be a
> _nicer_ solution in the end but you have to consider another aspects as
> well. The main one I am worried about is a long term maintainability.
> We are really out of page flags and consuming one for a rather specific
> usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> of it and so we have that memory laying around unused all the time
> and blocking one page flag bit. CMA falls into a similar category
> AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> allocations in few years, yet we will have to fight to get rid of it
> like we do with ZONE_DMA. And not only that. We will also have to fight
> finding page flags for other more general usecases in the meantime.

This maintenance problem is inherent; it exists even with the
MIGRATETYPE approach. If future HW no longer needs CMA allocation in a
few years, we still cannot remove the many hooks for CMA. The only
difference is that one approach consumes a single page flag bit and
affects only CMA users, while the other scatters many hooks that we
need to take care of all the time.

> 
> > See 3) Atomic allocation failure problem. It's inherent
> > problem if we have different types of memory in a single zone.
> > We possibly can make things perfect even with MIGRATETYPE approach,
> > however, it requires additional checks in hotpath than current. It's
> > expensive and undesirable. It will make future maintenance of MM code
> > much difficult.
> 
> I believe that the overhead in the hot path is not such a big deal. We
> have means to make it 0 when CMA is not used by jumplabels. I assume
> that the vast majority of systems will not use CMA. Those systems which
> use CMA should be able to cope with some slight overhead IMHO.

Please don't underestimate the number of CMA users. Most Android
devices use CMA, so there may well be more devices using CMA than
servers not using it. They also have a right to experience the best
performance.

> 
> I agree that the code maintenance cost is not free. And that is a valid
> concern. CMA maintenance will not be for free in either case, though (if
> for nothing else the page flags space mentioned above). Let's see what
> what this means for mm/page_alloc.c
> 
>  mm/page_alloc.c | 220 ++++++++++++++++++++++++++++----------------------------
>  1 file changed, 109 insertions(+), 111 deletions(-)
> 
> Not very convincing at first glance but this can be quite misleading as
> you have already mentioned because you have moved a lot of code to to
> init path. So let's just focus on the allocator hot paths
> 
> @@ -800,7 +805,7 @@ static inline void __free_one_page(struct page *page,
>  
>  	VM_BUG_ON(migratetype == -1);
>  	if (likely(!is_migrate_isolate(migratetype)))
> -		__mod_zone_freepage_state(zone, 1 << order, migratetype);
> +		__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
>  
>  	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
> @@ -1804,25 +1831,11 @@ static int fallbacks[MIGRATE_TYPES][4] = {
>  	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
>  	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
>  	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
> -#ifdef CONFIG_CMA
> -	[MIGRATE_CMA]         = { MIGRATE_TYPES }, /* Never used */
> -#endif
>  #ifdef CONFIG_MEMORY_ISOLATION
>  	[MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
>  #endif
>  };
>  
> -#ifdef CONFIG_CMA
> -static struct page *__rmqueue_cma_fallback(struct zone *zone,
> -					unsigned int order)
> -{
> -	return __rmqueue_smallest(zone, order, MIGRATE_CMA);
> -}
> -#else
> -static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
> -					unsigned int order) { return NULL; }
> -#endif
> -
>  /*
>   * Move the free pages in a range to the free lists of the requested type.
>   * Note that start_page and end_pages are not aligned on a pageblock
> @@ -2090,8 +2103,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
>  
>  	/* Yoink! */
>  	mt = get_pageblock_migratetype(page);
> -	if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)
> -	    && !is_migrate_cma(mt)) {
> +	if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)) {
>  		zone->nr_reserved_highatomic += pageblock_nr_pages;
>  		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
>  		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
> @@ -2235,13 +2247,8 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
>  
>  retry:
>  	page = __rmqueue_smallest(zone, order, migratetype);
> -	if (unlikely(!page)) {
> -		if (migratetype == MIGRATE_MOVABLE)
> -			page = __rmqueue_cma_fallback(zone, order);
> -
> -		if (!page && __rmqueue_fallback(zone, order, migratetype))
> -			goto retry;
> -	}
> +	if (unlikely(!page) && __rmqueue_fallback(zone, order, migratetype))
> +		goto retry;
>  
>  	trace_mm_page_alloc_zone_locked(page, order, migratetype);
>  	return page;
> @@ -2283,9 +2290,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			list_add_tail(&page->lru, list);
>  		list = &page->lru;
>  		alloced++;
> -		if (is_migrate_cma(get_pcppage_migratetype(page)))
> -			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> -					      -(1 << order));
>  	}
>  
>  	/*
> @@ -2636,10 +2640,10 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  		 * exists.
>  		 */
>  		watermark = min_wmark_pages(zone) + (1UL << order);
> -		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
> +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
>  			return 0;
>  
> -		__mod_zone_freepage_state(zone, -(1UL << order), mt);
> +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
>  	}
>  
>  	/* Remove page from free list */
> @@ -2655,8 +2659,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  		struct page *endpage = page + (1 << order) - 1;
>  		for (; page < endpage; page += pageblock_nr_pages) {
>  			int mt = get_pageblock_migratetype(page);
> -			if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
> -			    && !is_migrate_highatomic(mt))
> +			if (!is_migrate_isolate(mt) &&
> +				!is_migrate_highatomic(mt))
>  				set_pageblock_migratetype(page,
>  							  MIGRATE_MOVABLE);
>  		}
> @@ -2783,8 +2787,7 @@ struct page *rmqueue(struct zone *preferred_zone,
>  	spin_unlock(&zone->lock);
>  	if (!page)
>  		goto failed;
> -	__mod_zone_freepage_state(zone, -(1 << order),
> -				  get_pcppage_migratetype(page));
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
>  
>  	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
>  	zone_statistics(preferred_zone, zone);
> @@ -2907,12 +2910,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  	else
>  		min -= min / 4;
>  
> -#ifdef CONFIG_CMA
> -	/* If allocation can't use CMA areas don't use free CMA pages */
> -	if (!(alloc_flags & ALLOC_CMA))
> -		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> -#endif
> -
>  	/*
>  	 * Check watermarks for an order-0 allocation request. If these
>  	 * are not met, then a high-order request also cannot go ahead
> @@ -2940,13 +2937,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  			if (!list_empty(&area->free_list[mt]))
>  				return true;
>  		}
> -
> -#ifdef CONFIG_CMA
> -		if ((alloc_flags & ALLOC_CMA) &&
> -		    !list_empty(&area->free_list[MIGRATE_CMA])) {
> -			return true;
> -		}
> -#endif
>  	}
>  	return false;
>  }
> @@ -2962,13 +2952,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  		unsigned long mark, int classzone_idx, unsigned int alloc_flags)
>  {
>  	long free_pages = zone_page_state(z, NR_FREE_PAGES);
> -	long cma_pages = 0;
> -
> -#ifdef CONFIG_CMA
> -	/* If allocation can't use CMA areas don't use free CMA pages */
> -	if (!(alloc_flags & ALLOC_CMA))
> -		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
> -#endif
>  
>  	/*
>  	 * Fast check for order-0 only. If this fails then the reserves
> @@ -2977,7 +2960,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  	 * the caller is !atomic then it'll uselessly search the free
>  	 * list. That corner case is then slower but it is harmless.
>  	 */
> -	if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
> +	if (!order && free_pages > mark + z->lowmem_reserve[classzone_idx])
>  		return true;
>  
>  	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> @@ -3547,10 +3530,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	} else if (unlikely(rt_task(current)) && !in_interrupt())
>  		alloc_flags |= ALLOC_HARDER;
>  
> -#ifdef CONFIG_CMA
> -	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> -		alloc_flags |= ALLOC_CMA;
> -#endif
>  	return alloc_flags;
>  }
>  
> @@ -3972,9 +3951,6 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>  	if (should_fail_alloc_page(gfp_mask, order))
>  		return false;
>  
> -	if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
> -		*alloc_flags |= ALLOC_CMA;
> -
>  	return true;
>  }
>  
> This looks like a nice clean up. Those ifdefs are ugly as hell. One
> could argue that some of that could be cleaned up by simply adding some
> helpers (with a jump label to reduce the overhead), though. But is this
> really strong enough reason to bring the whole zone in? I am not really
> convinced to be honest.

Please don't underestimate the benefit of this patchset. As I have
said, the MIGRATETYPE approach needs even *more* hooks to fix all the
remaining problems.

And please consider that this removal is not only code removal but
also concept removal. With it, we no longer need to consider ALLOC_CMA
in alloc_flags when calling zone_watermark_ok(); there have been many
bugs there and some still remain (see the sketch below). We no longer
need to check the pageblock migratetype whenever we handle a
pageblock, and we no longer need to take great care calculating the
number of free CMA pages.
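
To illustrate the kind of coupling that goes away (paraphrased from
the hunks above, not new code): every watermark check has to subtract
the free CMA pages unless ALLOC_CMA is set, and every movable call
site has to remember to set ALLOC_CMA, otherwise free memory is
silently under-counted for movable allocations:

/* in __zone_watermark_ok(), before this series */
#ifdef CONFIG_CMA
	/* If allocation can't use CMA areas don't use free CMA pages */
	if (!(alloc_flags & ALLOC_CMA))
		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

/* ...which is only correct if every movable call site also did: */
	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
		alloc_flags |= ALLOC_CMA;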

>  
> [...]
> 
> > > Please do _not_ take this as a NAK from me. At least not at this time. I
> > > am still trying to understand all the consequences but my intuition
> > > tells me that building on top of highmem like approach will turn out to
> > > be problematic in future (as we have already seen with the highmem and
> > > movable zones) so this needs a very prudent consideration.
> > 
> > I can understand that you are prudent to this issue. However, it takes more
> > than two years and many people already expressed that ZONE approach is the
> > way to go.
> 
> I can see a single Acked-by and one Reviewed-by. It would be much more
> convincing to see much larger support. Do not take me wrong I am not
> trying to undermine the feedback so far but we should be clear about one
> thing. CMA is mostly motivated by the industry which tries to overcome
> HW limitations which can change in future very easily. I would rather
> see good enough solution for something like that than a nicer solution
> which is pushing additional burden on more general usecases.

First of all, the current MIGRATETYPE approach isn't good enough to
me. It has caused too many problems, many problems remain, and it will
be a maintenance issue for a long time.

And, although a good enough solution can be better than a nicer one in
some cases, the current situation doesn't look like such a case. There
is a single reason, saving a page flag bit, to favor the good enough
solution.

I'd like to ask the question in reverse. Is that reason enough to make
CMA users suffer from bugs? Is it reason enough to maintain such
messy, hard-to-maintain code, and to hurt other MM developers'
productivity? Great care is needed when handling the pageblock
migratetype, accounting free pages and reclaiming the allocated pages,
and that affects many developers.
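
As a concrete example of the freepage accounting care: every place
that adjusts NR_FREE_PAGES has to go through a CMA-aware helper,
roughly the following (quoting include/linux/vmstat.h from memory),
and any caller that bypasses it skews NR_FREE_CMA_PAGES:

static inline void __mod_zone_freepage_state(struct zone *zone,
					     int nr_pages, int migratetype)
{
	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
	if (is_migrate_cma(migratetype))
		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}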

> 
> That being said, I would like to see a much larger consensus in the MM
> community before a new zone is merged. I am staying very skeptical this
> is the right direction though.

As you know, CMA is actively used in industry, but those users don't
review patches on the public mailing list. Moreover, core MM
developers show little interest in CMA. So I think that one Acked-by
and one Reviewed-by already represents substantial agreement; those
reviewers and I are most of the developers in the MM community with an
interest in CMA.

Anyway, AFAIK, the following people are in favor of this patchset.
If I am wrong, please let me know.

Minchan Kim <minchan@...nel.org>
Laura Abbott <lauraa@...eaurora.org>
Bob Liu <liubo95@...wei.com>

As you know (?), Minchan and Laura are also core developers on the CMA
topic.

Thanks.
