linux-kernel - Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150731082221.GD5840@techsingularity.net>
Date:	Fri, 31 Jul 2015 09:22:21 +0100
From:	Mel Gorman <mgorman@...hsingularity.net>
To:	Vlastimil Babka <vbabka@...e.cz>
Cc:	Joonsoo Kim <iamjoonsoo.kim@....com>,
	Linux-MM <linux-mm@...ck.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Rik van Riel <riel@...hat.com>,
	Pintu Kumar <pintu.k@...sung.com>,
	Xishi Qiu <qiuxishi@...wei.com>, Gioh Kim <gioh.kim@....com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order
 atomic allocations on demand

On Fri, Jul 31, 2015 at 09:25:13AM +0200, Vlastimil Babka wrote:
> On 07/31/2015 09:11 AM, Mel Gorman wrote:
> >On Fri, Jul 31, 2015 at 02:54:07PM +0900, Joonsoo Kim wrote:
> >>Hello, Mel.
> >>
> >>On Mon, Jul 20, 2015 at 09:00:18AM +0100, Mel Gorman wrote:
> >>>From: Mel Gorman <mgorman@...e.de>
> >>>
> >>>High-order watermark checking exists for two reasons --  kswapd high-order
> >>>awareness and protection for high-order atomic requests. Historically we
> >>>depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> >>>pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> >>>that reserves pageblocks for high-order atomic allocations. This is expected
> >>>to be more reliable than MIGRATE_RESERVE was.
> >>
> >>I have some concerns on this patch.
> >>
> >>1) This patch breaks intention of __GFP_WAIT.
> >>__GFP_WAIT is used when we want to succeed allocation even if we need
> >>to do some reclaim/compaction work. That implies importance of
> >>allocation success. But, reserved pageblock for MIGRATE_HIGHATOMIC makes
> >>atomic allocation (~__GFP_WAIT) more successful than allocation with
> >>__GFP_WAIT in many situation. It breaks basic assumption of gfp flags
> >>and doesn't make any sense.
> >>
> >
> >Currently allocation requests that do not specify __GFP_WAIT get the
> >ALLOC_HARDER flag which allows them to dip further into watermark reserves.
> >It already is the case that there are corner cases where a high atomic
> >allocation can succeed when a non-atomic allocation would reclaim.
> 
> I think (and said so before elsewhere) is that the problem is that we don't
> currently distinguish allocations that can't wait (=are really atomic and
> have no order-0 fallback) and allocations that just don't want to wait
> (=they have fallbacks). The second ones should obviously not access the
> current ALLOC_HARDER watermark-based reserves nor the proposed highatomic
> reserves.
> 

It's a separate issue though.  There are a number of cases

1. can't wait because a spinlock is held or in interrupt
2. does not want to wait because a fallback option is available
3. does not want to wait or wake kswapd because a fallback option is available
4. should not fail as it would cause major difficulties
5. cannot fail because it's a functional failure

5 is never meant to occur might be the situation on embedded platforms.
If this is the case then they should consider modifying MIGRATE_HIGHATOMIC
in this patch series but I don't think it belongs in mainline as it has
other consequences. 1-4 are a separate series.

Right now, this series does not drastically alter the concept that in
some cases atomic allocations will succeed without delay when a !atomic
allocation would have to reclaim.

There is certainly value to ironing out 1-4 on top and teaching SLUB,
THP and networking the distinction.

> Well we do look at __GFP_NO_KSWAPD flag to treat allocation as non-atomic,
> so that covers THP allocations and two drivers. But the recent networking
> commit fb05e7a89f50 didn't add the flag and nor does Joonsoo's slub patch
> use it. Either we should rename the flag and employ it where appropriate, or
> agree that access to reserves is orthogonal concern to waking up kswapd, and
> distinguish non-atomic non-__GFP_WAIT allocations differently.
> 

Separate problem with a separate series. This one is about removing
the zonelist cache due to complexity and removing an odd anomaly where
allocations can fail due to how watermarks are calculated.

> >>>A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
> >>>a pageblock but limits the total number to 10% of the zone.
> >>
> >>When steals happens, pageblock already can be fragmented and we can't
> >>fully utilize this pageblock without allowing order-0 allocation. This
> >>is very waste.
> >>
> >
> >If the pageblock was stolen, it implies there was at least 1 usable page
> >of the correct order. As the pageblock is then reserved, any pages that
> >free in that block stay free for use by high-order atomic allocations.
> >Else, the number of pageblocks will increase again until the 10% limit
> >is hit.
> 
> It's however true that many of the "any pages free in that block" may be
> order-0, so they both won't be useful to high-order atomic allocations, and
> won't be available to other allocations, so they might remain unused.

I typoed lightly and missed a letter but the same outcome applies when
slightly corrected -- any pages that are *freed* in that block stay free
if it merges with buddies for use by high-order atomic allocations.
 or else, the number of pageblocks will increase again until the 10%
limit is hit.

If the limit is hit and we are still failing then it's no different to
what can happen today except it took a lot longer and was a lot harder to
trigger. As the changelog pointed out, with this approach the allocation
failure rate was massively reduced but not eliminated.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/