Message-ID: <20150821203942.GF12432@techsingularity.net>
Date: Fri, 21 Aug 2015 21:39:42 +0100
From: Mel Gorman <mgorman@...hsingularity.net>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Linux-MM <linux-mm@...ck.org>,
Johannes Weiner <hannes@...xchg.org>,
Rik van Riel <riel@...hat.com>,
David Rientjes <rientjes@...gle.com>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Michal Hocko <mhocko@...nel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 06/10] mm: page_alloc: Distinguish between being unable
to sleep, unwilling to sleep and avoiding waking kswapd
On Fri, Aug 21, 2015 at 03:42:21PM +0200, Vlastimil Babka wrote:
> On 08/12/2015 12:45 PM, Mel Gorman wrote:
> >__GFP_WAIT has been used to identify atomic context in callers that hold
> >spinlocks or are in interrupts. They are expected to be high priority and
> >have access to one of two watermarks lower than "min". __GFP_HIGH users get
> >access to the first lower watermark and can be called the "high priority
> >reserve". Atomic users and interrupts access yet another lower watermark
> >that can be called the "atomic reserve".
> >
> >Over time, callers had a requirement to not block when fallback options
> >were available. Some have abused __GFP_WAIT leading to a situation where
> >an optimistic allocation with a fallback option can access atomic reserves.
> >
> >This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> >cannot sleep and have no alternative. High priority users continue to use
> >__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> >willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> >that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> >as a caller that is willing to enter direct reclaim and wake kswapd for
> >background reclaim.
> >
> >This patch then converts a number of sites
> >
> >o __GFP_ATOMIC is used by callers that are high priority and have memory
> > pools for those requests. GFP_ATOMIC uses this flag. Callers with
> > interrupts disabled still automatically use the atomic reserves.
>
> Hm I can't see where the latter happens? In gfp_to_alloc_flags(),
> ALLOC_HARDER is set for __GFP_ATOMIC, or rt-tasks *not* in interrupt? What
> am I missing?
>
It was a mistake from an earlier version of the patch that was itself
buggy. I forgot to fix the changelog properly.
> >o Callers that have a limited mempool to guarantee forward progress use
> > __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
> > kswapd will still be woken but atomic reserves are not used as there
> > is a one-entry mempool to guarantee progress.
> >
> >o Callers that are checking if they are non-blocking should use the
> > helper gfpflags_allows_blocking() where possible. This is because
>
> A bit subjective but gfpflags_allow_blocking() sounds better to me.
> Or shorter gfp_allows_blocking()?
>
I'll use gfpflags_allow_blocking.
> > checking for __GFP_WAIT as was done historically can now trigger false
> > positives. Some exceptions like dm-crypt.c exist where the code intent
> > is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
> > flag manipulations.
> >
> >The key hazard to watch out for is callers that removed __GFP_WAIT and
> >were depending on access to atomic reserves for inconspicuous reasons.
> >In some cases it may be appropriate for them to use __GFP_HIGH.
>
> Hm we might also have a (non-fatal) hazard of callers that directly combined
> __GFP_* flags that didn't include __GFP_WAIT, but did wake up kswapd, and
> now might be missing __GFP_KSWAPD_RECLAIM. Did you try checking for those? I
> imagine it's not a simple task...
>
I hadn't searched but there are a small number of callers that
potentially care. I fixed them.
> >Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
>
> >
> >diff --git a/Documentation/vm/balance b/Documentation/vm/balance
> >index c46e68cf9344..6f1f6fae30f5 100644
> >--- a/Documentation/vm/balance
> >+++ b/Documentation/vm/balance
> >@@ -1,12 +1,14 @@
> > Started Jan 2000 by Kanoj Sarcar <kanoj@....com>
> >
> >-Memory balancing is needed for non __GFP_WAIT as well as for non
> >-__GFP_IO allocations.
> >+Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> >+well as for non __GFP_IO allocations.
> >
> >-There are two reasons to be requesting non __GFP_WAIT allocations:
> >-the caller can not sleep (typically intr context), or does not want
> >-to incur cost overheads of page stealing and possible swap io for
> >-whatever reasons.
> >+The first reason why a caller may avoid reclaim is that the caller can not
> >+sleep due to holding a spinlock or is in interrupt context. The second may
> >+be that the caller is willing to fail the allocation without incurring the
> >+overhead of page stealing. This may happen for opportunistic high-order
>
> I think "page stealing" has nowadays a different meaning in the
> anti-fragmentation context? Should it just say "reclaim"?
>
Good point, corrected.
> >+allocation requests that have order-0 fallback options. In such cases,
> >+the caller may also wish to avoid waking kswapd.
> >
> > __GFP_IO allocation requests are made to prevent file system deadlocks.
> >
> >diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> >index cba12f34ff77..100d3fbaebae 100644
> >--- a/arch/arm/mm/dma-mapping.c
> >+++ b/arch/arm/mm/dma-mapping.c
> >@@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
> >
> > if (is_coherent || nommu())
> > addr = __alloc_simple_buffer(dev, size, gfp, &page);
> >- else if (!(gfp & __GFP_WAIT))
> >+ else if (gfp & __GFP_ATOMIC)
> > addr = __alloc_from_pool(size, &page);
> > else if (!dev_get_cma_area(dev))
> > addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
> >@@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
> > *handle = DMA_ERROR_CODE;
> > size = PAGE_ALIGN(size);
> >
> >- if (!(gfp & __GFP_WAIT))
> >+ if (gfp & __GFP_ATOMIC)
> > return __iommu_alloc_atomic(dev, size, handle);
> >
> > /*
> >diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> >index d16a1cead23f..713d963fb96b 100644
> >--- a/arch/arm64/mm/dma-mapping.c
> >+++ b/arch/arm64/mm/dma-mapping.c
> >@@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
> > if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> > dev->coherent_dma_mask <= DMA_BIT_MASK(32))
> > flags |= GFP_DMA;
> >- if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
> >+ if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_DIRECT_RECLAIM)) {
> > struct page *page;
> > void *addr;
> >
> >@@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
> >
> > size = PAGE_ALIGN(size);
> >
> >- if (!coherent && !(flags & __GFP_WAIT)) {
> >+ if (!coherent && (flags & __GFP_ATOMIC)) {
> > struct page *page = NULL;
> > void *addr = __alloc_from_pool(size, &page, flags);
> >
>
> Hmm these change the lack of __GFP_WAIT to expect __GFP_ATOMIC, so it's
> potentially one of those "key hazards" mentioned in the changelog, right?
> But here it's not just about using atomic reserves, but using a completely
> different allocation function.
> E.g. in case of arch/arm/mm/dma-mapping.c:__dma_alloc() I see it can go to
> __alloc_remap_buffer -> __dma_alloc_remap -> dma_common_contiguous_remap
> which does kmalloc(..., GFP_KERNEL) and has comment "Cannot be used in
> non-sleeping contexts".
>
I completely missed that. It needs to be a check for
__GFP_DIRECT_RECLAIM and similar in the other arm file.
> So I think callers that cannot sleep and did clear __GFP_WAIT before, are
> now dangerous unless they set __GFP_ATOMIC?
>
> >diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
> >index 2a3973a7c441..dc611c8cad10 100644
> >--- a/drivers/firewire/core-cdev.c
> >+++ b/drivers/firewire/core-cdev.c
> >@@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
> > static int add_client_resource(struct client *client,
> > struct client_resource *resource, gfp_t gfp_mask)
> > {
> >- bool preload = !!(gfp_mask & __GFP_WAIT);
> >+ bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> Use the helper here to avoid !! as a bonus?
>
Done.
> >--- a/drivers/infiniband/core/sa_query.c
> >+++ b/drivers/infiniband/core/sa_query.c
> >@@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
> >
> > static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
> > {
> >- bool preload = !!(gfp_mask & __GFP_WAIT);
> >+ bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
> > unsigned long flags;
> > int ret, id;
> >
>
> Same here.
>
Done.
> >diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
> >index d51687780b61..06badad3ab75 100644
> >--- a/drivers/usb/host/u132-hcd.c
> >+++ b/drivers/usb/host/u132-hcd.c
> >@@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
> > {
> > struct u132 *u132 = hcd_to_u132(hcd);
> > if (irqs_disabled()) {
> >- if (__GFP_WAIT & mem_flags) {
> >+ if (__GFP_DIRECT_RECLAIM & mem_flags) {
> > printk(KERN_ERR "invalid context for function that migh"
> > "t sleep\n");
> > return -EINVAL;
>
> And here - no other flag manipulations and it would match the printk.
Fixed
> >--- a/fs/btrfs/extent_io.c
> >+++ b/fs/btrfs/extent_io.c
> >@@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
> > clear = 1;
> > again:
> >- if (!prealloc && (mask & __GFP_WAIT)) {
> >+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> > /*
> > * Don't care for allocation failure here because we might end
> > * up not needing the pre-allocated extent state at all, which
> >@@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> >
> > bits |= EXTENT_FIRST_DELALLOC;
> > again:
> >- if (!prealloc && (mask & __GFP_WAIT)) {
> >+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> > prealloc = alloc_extent_state(mask);
> > BUG_ON(!prealloc);
> > }
> >@@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > btrfs_debug_check_extent_io_range(tree, start, end);
> >
> > again:
> >- if (!prealloc && (mask & __GFP_WAIT)) {
> >+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> > /*
> > * Best effort, don't worry if extent state allocation fails
> > * here for the first iteration. We might have a cached state
> >@@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
> > u64 start = page_offset(page);
> > u64 end = start + PAGE_CACHE_SIZE - 1;
> >
> >- if ((mask & __GFP_WAIT) &&
> >+ if ((mask & __GFP_DIRECT_RECLAIM) &&
> > page->mapping->host->i_size > 16 * 1024 * 1024) {
> > u64 len;
> > while (start <= end) {
>
> Why not here as well.
>
Done
> >--- a/kernel/smp.c
> >+++ b/kernel/smp.c
> >@@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
> > cpumask_var_t cpus;
> > int cpu, ret;
> >
> >- might_sleep_if(gfp_flags & __GFP_WAIT);
> >+ might_sleep_if(gfp_flags & __GFP_DIRECT_RECLAIM);
> >
> > if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
> > preempt_disable();
> >diff --git a/lib/idr.c b/lib/idr.c
> >index 5335c43adf46..e5118fc82961 100644
> >--- a/lib/idr.c
> >+++ b/lib/idr.c
> >@@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
> > * allocation guarantee. Disallow usage from those contexts.
> > */
> > WARN_ON_ONCE(in_interrupt());
> >- might_sleep_if(gfp_mask & __GFP_WAIT);
> >+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> >
> > preempt_disable();
> >
> >@@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
> > struct idr_layer *pa[MAX_IDR_LEVEL + 1];
> > int id;
> >
> >- might_sleep_if(gfp_mask & __GFP_WAIT);
> >+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> >
> > /* sanity checks */
> > if (WARN_ON_ONCE(start < 0))
> >diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> >index f9ebe1c82060..cc5fdc3fb734 100644
> >--- a/lib/radix-tree.c
> >+++ b/lib/radix-tree.c
> >@@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
> > * preloading in the interrupt anyway as all the allocations have to
> > * be atomic. So just do normal allocation when in interrupt.
> > */
> >- if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> >+ if (!(gfp_mask & __GFP_DIRECT_RECLAIM) && !in_interrupt()) {
> > struct radix_tree_preload *rtp;
> >
> > /*
>
> These too?
>
Yep
> >diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> >index dac5bf59309d..2056d16807de 100644
> >--- a/mm/backing-dev.c
> >+++ b/mm/backing-dev.c
> >@@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
> > {
> > struct bdi_writeback *wb;
> >
> >- might_sleep_if(gfp & __GFP_WAIT);
> >+ might_sleep_if(gfp & __GFP_DIRECT_RECLAIM);
> >
> > if (!memcg_css->parent)
> > return &bdi->wb;
>
> ditto
>
Indeed.
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -2143,7 +2143,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> > return false;
> > if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
> > return false;
> >- if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> >+ if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>
> Should __GFP_ATOMIC really be here?
>
I felt it was safer because it is in line with the intent of
fail_page_alloc.ignore_gfp_wait.
> >diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> >index 197c3f59ecbf..c5fcdd6f85b7 100644
> >--- a/net/sctp/associola.c
> >+++ b/net/sctp/associola.c
> >@@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
> > /* Set an association id for a given association */
> > int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
> > {
> >- bool preload = !!(gfp & __GFP_WAIT);
> >+ bool preload = !!(gfp & __GFP_DIRECT_RECLAIM);
> > int ret;
> >
> > /* If the id is already assigned, keep it. */
>
> helper?
>
Yes. Thanks very much
--
Mel Gorman
SUSE Labs