linux-kernel - Re: [PATCH 06/10] mm: page_alloc: Distinguish between being unable to sleep, unwilling to unwilling and avoiding waking kswapd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55D72ABD.8020205@suse.cz>
Date:	Fri, 21 Aug 2015 15:42:21 +0200
From:	Vlastimil Babka <vbabka@...e.cz>
To:	Mel Gorman <mgorman@...hsingularity.net>,
	Linux-MM <linux-mm@...ck.org>
Cc:	Johannes Weiner <hannes@...xchg.org>,
	Rik van Riel <riel@...hat.com>,
	David Rientjes <rientjes@...gle.com>,
	Joonsoo Kim <iamjoonsoo.kim@....com>,
	Michal Hocko <mhocko@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 06/10] mm: page_alloc: Distinguish between being unable to
 sleep, unwilling to unwilling and avoiding waking kswapd

On 08/12/2015 12:45 PM, Mel Gorman wrote:
> __GFP_WAIT has been used to identify atomic context in callers that hold
> spinlocks or are in interrupts. They are expected to be high priority and
> have access one of two watermarks lower than "min". __GFP_HIGH users get
> access to the first lower watermark and can be called the "high priority
> reserve". Atomic users and interrupts access yet another lower watermark
> that can be called the "atomic reserve".
>
> Over time, callers had a requirement to not block when fallback options
> were available. Some have abused __GFP_WAIT leading to a situation where
> an optimisitic allocation with a fallback option can access atomic reserves.
>
> This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
> cannot sleep and have no alternative. High priority users continue to use
> __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers
> that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> as a caller that is willing to enter direct reclaim and wake kswapd for
> background reclaim.
>
> This patch then converts a number of sites
>
> o __GFP_ATOMIC is used by callers that are high priority and have memory
>    pools for those requests. GFP_ATOMIC uses this flag. Callers with
>    interrupts disabled still automatically use the atomic reserves.

Hm I can't see where the latter happens? In gfp_to_alloc_flags(), 
ALLOC_HARDER is set for __GFP_ATOMIC, or rt-tasks *not* in interrupt? 
What am I missing?

> o Callers that have a limited mempool to guarantee forward progress use
>    __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
>    kswapd will still be woken but atomic reserves are not used as there
>    is a one-entry mempool to guarantee progress.
>
> o Callers that are checking if they are non-blocking should use the
>    helper gfpflags_allows_blocking() where possible. This is because

A bit subjective but gfpflags_allow_blocking() sounds better to me.
Or shorter gfp_allows_blocking()?

>    checking for __GFP_WAIT as was done historically now can trigger false
>    positives. Some exceptions like dm-crypt.c exist where the code intent
>    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
>    flag manipulations.
>
> The key hazard to watch out for is callers that removed __GFP_WAIT and
> was depending on access to atomic reserves for inconspicuous reasons.
> In some cases it may be appropriate for them to use __GFP_HIGH.

Hm we might also have a (non-fatal) hazard of callers that directly 
combined __GFP_* flags that didn't include __GFP_WAIT, but did wake up 
kswapd, and now might be missing __GFP_KSWAPD_RECLAIM. Did you try 
checking for those? I imagine it's not a simple task...

> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>

>
> diff --git a/Documentation/vm/balance b/Documentation/vm/balance
> index c46e68cf9344..6f1f6fae30f5 100644
> --- a/Documentation/vm/balance
> +++ b/Documentation/vm/balance
> @@ -1,12 +1,14 @@
>   Started Jan 2000 by Kanoj Sarcar <kanoj@....com>
>
> -Memory balancing is needed for non __GFP_WAIT as well as for non
> -__GFP_IO allocations.
> +Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> +well as for non __GFP_IO allocations.
>
> -There are two reasons to be requesting non __GFP_WAIT allocations:
> -the caller can not sleep (typically intr context), or does not want
> -to incur cost overheads of page stealing and possible swap io for
> -whatever reasons.
> +The first reason why a caller may avoid reclaim is that the caller can not
> +sleep due to holding a spinlock or is in interrupt context. The second may
> +be that the caller is willing to fail the allocation without incurring the
> +overhead of page stealing. This may happen for opportunistic high-order

I think "page stealing" has nowadays a different meaning in the 
anti-fragmentation context? Should it just say "reclaim"?

> +allocation requests that have order-0 fallback options. In such cases,
> +the caller may also wish to avoid waking kswapd.
>
>   __GFP_IO allocation requests are made to prevent file system deadlocks.
>
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index cba12f34ff77..100d3fbaebae 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
>
>   	if (is_coherent || nommu())
>   		addr = __alloc_simple_buffer(dev, size, gfp, &page);
> -	else if (!(gfp & __GFP_WAIT))
> +	else if (gfp & __GFP_ATOMIC)
>   		addr = __alloc_from_pool(size, &page);
>   	else if (!dev_get_cma_area(dev))
>   		addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
> @@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
>   	*handle = DMA_ERROR_CODE;
>   	size = PAGE_ALIGN(size);
>
> -	if (!(gfp & __GFP_WAIT))
> +	if (gfp & __GFP_ATOMIC)
>   		return __iommu_alloc_atomic(dev, size, handle);
>
>   	/*
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index d16a1cead23f..713d963fb96b 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
>   	if (IS_ENABLED(CONFIG_ZONE_DMA) &&
>   	    dev->coherent_dma_mask <= DMA_BIT_MASK(32))
>   		flags |= GFP_DMA;
> -	if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
> +	if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_DIRECT_RECLAIM)) {
>   		struct page *page;
>   		void *addr;
>
> @@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
>
>   	size = PAGE_ALIGN(size);
>
> -	if (!coherent && !(flags & __GFP_WAIT)) {
> +	if (!coherent && (flags & __GFP_ATOMIC)) {
>   		struct page *page = NULL;
>   		void *addr = __alloc_from_pool(size, &page, flags);
>

Hmm these change the lack of __GFP_WAIT to expect __GFP_ATOMIC, so it's 
potentially one of those "key hazards" mentioned in the changelog, 
right? But here it's not just about using atomic reserves, but using a 
completely different allocation function.
E.g. in case of arch/arm/mm/dma-mapping.c:__dma_alloc() I see it can go 
to __alloc_remap_buffer -> __dma_alloc_remap -> dma_common_contiguous_remap
which does kmalloc(..., GFP_KERNEL) and has comment "Cannot be used in 
non-sleeping contexts".

So I think callers that cannot sleep and did clear __GFP_WAIT before, 
are now dangerous unless they set __GFP_ATOMIC?

> diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
> index 2a3973a7c441..dc611c8cad10 100644
> --- a/drivers/firewire/core-cdev.c
> +++ b/drivers/firewire/core-cdev.c
> @@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
>   static int add_client_resource(struct client *client,
>   			       struct client_resource *resource, gfp_t gfp_mask)
>   {
> -	bool preload = !!(gfp_mask & __GFP_WAIT);
> +	bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);

Use the helper here to avoid !! as a bonus?

> --- a/drivers/infiniband/core/sa_query.c
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
>
>   static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
>   {
> -	bool preload = !!(gfp_mask & __GFP_WAIT);
> +	bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
>   	unsigned long flags;
>   	int ret, id;
>

Same here.

> diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
> index d51687780b61..06badad3ab75 100644
> --- a/drivers/usb/host/u132-hcd.c
> +++ b/drivers/usb/host/u132-hcd.c
> @@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
>   {
>   	struct u132 *u132 = hcd_to_u132(hcd);
>   	if (irqs_disabled()) {
> -		if (__GFP_WAIT & mem_flags) {
> +		if (__GFP_DIRECT_RECLAIM & mem_flags) {
>   			printk(KERN_ERR "invalid context for function that migh"
>   				"t sleep\n");
>   			return -EINVAL;

And here - no other flag manipulations and it would match the printk.
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>   	if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
>   		clear = 1;
>   again:
> -	if (!prealloc && (mask & __GFP_WAIT)) {
> +	if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
>   		/*
>   		 * Don't care for allocation failure here because we might end
>   		 * up not needing the pre-allocated extent state at all, which
> @@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>
>   	bits |= EXTENT_FIRST_DELALLOC;
>   again:
> -	if (!prealloc && (mask & __GFP_WAIT)) {
> +	if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
>   		prealloc = alloc_extent_state(mask);
>   		BUG_ON(!prealloc);
>   	}
> @@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>   	btrfs_debug_check_extent_io_range(tree, start, end);
>
>   again:
> -	if (!prealloc && (mask & __GFP_WAIT)) {
> +	if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
>   		/*
>   		 * Best effort, don't worry if extent state allocation fails
>   		 * here for the first iteration. We might have a cached state
> @@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
>   	u64 start = page_offset(page);
>   	u64 end = start + PAGE_CACHE_SIZE - 1;
>
> -	if ((mask & __GFP_WAIT) &&
> +	if ((mask & __GFP_DIRECT_RECLAIM) &&
>   	    page->mapping->host->i_size > 16 * 1024 * 1024) {
>   		u64 len;
>   		while (start <= end) {

Why not here as well.

> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
>   	cpumask_var_t cpus;
>   	int cpu, ret;
>
> -	might_sleep_if(gfp_flags & __GFP_WAIT);
> +	might_sleep_if(gfp_flags & __GFP_DIRECT_RECLAIM);
>
>   	if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
>   		preempt_disable();
> diff --git a/lib/idr.c b/lib/idr.c
> index 5335c43adf46..e5118fc82961 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
>   	 * allocation guarantee.  Disallow usage from those contexts.
>   	 */
>   	WARN_ON_ONCE(in_interrupt());
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
>   	preempt_disable();
>
> @@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
>   	struct idr_layer *pa[MAX_IDR_LEVEL + 1];
>   	int id;
>
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
>   	/* sanity checks */
>   	if (WARN_ON_ONCE(start < 0))
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index f9ebe1c82060..cc5fdc3fb734 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
>   	 * preloading in the interrupt anyway as all the allocations have to
>   	 * be atomic. So just do normal allocation when in interrupt.
>   	 */
> -	if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> +	if (!(gfp_mask & __GFP_DIRECT_RECLAIM) && !in_interrupt()) {
>   		struct radix_tree_preload *rtp;
>
>   		/*

These too?

> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index dac5bf59309d..2056d16807de 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
>   {
>   	struct bdi_writeback *wb;
>
> -	might_sleep_if(gfp & __GFP_WAIT);
> +	might_sleep_if(gfp & __GFP_DIRECT_RECLAIM);
>
>   	if (!memcg_css->parent)
>   		return &bdi->wb;

ditto

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2143,7 +2143,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>   		return false;
>   	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
>   		return false;
> -	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> +	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))

Should __GFP_ATOMIC really be here?

> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 197c3f59ecbf..c5fcdd6f85b7 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
>   /* Set an association id for a given association */
>   int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
>   {
> -	bool preload = !!(gfp & __GFP_WAIT);
> +	bool preload = !!(gfp & __GFP_DIRECT_RECLAIM);
>   	int ret;
>
>   	/* If the id is already assigned, keep it. */

helper?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/