Message-ID: <20090508095010.GA26398@localhost>
Date: Fri, 8 May 2009 17:50:10 +0800
From: Wu Fengguang <fengguang.wu@...el.com>
To: "Rafael J. Wysocki" <rjw@...k.pl>
Cc: David Rientjes <rientjes@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
"linux-pm@...ts.linux-foundation.org"
<linux-pm@...ts.linux-foundation.org>,
"pavel@....cz" <pavel@....cz>,
"torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
"jens.axboe@...cle.com" <jens.axboe@...cle.com>,
"alan-jenkins@...fmail.co.uk" <alan-jenkins@...fmail.co.uk>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"kernel-testers@...r.kernel.org" <kernel-testers@...r.kernel.org>
Subject: Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
On Fri, May 08, 2009 at 07:11:30AM +0800, Rafael J. Wysocki wrote:
> On Friday 08 May 2009, David Rientjes wrote:
> > On Thu, 7 May 2009, Andrew Morton wrote:
> >
> > > The setting and clearing of that thing looks gruesomely racy..
> > >
> >
> > It's not racy currently because zone_scan_lock ensures ZONE_OOM_LOCKED
> > gets test/set and cleared atomically for the entire zonelist (the clear
> > happens for the same zonelist that was test/set).
> >
> > Using it for hibernation in the way I've proposed will open it up to the
> > race I earlier described: when a kthread is in the oom killer and
> > subsequently clears its zonelist of ZONE_OOM_LOCKED (all other tasks are
> > frozen so they can't be in the oom killer). That's perfectly acceptable,
> > however, since the system is by definition already oom if kthreads can't
> > get memory so it will end up killing a user task even though it's stuck in
> > D state and will exit on thaw; we aren't concerned about killing
> > needlessly because the oom killer becomes a no-op when it finds a task
> > that has already been killed but hasn't exited by way of TIF_MEMDIE.
>
> OK there.
>
> So everyone seems to agree we can do something like in the patch below?
>
> ---
> From: Rafael J. Wysocki <rjw@...k.pl>
> Subject: PM/Hibernate: Rework shrinking of memory
>
> Rework swsusp_shrink_memory() so that it calls shrink_all_memory()
> just once to make some room for the image and then allocates memory
> to apply more pressure to the memory management subsystem, if
> necessary.
Thanks! Reducing this to a single pass helps memory-rich laptops considerably :)
> Unfortunately, we don't seem to be able to drop shrink_all_memory()
> entirely just yet, because that would lead to huge performance
> regressions in some test cases.
Yes, but that's not the fault of this patch. In fact some of those
regressions may even act as healthy pressure on the page
allocation/reclaim code ;)
> Signed-off-by: Rafael J. Wysocki <rjw@...k.pl>
> ---
> kernel/power/snapshot.c | 151 +++++++++++++++++++++++++++++++++---------------
> 1 file changed, 104 insertions(+), 47 deletions(-)
>
> Index: linux-2.6/kernel/power/snapshot.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/snapshot.c
> +++ linux-2.6/kernel/power/snapshot.c
> @@ -1066,69 +1066,126 @@ void swsusp_free(void)
> buffer = NULL;
> }
>
> +/* Helper functions used for the shrinking of memory. */
> +
> /**
> - * swsusp_shrink_memory - Try to free as much memory as needed
> - *
> - * ... but do not OOM-kill anyone
> + * preallocate_image_memory - Allocate given number of page frames
> + * @nr_pages: Number of page frames to allocate
> *
> - * Notice: all userland should be stopped before it is called, or
> - * livelock is possible.
> + * Return value: Number of page frames actually allocated
> */
> -
> -#define SHRINK_BITE 10000
> -static inline unsigned long __shrink_memory(long tmp)
> +static unsigned long preallocate_image_memory(unsigned long nr_pages)
> {
> - if (tmp > SHRINK_BITE)
> - tmp = SHRINK_BITE;
> - return shrink_all_memory(tmp);
> + unsigned long nr_alloc = 0;
> +
> + while (nr_pages > 0) {
> + if (!alloc_image_page(GFP_KERNEL | __GFP_NOWARN))
> + break;
> + nr_pages--;
> + nr_alloc++;
> + }
> +
> + return nr_alloc;
> }
>
> +/**
> + * swsusp_shrink_memory - Make the kernel release as much memory as needed
> + *
> + * To create a hibernation image it is necessary to make a copy of every page
> + * frame in use. We also need a number of page frames to be free during
> + * hibernation for allocations made while saving the image and for device
> + * drivers, in case they need to allocate memory from their hibernation
> + * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
> + * respectively, both of which are rough estimates). To make this happen, we
> + * compute the total number of available page frames and allocate at least
> + *
> + * ([page frames total] + PAGES_FOR_IO + [metadata pages]) / 2 + 2 * SPARE_PAGES
> + *
> + * of them, which corresponds to the maximum size of a hibernation image.
> + *
> + * If image_size is set below the number following from the above formula,
> + * the preallocation of memory is continued until the total number of page
> + * frames in use is below the requested image size or it is impossible to
> + * allocate more memory, whichever happens first.
> + */
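Just to put rough numbers on that formula (made-up values; I'm also
guessing at PAGES_FOR_IO/SPARE_PAGES, which are tiny compared to RAM
anyway): with 250000 total page frames (~1GB at 4K pages), 1500
metadata pages, PAGES_FOR_IO = 1024 and SPARE_PAGES = 256, that is

	(250000 + 1024 + 1500) / 2 + 2 * 256 = 126774

page frames to preallocate, leaving max_size = 123226 saveable frames
in memory. So slightly more than half of RAM gets grabbed up front,
and even more if image_size is set lower.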
> int swsusp_shrink_memory(void)
> {
> - long tmp;
> struct zone *zone;
> - unsigned long pages = 0;
> - unsigned int i = 0;
> - char *p = "-\\|/";
> + unsigned long saveable, size, max_size, count, pages = 0;
> struct timeval start, stop;
> + int error = 0;
>
> - printk(KERN_INFO "PM: Shrinking memory... ");
> + printk(KERN_INFO "PM: Shrinking memory ... ");
> do_gettimeofday(&start);
> - do {
> - long size, highmem_size;
>
> - highmem_size = count_highmem_pages();
> - size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
> - tmp = size;
> - size += highmem_size;
> - for_each_populated_zone(zone) {
> - tmp += snapshot_additional_pages(zone);
> - if (is_highmem(zone)) {
> - highmem_size -=
> - zone_page_state(zone, NR_FREE_PAGES);
> - } else {
> - tmp -= zone_page_state(zone, NR_FREE_PAGES);
> - tmp += zone->lowmem_reserve[ZONE_NORMAL];
> - }
> - }
> + /* Count the number of saveable data pages. */
> + saveable = count_data_pages() + count_highmem_pages();
>
> - if (highmem_size < 0)
> - highmem_size = 0;
> + /*
> + * Compute the total number of page frames we can use (count) and the
> + * number of pages needed for image metadata (size).
> + */
> + count = saveable;
> + size = 0;
> + for_each_populated_zone(zone) {
> + size += snapshot_additional_pages(zone);
> + count += zone_page_state(zone, NR_FREE_PAGES);
> + count -= zone->pages_min;
I'd prefer to be safer, by removing the above line...
> + }
...and add another line here:
count -= totalreserve_pages;
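So the loop would end up looking like this (untested, just to
illustrate the suggestion):

	for_each_populated_zone(zone) {
		size += snapshot_additional_pages(zone);
		count += zone_page_state(zone, NR_FREE_PAGES);
	}
	count -= totalreserve_pages;

totalreserve_pages covers the zone watermarks and lowmem reserves, so
it should be a more conservative bound than subtracting just the
per-zone pages_min.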
But note that 'count' counts "saveable+free" memory.
We don't have a counter estimating the "free+freeable" memory,
i.e. the threshold above which we are sure preallocation cannot succeed.
One concrete situation: 800M of anonymous memory, but only a 500M
image_size and no swap space.
In that case we would otherwise head straight into the OOM code path.
Sure, OOM is (and shall be) reliably disabled during hibernation, but
we should still be cautious enough not to create a low-memory
situation, which will hurt:
- hibernation speed
  (vmscan goes mad trying to squeeze out the last free page)
- the user experience after resume
  (all *active* file data and metadata have to be reloaded)
The current code simply tries *too hard* to meet image_size.
I'd rather treat image_size as mild advice, and only free
"free+freeable-margin" pages when image_size is not attainable.
The safety margin could be totalreserve_pages, plus enough pages to
retain the "hard core working set".
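Very roughly, what I have in mind (pure hand-waving, nothing of this
exists today; working_set_pages() is made up and the freeable estimate
is crude):

	/*
	 * Sketch: cap the preallocation at what we believe is really
	 * free + freeable, minus a safety margin, instead of pushing
	 * all the way down to image_size.  (Ignores the max_size
	 * two-step logic above for brevity.)
	 */
	unsigned long margin, freeable, goal;

	margin = totalreserve_pages + working_set_pages();
	freeable = global_page_state(NR_FREE_PAGES)
			+ global_page_state(NR_FILE_PAGES)
			- global_page_state(NR_FILE_MAPPED);
	goal = saveable - size;			/* what image_size wants */

	if (freeable > margin)
		goal = min(goal, freeable - margin);
	else
		goal = 0;
	pages = preallocate_image_memory(goal);

That way vmscan never gets pushed into squeezing out the last few
pages, and the active working set has a chance of surviving until
after resume.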
Thanks,
Fengguang
> - tmp += highmem_size;
> - if (tmp > 0) {
> - tmp = __shrink_memory(tmp);
> - if (!tmp)
> - return -ENOMEM;
> - pages += tmp;
> - } else if (size > image_size / PAGE_SIZE) {
> - tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
> - pages += tmp;
> - }
> - printk("\b%c", p[i++%4]);
> - } while (tmp > 0);
> + /* Compute the maximum number of saveable pages to leave in memory. */
> + max_size = (count - (size + PAGES_FOR_IO)) / 2 - 2 * SPARE_PAGES;
> + size = DIV_ROUND_UP(image_size, PAGE_SIZE);
> + if (size > max_size)
> + size = max_size;
> + /*
> + * If the maximum is not less than the current number of saveable pages
> + * in memory, we don't need to do anything more.
> + */
> + if (size >= saveable)
> + goto out;
> +
> + /*
> + * Let the memory management subsystem know that we're going to need a
> + * large number of page frames to allocate and make it free some memory.
> + * NOTE: If this is not done, performance is heavily affected in some
> + * test cases.
> + */
> + shrink_all_memory(saveable - size);
> +
> + /*
> + * Prevent the OOM killer from triggering while we're allocating image
> + * memory.
> + */
> + for_each_populated_zone(zone)
> + zone_set_flag(zone, ZONE_OOM_LOCKED);
> + /*
> + * The number of saveable pages in memory was too high, so apply some
> + * pressure to decrease it. First, make room for the largest possible
> + * image and fail if that doesn't work. Next, try to decrease the size
> + * of the image as much as indicated by image_size.
> + */
> + count -= max_size;
> + pages = preallocate_image_memory(count);
> + if (pages < count)
> + error = -ENOMEM;
> + else
> + pages += preallocate_image_memory(max_size - size);
> +
> + for_each_populated_zone(zone)
> + zone_clear_flag(zone, ZONE_OOM_LOCKED);
> +
> + /* Release all of the preallocated page frames. */
> + swsusp_free();
> +
> + if (error) {
> + printk(KERN_CONT "\n");
> + return error;
> + }
> +
> + out:
> do_gettimeofday(&stop);
> - printk("\bdone (%lu pages freed)\n", pages);
> + printk(KERN_CONT "done (preallocated %lu free pages)\n", pages);
> swsusp_show_speed(&start, &stop, pages, "Freed");
>
> return 0;