linux-kernel - Re: readahead and oom

Date:	Thu, 28 Apr 2011 12:19:47 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Minchan Kim <minchan.kim@...il.com>,
	Dave Young <hidave.darkstar@...il.com>,
	linux-mm <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Mel Gorman <mel@...ux.vnet.ibm.com>
Subject: Re: readahead and oom

On Wed, Apr 27, 2011 at 03:47:43AM +0800, Andrew Morton wrote:
> On Tue, 26 Apr 2011 17:20:29 +0800
> Wu Fengguang <fengguang.wu@...el.com> wrote:
> 
> > Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.
> > 
> > readahead page allocations are completely optional. They are OK to
> > fail and in particular shall not trigger OOM on themselves.
> 
> I have distinct recollections of trying this many years ago, finding
> that it caused problems then deciding not to do it.  But I can't find
> an email trail and I don't remember the reasons :(

The most likely reason is that page allocations can fail even when
there are plenty of _global_ reclaimable pages.

> If the system is so stressed for memory that the oom-killer might get
> involved then the readahead pages may well be getting reclaimed before
> the application actually gets to use them.  But that's just an aside.

Yes, when direct reclaim is working as expected, readahead thrashing
should happen long before NORETRY page allocation failures and OOM.

With that assumption I think it's OK to do this patch: for readahead,
sporadic allocation failures are acceptable. But there is a problem,
see below.

> Ho hum.  The patch *seems* good (as it did 5-10 years ago ;)) but there
> may be surprising side-effects which could be exposed under heavy
> testing.  Testing which I'm sure hasn't been performed...

The NORETRY direct reclaim does tend to fail a lot more under concurrent
reclaim, where the pages one task has just reclaimed can be stolen by
other tasks before it is able to allocate them.

        __alloc_pages_direct_reclaim()
        {
                did_some_progress = try_to_free_pages();

                // pages stolen by others

                page = get_page_from_freelist();
        }
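
To make the window concrete, here is a minimal user-space sketch of the
same sequence. It is a toy model, not kernel code: struct toy_zone and
all toy_* helpers are hypothetical names, and the "steal" is simulated
deterministically in a single thread rather than by real concurrent
tasks. It only shows why a single-shot (NORETRY-like) attempt can fail
where a retrying one eventually succeeds.

```c
/* Toy model of one zone: just a count of free pages. */
struct toy_zone {
        int free_pages;
};

/* Pretend reclaim always frees nr pages; returns the progress made. */
int toy_reclaim(struct toy_zone *z, int nr)
{
        z->free_pages += nr;
        return nr;
}

/* Another task grabs up to nr of the pages that were just freed. */
void toy_steal(struct toy_zone *z, int nr)
{
        if (nr > z->free_pages)
                nr = z->free_pages;
        z->free_pages -= nr;
}

/* Take one page from the free list; returns 1 on success, 0 on failure. */
int toy_alloc(struct toy_zone *z)
{
        if (z->free_pages <= 0)
                return 0;
        z->free_pages--;
        return 1;
}

/*
 * Reclaim, then try to allocate.  *steals_left is the number of rounds
 * in which other tasks take what we reclaimed before we get to it.
 * With noretry set we give up after the first failed attempt.
 */
int toy_direct_reclaim_alloc(struct toy_zone *z, int noretry,
                             int *steals_left)
{
        for (;;) {
                int progress = toy_reclaim(z, 1);

                if (*steals_left > 0) {
                        /* pages stolen by others */
                        toy_steal(z, progress);
                        (*steals_left)--;
                }
                if (toy_alloc(z))
                        return 1;
                if (noretry)
                        return 0;
        }
}
```

With one simulated steal, the noretry call returns 0 although reclaim
made progress, while the retrying call succeeds on its second pass.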

Here are the tests to demonstrate this problem.

Out of 1000GB of reads and the page allocations they trigger:

        test-ra-thrash.sh: read 1000 1G files interleaved in 1 single task:

        nr_alloc_fail 733

        test-dd-sparse.sh: read 1000 1G files concurrently in 1000 tasks:

        nr_alloc_fail 11799
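
The nr_alloc_fail numbers come from the debug patch below, which adds
the counter to vmstat_text[] so that it shows up as a "name value" line
in /proc/vmstat. A small helper along these lines could pull it out. This
is only a sketch: vmstat_parse_line() and vmstat_read_counter() are
hypothetical names, and it assumes the usual one-counter-per-line
"name value" layout.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one "name value" line of vmstat-style text.
 * Returns 0 and stores the value in *out if the line is for counter
 * "name", -1 otherwise.
 */
int vmstat_parse_line(const char *line, const char *name,
                      unsigned long long *out)
{
        size_t len = strlen(name);

        /* The name must match in full and be followed by a space. */
        if (strncmp(line, name, len) != 0 || line[len] != ' ')
                return -1;
        return sscanf(line + len, "%llu", out) == 1 ? 0 : -1;
}

/*
 * Scan a vmstat-format file (e.g. "/proc/vmstat") for counter "name".
 * Returns 0 on success, -1 if the file cannot be opened or the counter
 * is not present.
 */
int vmstat_read_counter(const char *path, const char *name,
                        unsigned long long *out)
{
        char line[256];
        FILE *f = fopen(path, "r");
        int ret = -1;

        if (!f)
                return -1;
        while (ret != 0 && fgets(line, sizeof(line), f))
                ret = vmstat_parse_line(line, name, out);
        fclose(f);
        return ret;
}
```

Sampling vmstat_read_counter("/proc/vmstat", "nr_alloc_fail", &v) before
and after a test run and taking the difference gives the per-run count.
On a kernel without the debug patch the counter is absent and the call
simply returns -1.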


Thanks,
Fengguang
---

--- linux-next.orig/include/linux/mmzone.h	2011-04-27 21:58:27.000000000 +0800
+++ linux-next/include/linux/mmzone.h	2011-04-27 21:58:39.000000000 +0800
@@ -106,6 +106,7 @@ enum zone_stat_item {
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_ALLOC_FAIL,
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
--- linux-next.orig/mm/page_alloc.c	2011-04-27 21:58:27.000000000 +0800
+++ linux-next/mm/page_alloc.c	2011-04-27 21:58:39.000000000 +0800
@@ -2176,6 +2176,8 @@ rebalance:
 	}
 
 nopage:
+	inc_zone_state(preferred_zone, NR_ALLOC_FAIL);
+	/* count_zone_vm_events(PGALLOCFAIL, preferred_zone, 1 << order); */
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		unsigned int filter = SHOW_MEM_FILTER_NODES;
 
--- linux-next.orig/mm/vmstat.c	2011-04-27 21:58:27.000000000 +0800
+++ linux-next/mm/vmstat.c	2011-04-27 21:58:53.000000000 +0800
@@ -879,6 +879,7 @@ static const char * const vmstat_text[] 
 	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
+	"nr_alloc_fail",
 
 #ifdef CONFIG_NUMA
 	"numa_hit",

Download attachment "test-dd-sparse.sh" of type "application/x-sh" (126 bytes)

Download attachment "test-ra-thrash.sh" of type "application/x-sh" (115 bytes)