Date:	Wed, 14 Oct 2015 15:22:48 +0200
From:	Michal Hocko <mhocko@...nel.org>
To:	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Cc:	rientjes@...gle.com, oleg@...hat.com,
	torvalds@...ux-foundation.org, kwalker@...hat.com, cl@...ux.com,
	akpm@...ux-foundation.org, hannes@...xchg.org,
	vdavydov@...allels.com, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, skozina@...hat.com
Subject: Re: Silent hang up caused by pages being not scanned?

On Wed 14-10-15 01:19:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I can see two options here. Either we teach zone_reclaimable to be less
> > fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> > them are risky because we have a long history of changes in this area
> > which made other subtle behavior changes but I guess that the first
> > option should be less fragile. What about the following patch? I am not
> > happy about it because the condition is rather rough and a deeper
> > inspection is really needed to check all the call sites but it should be
> > good for testing.
> 
> While zone_reclaimable() for Node 0 DMA32 became false with your patch,
> zone_reclaimable() for Node 0 DMA kept returning true, and as a result
> the overall result (i.e. zones_reclaimable) remained true.

Ahh, right you are. ZONE_DMA might have 0 or close to 0 pages on its
LRUs while it is still protected from allocations which are not
targeted at this zone. My patch clearly hasn't considered that. The
fix for that would be quite straightforward: we have to consider the
lowmem_reserve of the zone wrt. the allocation/reclaim gfp target
zone. But this is getting more and more ugly (see the patch below,
which is just for testing/demonstration purposes).
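
To make that concrete before the full patch, the new check boils down
to the free-standing sketch below. The helper name and the plain
integer parameters are mine for illustration only; the real code in
the patch uses zone_reclaimable_pages(), NR_FREE_PAGES,
min_wmark_pages() and zone->lowmem_reserve[]:

#include <stdbool.h>

/*
 * A zone's LRUs are only worth considering when the LRU pages plus
 * the free pages could lift the zone over its min watermark including
 * the lowmem reserve which protects it from the current allocation's
 * target zone.
 */
bool zone_lrus_worth_scanning(unsigned long nr_lru_pages,
			      unsigned long nr_free_pages,
			      unsigned long min_wmark,
			      unsigned long lowmem_reserve)
{
	return nr_lru_pages + nr_free_pages >= min_wmark + lowmem_reserve;
}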

The OOM report is really interesting:

> [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

so your whole file LRU is either dirty or under writeback and the
reclaimable pages are below the min wmark. This alone is quite
suspicious. Why hasn't balance_dirty_pages throttled the writers
rather than allowing them to make the whole LRU dirty? What is your
dirty{_background}_{ratio,bytes} configuration on that system?
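
(A trivial userspace snippet like the one below - my suggestion, not
something from this thread - would dump all four knobs; they are the
standard sysctl files under /proc/sys/vm/:)

#include <stdio.h>

int main(void)
{
	/* The writeback tunables asked about above. */
	static const char * const knobs[] = {
		"/proc/sys/vm/dirty_ratio",
		"/proc/sys/vm/dirty_bytes",
		"/proc/sys/vm/dirty_background_ratio",
		"/proc/sys/vm/dirty_background_bytes",
	};
	char buf[64];
	unsigned int i;

	for (i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
		FILE *f = fopen(knobs[i], "r");

		if (!f)
			continue;
		if (fgets(buf, sizeof(buf), f))
			printf("%s: %s", knobs[i], buf);
		fclose(f);
	}
	return 0;
}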

Also, why hasn't throttle_vm_writeout slowed the reclaim down?

Anyway, this is exactly the case where zone_reclaimable helps us
prevent a premature OOM: we are looping over the remaining LRU pages
without making progress, but they can still be reclaimed once the
writeback finishes... This just shows how subtle all this is :/
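
Just to have the heuristic in front of us, it is roughly the following
(current mm/vmscan.c, quoted from memory, so treat it as a sketch; it
matches the "(ACTIVE_FILE + INACTIVE_FILE) * 6 > PAGES_SCANNED" lines
in your instrumentation quoted in the changelog below):

static unsigned long zone_reclaimable_pages(struct zone *zone)
{
	unsigned long nr;

	/* Anon LRUs only count as reclaimable when swap is available. */
	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
	     zone_page_state(zone, NR_INACTIVE_FILE);

	if (get_nr_swap_pages() > 0)
		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
		      zone_page_state(zone, NR_INACTIVE_ANON);

	return nr;
}

bool zone_reclaimable(struct zone *zone)
{
	/* "Still reclaimable" until everything was scanned 6 times over. */
	return zone_page_state(zone, NR_PAGES_SCANNED) <
		zone_reclaimable_pages(zone) * 6;
}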

I have to think about this much more..
---
From c54a894490650dd65a98a2a0efa5324ecf3de61d Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@...e.com>
Date: Tue, 13 Oct 2015 15:12:13 +0200
Subject: [PATCH] mm, vmscan: Make zone_reclaimable less fragile

zone_reclaimable considers a zone unreclaimable if we have scanned all
of its reclaimable pages a sufficient number of times since the last
page was freed and that still hasn't led to an allocation success.
This can, however, lead to a livelock/thrashing when a single freed
page resets PAGES_SCANNED while memory consumers are looping over small
LRUs without making any progress (e.g. the remaining pages on the LRU
are dirty and all the flushers are blocked) and failing to invoke the
OOM killer because zone_reclaimable would consider the zone reclaimable.

Tetsuo Handa has reported the following:
: [   66.821450] zone_reclaimable returned 1 at line 2646
: [   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
: [   66.824935] shrink_zones returned 1 at line 2706
: [   66.826392] zones_reclaimable=1 at line 2765
: [   66.827865] do_try_to_free_pages returned 1 at line 2938
: [   67.102322] __perform_reclaim returned 1 at line 2854
: [   67.103968] did_some_progress=1 at line 3301
: (...snipped...)
: [  281.439977] zone_reclaimable returned 1 at line 2646
: [  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
: [  281.439978] shrink_zones returned 1 at line 2706
: [  281.439978] zones_reclaimable=1 at line 2765
: [  281.439979] do_try_to_free_pages returned 1 at line 2938
: [  281.439979] __perform_reclaim returned 1 at line 2854
: [  281.439980] did_some_progress=1 at line 3301

In his case anon LRUs are not reclaimable because there is no swap enabled.

It is not clear who frees a page that regularly, but it is clear that
no progress can be made, yet zone_reclaimable still considers the zone
reclaimable.

This patch makes sure that we do not rely on zone_reclaimable without
prior consideration in the direct reclaim path. Reclaimable LRU lists
have to contain sufficient pages to move us over the min watermark,
otherwise we wouldn't be able to make progress anyway. Please note that
we have to consider lowmem reserves for each zone because ZONE_DMA is
protected from most allocations and so its LRU lists might be too small
to ever scan enough pages for the zone to be considered unreclaimable.

Reported-by: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@...e.com>
---
 mm/vmscan.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c88d74ad9304..35a384c5bdab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2640,9 +2640,22 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
 			reclaimable = true;
 
-		if (global_reclaim(sc) &&
-		    !reclaimable && zone_reclaimable(zone))
-			reclaimable = true;
+		/*
+		 * Consider the current zone reclaimable even if we haven't
+		 * reclaimed anything if there are enough pages on reclaimable
+		 * LRU lists (they might be dirty or under writeback).
+		 */
+		if (global_reclaim(sc) && !reclaimable) {
+			unsigned long nr_reclaimable = zone_reclaimable_pages(zone);
+			unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
+			unsigned long reserve = zone->lowmem_reserve[gfp_zone(sc->gfp_mask)];
+
+			if (nr_reclaimable + free < min_wmark_pages(zone) + reserve)
+				continue;
+
+			if (zone_reclaimable(zone))
+				reclaimable = true;
+		}
 	}
 
 	/*
-- 
2.5.1

-- 
Michal Hocko
SUSE Labs