linux-kernel - Re: Silent hang up caused by pages being not scanned?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20151019125315.GF11998@dhcp22.suse.cz>
Date:	Mon, 19 Oct 2015 14:53:15 +0200
From:	Michal Hocko <mhocko@...nel.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
	David Rientjes <rientjes@...gle.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Kyle Walker <kwalker@...hat.com>,
	Christoph Lameter <cl@...ux.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Vladimir Davydov <vdavydov@...allels.com>,
	linux-mm <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Stanislav Kozina <skozina@...hat.com>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>
Subject: Re: Silent hang up caused by pages being not scanned?

On Fri 16-10-15 11:34:48, Linus Torvalds wrote:
> On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@...nel.org> wrote:
> >
> > OK so here is what I am playing with currently. It is not complete
> > yet.
> 
> So this looks like it's going in a reasonable direction. However:
> 
> > +               if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> > +                               ac->high_zoneidx, alloc_flags, target)) {
> > +                       /* Wait for some write requests to complete then retry */
> > +                       wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > +                       goto retry;
> > +               }
> 
> I still think we should at least spend some time re-thinking that
> "wait_iff_congested()" thing.

You are right. I thought we would be congested most of the time because
of the heavy IO but a quick test has shown that the zone is marked
congested but the nr_wb_congested is zero all the time. That is most
probably because the IO is throttled severly by the lack of memory as
well.

> We may not actually be congested, but
> might be unable to write anything out because of our allocation flags
> (ie not allowed to recurse into the filesystems), so we might be in
> the situation that we have a lot of dirty pages that we can't directly
> do anything about.
> 
> Now, we will have woken kswapd, so something *will* hopefully be done
> about them eventually, but at no time do we actually really wait for
> it. We'll just busy-loop.
> 
> So at a minimum, I think we should yield to kswapd. We do do that
> "cond_resched()" in wait_iff_congested(), but I'm not entirely
> convinced that is at all enough to wait for kswapd to *do* something.

I went with congestion_wait which is what we used to do in the past
before wait_iff_congested has been introduced. The primary reason for
the change was that congestion_wait used to cause unhealthy stalls in
the direct reclaim where the bdi wasn't really congested and so we were
sleeping for the full timeout.

Now I think we can do better even with congestion_wait. We do not have
to wait when we did_some_progress so we won't affect a regular direct
reclaim path and we can reduce sleeping to:

dirty+writeback > reclaimable/2

This is a good signal that the reason for no progress is the stale
IO most likely and we need to wait even if the bdi itself is not
congested. We can also increase the timeout to HZ/10 because this is an
extreme slow path - we are not doing any progress and stalling is better
than OOM.

This is a diff on top of the previous patch. I even think that this part
would deserve a separate patch for a better bisect-ability. My testing
shows that close-to-oom behaves better (I can use more memory for
memeaters without hitting OOM)

What do you think?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e28028681c59..fed1bb7ea43a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3188,8 +3187,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 				ac->high_zoneidx, alloc_flags, target)) {
-			/* Wait for some write requests to complete then retry */
-			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			unsigned long writeback = zone_page_state(zone, NR_WRITEBACK),
+				      dirty = zone_page_state(zone, NR_FILE_DIRTY);
+			if (did_some_progress)
+				goto retry;
+
+			/*
+			 * If we didn't make any progress and have a lot of
+			 * dirty + writeback pages then we should wait for
+			 * an IO to complete to slow down the reclaim and
+			 * prevent from pre mature OOM
+			 */
+			if (2*(writeback + dirty) > reclaimable)
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			else
+				cond_resched();
 			goto retry;
 		}
 	}

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/