linux-kernel - Re: [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120502162412.GF11435@suse.de>
Date:	Wed, 2 May 2012 17:24:12 +0100
From:	Mel Gorman <mgorman@...e.de>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Linux-MM <linux-mm@...ck.org>,
	Linux-Netdev <netdev@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	David Miller <davem@...emloft.net>, Neil Brown <neilb@...e.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Mike Christie <michaelc@...wisc.edu>,
	Eric B Munson <emunson@...bm.net>
Subject: Re: [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOC
 reserves are low and swap is backed by network storage

On Tue, May 01, 2012 at 03:24:37PM -0700, Andrew Morton wrote:
> On Mon, 16 Apr 2012 13:17:02 +0100
> Mel Gorman <mgorman@...e.de> wrote:
> 
> > If swap is backed by network storage such as NBD, there is a risk
> > that a large number of reclaimers can hang the system by consuming
> > all PF_MEMALLOC reserves. To avoid these hangs, the administrator
> > must tune min_free_kbytes in advance which is a bit fragile.
> > 
> > This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
> > are in use. If the system is routinely getting throttled the system
> > administrator can increase min_free_kbytes so degradation is smoother
> > but the system will keep running.
> > 
> >
> > ...
> >
> > +static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	unsigned long pfmemalloc_reserve = 0;
> > +	unsigned long free_pages = 0;
> > +	int i;
> > +	bool wmark_ok;
> > +
> > +	for (i = 0; i <= ZONE_NORMAL; i++) {
> > +		zone = &pgdat->node_zones[i];
> > +		pfmemalloc_reserve += min_wmark_pages(zone);
> > +		free_pages += zone_page_state(zone, NR_FREE_PAGES);
> > +	}
> > +
> > +	wmark_ok = (free_pages > pfmemalloc_reserve / 2) ? true : false;
> 
> 	wmark_ok = free_pages > pfmemalloc_reserve / 2;
> 

Of course, I don't know what I was on when I wrote that particular line.

> > +
> > +	/* kswapd must be awake if processes are being throttled */
> > +	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> > +		pgdat->classzone_idx = min(pgdat->classzone_idx,
> > +						(enum zone_type)ZONE_NORMAL);
> > +		wake_up_interruptible(&pgdat->kswapd_wait);
> > +	}
> > +
> > +	return wmark_ok;
> > +}
> > +
> > +/*
> > + * Throttle direct reclaimers if backing storage is backed by the network
> > + * and the PFMEMALLOC reserve for the preferred node is getting dangerously
> > + * depleted. kswapd will continue to make progress and wake the processes
> > + * when the low watermark is reached
> > + */
> > +static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > +					nodemask_t *nodemask)
> > +{
> > +	struct zone *zone;
> > +	int high_zoneidx = gfp_zone(gfp_mask);
> > +	pg_data_t *pgdat;
> > +
> > +	/* Kernel threads such as kjournald should not be throttled */
> 
> The comment should explain "why", not "what".  Particularly when the
> "what" was bleedin obvious ;)
> 
> Also...   why?
> 

        /*
         * Kernel threads should not be throttled as they may be indirectly
         * responsible for cleaning pages necessary for reclaim to make forward
         * progress. kjournald for example may enter direct reclaim while
         * committing a transaction where throttling it could forcing other
         * processes to block on log_wait_commit()
         */

Does that help?

> > +	if (current->flags & PF_KTHREAD)
> > +		return;
> > +
> > +	/* Check if the pfmemalloc reserves are ok */
> > +	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> > +	pgdat = zone->zone_pgdat;
> > +	if (pfmemalloc_watermark_ok(pgdat))
> > +		return;
> > +
> > +	/*
> > +	 * If the caller cannot enter the filesystem, it's possible that it
> > +	 * is processing a journal transaction. In this case, it is not safe
> > +	 * to block on pfmemalloc_wait as kswapd could also be blocked waiting
> > +	 * to start a transaction. Instead, throttle for up to a second before
> > +	 * the reclaim must continue.
> > +	 */
> 
> I suppose this applies to fs locks in general, not just to
> journal_start()?
> 

Yes. I updated the comment to reflect that.

        /*
         * If the caller cannot enter the filesystem, it's possible that it
         * is due to the caller holding an FS lock or performing a journal
         * transaction in the case of a filesystem like ext[3|4]. In this case,
         * it is not safe to block on pfmemalloc_wait as kswapd could be
         * blocked waiting on the same lock. Instead, throttle for up to a
         * second before continuing.
         */


> > +	if (!(gfp_mask & __GFP_FS)) {
> > +		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> > +			pfmemalloc_watermark_ok(pgdat), HZ);
> > +		return;
> > +	}
> > +
> > +	/* Throttle until kswapd wakes the process */
> > +	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> > +		pfmemalloc_watermark_ok(pgdat));
> > +}
> > +
> >  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  				gfp_t gfp_mask, nodemask_t *nodemask)
> >  {
> >
> > ...
> >
> > @@ -2610,6 +2686,20 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >  	if (remaining)
> >  		return true;
> >  
> > +	/*
> > +	 * There is a potential race between when kswapd checks it watermarks
> 
> "its"
> 

Fixed.

> > +	 * and a process gets throttled. There is also a potential race if
> > +	 * processes get throttled, kswapd wakes, a large process exits therby
> > +	 * balancing the zones that causes kswapd to miss a wakeup. If kswapd
> > +	 * is going to sleep, no process should be sleeping on pfmemalloc_wait
> > +	 * so wake them now if necessary. If necessary, processes will wake
> > +	 * kswapd and get throttled again
> > +	 */
> 
> Yes, the possibility for missed wakeups here worried me.  There's no
> synchronization and it would be easy to leave holes.
> 
> It's good that there is no timeout on the throttling - a timeout would
> cover up rare races most nastily.
> 

Yes and I wanted to avoid that. If there is a lost wakup, sysrq+t should
show processes stuck in throttle_direct_reclaim() while kswapd is asleep.

> > +	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> > +		wake_up(&pgdat->pfmemalloc_wait);
> > +		return true;
> > +	}
> 
> A bool-returning function called "sleeping_prematurely" should have no
> side-effects.  But it now performs wakeups.  Wanna see if there is a
> way of making this nicer?
> 

Minimally, the two instances of "There is a potential race" was a
merging mistake so I deleted the one in kswapd_try_to_sleep().

I looked at moving this wake_up outside sleeping_prematurely() but it
looked worse really. What I did instead was rename
sleeping_prematurely() to prepare_kswapd_sleep() and and commented it
like this

/*
 * Prepare kswapd for sleeping. This verifies that there are no processes
 * waiting in throttle_direct_reclaim() and that watermarks have been
 * met.
 *
 * Returns true if kswapd is ready to sleep
 */
static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
                                        int classzone_idx)

It's cheating a bit but a name like "prepare" implies that it may have
side-effects.

> >  	/* Check the watermark levels */
> >  	for (i = 0; i <= classzone_idx; i++) {
> >  		struct zone *zone = pgdat->node_zones + i;
> > @@ -2871,6 +2961,12 @@ loop_again:
> >  			}
> >  
> >  		}
> > +
> > +		/* Wake throttled direct reclaimers if low watermark is met */
> 
> s/"what"/"why"/ !
> 

                /*
                 * If the low watermark is met there is no need for processes
                 * to be throttled on pfmemalloc_wait as they should not be
                 * able to safely make forward progress. Wake them
                 */

?

Here is how the patch currently stands

---8<---
mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage

If swap is backed by network storage such as NBD, there is a risk
that a large number of reclaimers can hang the system by consuming
all PF_MEMALLOC reserves. To avoid these hangs, the administrator
must tune min_free_kbytes in advance which is a bit fragile.

This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother
but the system will keep running.

Signed-off-by: Mel Gorman <mgorman@...e.de>
---
 include/linux/mmzone.h |    1 +
 mm/page_alloc.c        |    1 +
 mm/vmscan.c            |  128 +++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 122 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dff7115..e6b733d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -663,6 +663,7 @@ typedef struct pglist_data {
 					     range, including holes */
 	int node_id;
 	wait_queue_head_t kswapd_wait;
+	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e225a7c..b9eb64a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4326,6 +4326,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33c332b..6f322e8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2431,6 +2431,80 @@ out:
 	return 0;
 }
 
+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	unsigned long pfmemalloc_reserve = 0;
+	unsigned long free_pages = 0;
+	int i;
+	bool wmark_ok;
+
+	for (i = 0; i <= ZONE_NORMAL; i++) {
+		zone = &pgdat->node_zones[i];
+		pfmemalloc_reserve += min_wmark_pages(zone);
+		free_pages += zone_page_state(zone, NR_FREE_PAGES);
+	}
+
+	wmark_ok = free_pages > pfmemalloc_reserve / 2;
+
+	/* kswapd must be awake if processes are being throttled */
+	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+		pgdat->classzone_idx = min(pgdat->classzone_idx,
+						(enum zone_type)ZONE_NORMAL);
+		wake_up_interruptible(&pgdat->kswapd_wait);
+	}
+
+	return wmark_ok;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+					nodemask_t *nodemask)
+{
+	struct zone *zone;
+	int high_zoneidx = gfp_zone(gfp_mask);
+	pg_data_t *pgdat;
+
+	/*
+	 * Kernel threads should not be throttled as they may be indirectly
+	 * responsible for cleaning pages necessary for reclaim to make forward
+	 * progress. kjournald for example may enter direct reclaim while
+	 * committing a transaction where throttling it could forcing other
+	 * processes to block on log_wait_commit().
+	 */
+	if (current->flags & PF_KTHREAD)
+		return;
+
+	/* Check if the pfmemalloc reserves are ok */
+	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+	pgdat = zone->zone_pgdat;
+	if (pfmemalloc_watermark_ok(pgdat))
+		return;
+
+	/*
+	 * If the caller cannot enter the filesystem, it's possible that it
+	 * is due to the caller holding an FS lock or performing a journal
+	 * transaction in the case of a filesystem like ext[3|4]. In this case,
+	 * it is not safe to block on pfmemalloc_wait as kswapd could be
+	 * blocked waiting on the same lock. Instead, throttle for up to a
+	 * second before continuing.
+	 */
+	if (!(gfp_mask & __GFP_FS)) {
+		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
+			pfmemalloc_watermark_ok(pgdat), HZ);
+		return;
+	}
+
+	/* Throttle until kswapd wakes the process */
+	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+		pfmemalloc_watermark_ok(pgdat));
+}
+
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
@@ -2449,6 +2523,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.gfp_mask = sc.gfp_mask,
 	};
 
+	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
+
+	/*
+	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
+	 * that the page allocator does not consider triggering OOM
+	 */
+	if (fatal_signal_pending(current))
+		return 1;
+
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
 				gfp_mask);
@@ -2598,8 +2681,13 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages >= (present_pages >> 2);
 }
 
-/* is kswapd sleeping prematurely? */
-static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
+/*
+ * Prepare kswapd for sleeping. This verifies that there are no processes
+ * waiting in throttle_direct_reclaim() and that watermarks have been met.
+ *
+ * Returns true if kswapd is ready to sleep
+ */
+static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
 {
 	int i;
@@ -2608,7 +2696,21 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
-		return true;
+		return false;
+
+	/*
+	 * There is a potential race between when kswapd checks its watermarks
+	 * and a process gets throttled. There is also a potential race if
+	 * processes get throttled, kswapd wakes, a large process exits therby
+	 * balancing the zones that causes kswapd to miss a wakeup. If kswapd
+	 * is going to sleep, no process should be sleeping on pfmemalloc_wait
+	 * so wake them now if necessary. If necessary, processes will wake
+	 * kswapd and get throttled again
+	 */
+	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
+		wake_up(&pgdat->pfmemalloc_wait);
+		return false;
+	}
 
 	/* Check the watermark levels */
 	for (i = 0; i <= classzone_idx; i++) {
@@ -2641,9 +2743,9 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return !pgdat_balanced(pgdat, balanced, classzone_idx);
+		return pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
-		return !all_zones_ok;
+		return all_zones_ok;
 }
 
 /*
@@ -2871,6 +2973,16 @@ loop_again:
 			}
 
 		}
+
+		/*
+		 * If the low watermark is met there is no need for processes
+		 * to be throttled on pfmemalloc_wait as they should not be
+		 * able to safely make forward progress. Wake them
+		 */
+		if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
+				pfmemalloc_watermark_ok(pgdat))
+			wake_up(&pgdat->pfmemalloc_wait);
+
 		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
 			break;		/* kswapd: all done */
 		/*
@@ -2971,7 +3083,7 @@ out:
 	}
 
 	/*
-	 * Return the order we were reclaiming at so sleeping_prematurely()
+	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
 	 * makes a decision on the order we were last reclaiming at. However,
 	 * if another caller entered the allocator slow path while kswapd
 	 * was awake, order will remain at the higher level
@@ -2991,7 +3103,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -3001,7 +3113,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/