Message-ID: <20210914163432.GR3828@suse.com>
Date: Tue, 14 Sep 2021 17:34:32 +0100
From: Mel Gorman <mgorman@...e.com>
To: NeilBrown <neilb@...e.de>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Theodore Ts'o <tytso@....edu>,
Andreas Dilger <adilger.kernel@...ger.ca>,
"Darrick J. Wong" <djwong@...nel.org>, Jan Kara <jack@...e.cz>,
Michal Hocko <mhocko@...e.com>,
Matthew Wilcox <willy@...radead.org>,
linux-xfs@...r.kernel.org, linux-ext4@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-nfs@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> Indefinite loops waiting for memory allocation are discouraged by
> documentation in gfp.h, which says of __GFP_NOFAIL that it
>
> is definitely preferable to use the flag rather than opencode endless
> loop around allocator.
>
> Such loops that use congestion_wait() are particularly unwise as
> congestion_wait() is indistinguishable from
> schedule_timeout_uninterruptible() in practice - and should be
> deprecated.
>
> So this patch changes the two loops in ext4_ext_truncate() to use
> __GFP_NOFAIL instead of looping.
>
> As the allocation is multiple layers deeper in the call stack, this
> requires passing the EXT4_EX_NOFAIL flag down and handling it in various
> places.
>
> Of particular interest is the ext4_journal_start family of calls which
> can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen
> as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is
> a high bit, so it is safe in practice.
>
> jbd2__journal_start() is enhanced so that the gfp_t flags passed are
> used for *all* allocations.
>
> Signed-off-by: NeilBrown <neilb@...e.de>
I'm not a fan. __GFP_NOFAIL allows access to emergency reserves, increasing
the risk of a livelock if memory is completely depleted, whereas some
callers can afford to wait.
The key event should be reclaim making progress. The hack below is
intended to vaguely demonstrate how blocking can be based on reclaim
making progress instead of "congestion" but has not even been booted. A
more complete overhaul may involve introducing
reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout, nodemask_t *nodemask)
and
reclaim_congestion_wait(gfp_t gfp_mask, long timeout)
and converting congestion_wait() and wait_iff_congested() to call
reclaim_congestion_wait_nodemask(), which waits on the first usable node,
and then auditing every single congestion_wait() user to see which API
they should call. Further work would be to establish whether the page
allocator should
call reclaim_congestion_wait_nodemask() if direct reclaim is not making
progress or whether that should be in vmscan.c. Conceivably, GFP_NOFAIL
could then soften its access to emergency reserves but I haven't given
it much thought.
Yes, it's significant work, but it would be better than letting
__GFP_NOFAIL propagate further and kicking the can down the road.
This hack is terrible and not the right way to do it; it is just to
illustrate that the idea of "waiting on memory should be based on reclaim
making progress, not the state of storage" is not impossible.
--8<--
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5c0318509f9e..5ed81c5746ec 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -832,6 +832,7 @@ typedef struct pglist_data {
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
+ wait_queue_head_t reclaim_wait;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6122c78ce914..21a9cd693d12 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -13,6 +13,7 @@
#include <linux/module.h>
#include <linux/writeback.h>
#include <linux/device.h>
+#include <linux/swap.h>
#include <trace/events/writeback.h>
struct backing_dev_info noop_backing_dev_info;
@@ -1013,25 +1014,41 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
EXPORT_SYMBOL(set_bdi_congested);
/**
- * congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
+ * congestion_wait - the docs are now worthless but avoiding a rename
*
- * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
- * write congestion. If no backing_devs are congested then just wait for the
- * next write to be completed.
+ * New thing -- wait for a timeout or reclaim to make progress
*/
long congestion_wait(int sync, long timeout)
{
+ pg_data_t *pgdat;
long ret;
unsigned long start = jiffies;
DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[sync];
+ wait_queue_head_t *wqh;
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
- ret = io_schedule_timeout(timeout);
+ /* Never let kswapd sleep on itself */
+ if (current_is_kswapd()) {
+ ret = timeout;
+ goto trace;
+ }
+
+ /*
+ * Dangerous, local memory may be forbidden by cpuset or policies,
+ * use first eligible zone in zonelists node instead
+ */
+ preempt_disable();
+ pgdat = NODE_DATA(numa_node_id());
+ preempt_enable();
+ wqh = &pgdat->reclaim_wait;
+
+ /*
+ * Should probably check watermark of suitable zones here
+ * in case this is spuriously called
+ */
+
+ prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
+ ret = schedule_timeout(timeout);
finish_wait(wqh, &wait);
+trace:
trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
jiffies_to_usecs(jiffies - start));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b09e71c9ce7..4b87b73d1264 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7418,6 +7418,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
pgdat_init_split_queue(pgdat);
pgdat_init_kcompactd(pgdat);
+ init_waitqueue_head(&pgdat->reclaim_wait);
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 158c9c93d03c..0ac2cf6be5e3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2888,6 +2888,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
+
static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -3070,6 +3072,18 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
sc))
goto again;
+ /*
+ * Might be race-prone, more appropriate to do this when exiting
+ * direct reclaim and when kswapd finds that pgdat is balanced.
+ * May also be appropriate to update pgdat_balanced to take
+ * a watermark level and wakeup when min watermarks are ok
+ * instead of waiting for the high watermark
+ */
+ if (waitqueue_active(&pgdat->reclaim_wait) &&
+ pgdat_balanced(pgdat, 0, ZONE_MOVABLE)) {
+ wake_up_interruptible(&pgdat->reclaim_wait);
+ }
+
/*
* Kswapd gives up on balancing particular nodes after too
* many failures to reclaim anything from them and goes to