linux-kernel - Re: [PATCH 3/4] mm, page_alloc: Drain per-cpu pages from workqueue context

Message-ID: <20170124235457.x7ssjun5ht2ycyac@techsingularity.net>
Date:   Tue, 24 Jan 2017 23:54:57 +0000
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Tejun Heo <tj@...nel.org>
Cc:     Vlastimil Babka <vbabka@...e.cz>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux Kernel <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        Hillf Danton <hillf.zj@...baba-inc.com>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        Petr Mladek <pmladek@...e.cz>
Subject: Re: [PATCH 3/4] mm, page_alloc: Drain per-cpu pages from workqueue
 context

On Tue, Jan 24, 2017 at 11:07:22AM -0500, Tejun Heo wrote:
> Hello, Mel.
> 
> On Mon, Jan 23, 2017 at 11:04:29PM +0000, Mel Gorman wrote:
> > On Mon, Jan 23, 2017 at 03:55:01PM -0500, Tejun Heo wrote:
> > > Hello, Mel.
> > > 
> > > On Mon, Jan 23, 2017 at 08:04:12PM +0000, Mel Gorman wrote:
> > > > What is the actual mechanism that does that? It's not something that
> > > > schedule_on_each_cpu does and one would expect that the core workqueue
> > > > implementation would get this sort of detail correct. Or is this a proposal
> > > > on how it should be done?
> > > 
> > > If you use schedule_on_each_cpu(), it's all fine as the thing pins
> > > cpus and waits for all the work items synchronously.  If you wanna do
> > > it asynchronously, right now, you'll have to manually synchronize work
> > > items against the offline callback manually.
> > > 
> > 
> > Is the current implementation and what it does wrong in some way? I ask
> > because synchronising against the offline callback sounds like it would
> > be a bit of a maintenance mess for relatively little gain.
> 
> As long as you wrap them with get/put_online_cpus(), the current
> implementation should be fine.  If it were up to me, I'd rather use
> static percpu work_structs and synchronize with a mutex tho.  The cost
> of synchronizing via mutex isn't high here compared to the overall
> operation, the whole thing is synchronous anyway and you won't have to
> worry about falling back.
> 

The synchronisation is not even required in all cases. For example, it
does not necessarily make sense for multiple direct reclaimers to
synchronise just to do the drain. How does the following look to you?

---8<---
mm, page_alloc: Use static global work_struct for draining per-cpu pages

As suggested by Vlastimil Babka and Tejun Heo, this patch uses static
per-cpu work_structs, protected by a mutex, to co-ordinate the draining of
per-cpu pages on the workqueue. Only one task can drain at a time but this
is better than the previous scheme that allowed multiple tasks to send
IPIs at a time.

One consideration is whether parallel requests should synchronise against
each other. This patch does not synchronise for a global drain. The common
case for such callers is expected to be multiple parallel direct reclaimers
competing for pages when the watermark is close to min. Draining the
per-cpu list is unlikely to make much progress and serialising the drain
is of dubious merit in that case. Drains are synchronised for callers such
as memory hotplug and CMA that care about the drain being complete when
the function returns.

Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
---
 mm/page_alloc.c | 41 +++++++++++++++++++++++------------------
 1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e87508ffa759..da6be2a5ff7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -92,6 +92,10 @@ EXPORT_PER_CPU_SYMBOL(_numa_mem_);
 int _node_numa_mem_[MAX_NUMNODES];
 #endif
 
+/* work_structs for global per-cpu drains */
+DEFINE_MUTEX(pcpu_drain_mutex);
+DEFINE_PER_CPU(struct work_struct, pcpu_drain);
+
 #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
 volatile unsigned long latent_entropy __latent_entropy;
 EXPORT_SYMBOL(latent_entropy);
@@ -2351,7 +2355,6 @@ static void drain_local_pages_wq(struct work_struct *work)
  */
 void drain_all_pages(struct zone *zone)
 {
-	struct work_struct __percpu *works;
 	int cpu;
 
 	/*
@@ -2365,11 +2368,21 @@ void drain_all_pages(struct zone *zone)
 		return;
 
 	/*
+	 * Do not drain if one is already in progress unless it's specific to
+	 * a zone. Such callers are primarily CMA and memory hotplug and need
+	 * the drain to be complete when the call returns.
+	 */
+	if (unlikely(!mutex_trylock(&pcpu_drain_mutex))) {
+		if (!zone)
+			return;
+		mutex_lock(&pcpu_drain_mutex);
+	}
+
+	/*
 	 * As this can be called from reclaim context, do not reenter reclaim.
 	 * An allocation failure can be handled, it's simply slower
 	 */
 	get_online_cpus();
-	works = alloc_percpu_gfp(struct work_struct, GFP_ATOMIC);
 
 	/*
 	 * We don't care about racing with CPU hotplug event
@@ -2402,24 +2415,16 @@ void drain_all_pages(struct zone *zone)
 			cpumask_clear_cpu(cpu, &cpus_with_pcps);
 	}
 
-	if (works) {
-		for_each_cpu(cpu, &cpus_with_pcps) {
-			struct work_struct *work = per_cpu_ptr(works, cpu);
-			INIT_WORK(work, drain_local_pages_wq);
-			schedule_work_on(cpu, work);
-		}
-		for_each_cpu(cpu, &cpus_with_pcps)
-			flush_work(per_cpu_ptr(works, cpu));
-	} else {
-		for_each_cpu(cpu, &cpus_with_pcps) {
-			struct work_struct work;
-
-			INIT_WORK(&work, drain_local_pages_wq);
-			schedule_work_on(cpu, &work);
-			flush_work(&work);
-		}
+	for_each_cpu(cpu, &cpus_with_pcps) {
+		struct work_struct *work = per_cpu_ptr(&pcpu_drain, cpu);
+		INIT_WORK(work, drain_local_pages_wq);
+		schedule_work_on(cpu, work);
 	}
+	for_each_cpu(cpu, &cpus_with_pcps)
+		flush_work(per_cpu_ptr(&pcpu_drain, cpu));
+
 	put_online_cpus();
+	mutex_unlock(&pcpu_drain_mutex);
 }
 
 #ifdef CONFIG_HIBERNATION
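
The locking policy in the patch above (skip the drain if one is already
running, unless the caller needs it to be complete on return) can be
sketched in userspace as follows. This is only an illustration with POSIX
threads, not kernel code: drain_mutex, drains_done, do_drain() and
drain_all() are hypothetical names, and must_complete stands in for the
"zone != NULL" check in drain_all_pages().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel symbols. */
static pthread_mutex_t drain_mutex = PTHREAD_MUTEX_INITIALIZER;
static int drains_done;

/* Placeholder for the real draining work; it only counts invocations. */
static void do_drain(void)
{
	drains_done++;
}

/*
 * Try to drain. If another drain is already in progress, callers that
 * do not need completion (must_complete == false) return immediately;
 * callers that do need it block until they can run their own drain.
 *
 * Returns true if this caller performed a drain, false if it skipped.
 */
static bool drain_all(bool must_complete)
{
	if (pthread_mutex_trylock(&drain_mutex) != 0) {
		int rc;

		if (!must_complete)
			return false;

		/* Wait for the in-progress drain to finish, then drain again. */
		rc = pthread_mutex_lock(&drain_mutex);
		assert(rc == 0);
		(void)rc;
	}

	do_drain();
	pthread_mutex_unlock(&drain_mutex);
	return true;
}
```

In this sketch a caller that passes must_complete == false while the mutex
is held gets false back without waiting, which mirrors the early return for
a global drain. A caller that passes true always runs do_drain() itself
once the lock is available, which is the guarantee the commit message
describes for memory hotplug and CMA.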