linux-kernel - Re: unnecessary timer interrupt of slab.c and bdi tasks when the system is in sleep state

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20101001141932.GA5467@nowhere>
Date:	Fri, 1 Oct 2010 16:20:12 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Christoph Lameter <cl@...ux.com>,
	Pekka Enberg <penberg@...helsinki.fi>
Cc:	"Wu, Xia" <xia.wu@...el.com>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
	"mingo@...e.hu" <mingo@...e.hu>,
	"peterz@...radead.org" <peterz@...radead.org>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"Van De Ven, Arjan" <arjan.van.de.ven@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"Zhu, Daniel" <daniel.zhu@...el.com>,
	"Wang, Yong Y" <yong.y.wang@...el.com>
Subject: Re: unnecessary timer interrupt of slab.c and bdi tasks when the
	system is in sleep state

On Thu, Sep 30, 2010 at 09:13:02AM -0500, Christoph Lameter wrote:
> On Thu, 30 Sep 2010, Wu, Xia wrote:
> 
> > I found some unnecessary timer interrupts when the system enter sleep state.
> > (1)     /mm/slab.c
> > cache_reap() clean up on allocated memory every 2s. If the system is in sleep state, the system is waked-up when this timer expires. In fact,
> > there isn't more slabs to been cleaned up in sleep state.
> 
> Right. We could switch off the timer when idle without much of an issue.
> The expiration of the caches wont occur and so we will have stale objects
> on the queues when we exit sleep state. You could flush the queues before
> switching off the timers?


May be flushing the queue everytime we enter nohz is too much, as that can
happen very often?

Another idea would be to disarm the scheduled work if it is the next timer
that will exit nohz and then in this case schedule it right now instead,
before entering idle.

Otherwise if this timer is after the time nohz will exit (ie: there is another
timer before), then we don't care, and that should probably cover most cases.

But we really need a quick way to know if cache_reap() is needed at any time,
this is really what we lack, so that we even further optimize with such state
machine:

enter nohz:

* cache reap not needed? just cancel the delayed work
* cache_reap needed and this is the timer that will wake up
  nohz? cancel the timer and schedule the work right now.
* otherwise don't change anything

exit_nohz:

* work was cancelled? reschedule it.


But for that we need a kind of need_cache_reap() that I'm not sure how
to implement as I don't know much the slab code.

Does that sound silly?

Here is how that could look like (untested, since that requires need_cache_reap()):


diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index b2f1a4d..34f2935 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -277,5 +277,13 @@ extern struct blocking_notifier_head reboot_notifier_list;
 #define VT_UPDATE		0x0004 /* A bigger update occurred */
 #define VT_PREWRITE		0x0005 /* A char is about to be written to the console */
 
+/* Dyntick events */
+#define NOHZ_ENTER_PREPARE	0x0001 /*
+					* Still have a chance here to wake up a task
+					* or change a timer state
+					*/
+#define NOHZ_ENTER		0x0002
+#define NOHZ_EXIT		0x0003
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_NOTIFIER_H */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b232ccc..f4998cf 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -7,6 +7,7 @@
 #define _LINUX_TICK_H
 
 #include <linux/clockchips.h>
+#include <linux/notifier.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 
@@ -121,12 +122,17 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
+extern int register_nohz_notifier(struct notifier_block *nb);
 extern void tick_nohz_stop_sched_tick(int inidle);
 extern void tick_nohz_restart_sched_tick(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 # else
+static inline int register_nohz_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
 static inline void tick_nohz_stop_sched_tick(int inidle) { }
 static inline void tick_nohz_restart_sched_tick(void) { }
 static inline ktime_t tick_nohz_get_sleep_length(void)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e216e0..7b0f5e0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -248,6 +248,24 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
+
+static RAW_NOTIFIER_HEAD(nohz_chain);
+
+int register_nohz_notifier(struct notifier_block *nb)
+{
+	return raw_notifier_chain_register(&nohz_chain, nb);
+}
+
+static int nohz_notify(unsigned long val, void *arg)
+{
+	int ret;
+
+	ret = __raw_notifier_call_chain(&nohz_chain, val, arg,
+					-1, NULL);
+
+	return notifier_to_errno(ret);
+}
+
 /**
  * tick_nohz_stop_sched_tick - stop the idle tick from the idle task
  *
@@ -262,7 +280,7 @@ void tick_nohz_stop_sched_tick(int inidle)
 	ktime_t last_update, expires, now;
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
-	int cpu;
+	int cpu, err;
 
 	local_irq_save(flags);
 
@@ -324,21 +342,29 @@ void tick_nohz_stop_sched_tick(int inidle)
 		time_delta = timekeeping_max_deferment();
 	} while (read_seqretry(&xtime_lock, seq));
 
-	if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) ||
-	    arch_needs_cpu(cpu)) {
-		next_jiffies = last_jiffies + 1;
-		delta_jiffies = 1;
-	} else {
-		/* Get the next timer wheel timer */
-		next_jiffies = get_next_timer_interrupt(last_jiffies);
-		delta_jiffies = next_jiffies - last_jiffies;
-	}
-	/*
-	 * Do not stop the tick, if we are only one off
-	 * or if the cpu is required for rcu
-	 */
-	if (!ts->tick_stopped && delta_jiffies == 1)
-		goto out;
+	do {
+		if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) ||
+		    arch_needs_cpu(cpu)) {
+			next_jiffies = last_jiffies + 1;
+			delta_jiffies = 1;
+		} else {
+			/* Get the next timer wheel timer */
+			next_jiffies = get_next_timer_interrupt(last_jiffies);
+			delta_jiffies = next_jiffies - last_jiffies;
+		}
+		/*
+		 * Do not stop the tick, if we are only one off
+		 * or if the cpu is required for rcu
+		 */
+		if (!ts->tick_stopped && delta_jiffies == 1)
+			goto out;
+
+		err = nohz_notify(NOHZ_ENTER_PREPARE, &next_jiffies);
+
+	} while (err == -EAGAIN));
+
+	if (err)
+		goto end;
 
 	/* Schedule the tick, if we are at least one jiffie off */
 	if ((long)delta_jiffies >= 1) {
@@ -410,6 +436,8 @@ void tick_nohz_stop_sched_tick(int inidle)
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 			ts->idle_jiffies = last_jiffies;
+
+			nohz_notify(NOHZ_ENTER, cpu);
 			rcu_enter_nohz();
 		}
 
@@ -521,6 +549,7 @@ void tick_nohz_restart_sched_tick(void)
 	ts->inidle = 0;
 
 	rcu_exit_nohz();
+	nohz_notify(NOHZ_EXIT, cpu);
 
 	/* Update jiffies first */
 	select_nohz_load_balancer(0);
diff --git a/mm/slab.c b/mm/slab.c
index fcae981..662a341 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -115,6 +115,7 @@
 #include	<linux/debugobjects.h>
 #include	<linux/kmemcheck.h>
 #include	<linux/memory.h>
+#include	<linux/tick.h>
 
 #include	<asm/cacheflush.h>
 #include	<asm/tlbflush.h>
@@ -1323,10 +1324,64 @@ static int __cpuinit cpuup_callback(struct notifier_block *nfb,
 	return notifier_from_errno(err);
 }
 
-static struct notifier_block __cpuinitdata cpucache_notifier = {
+static struct notifier_block __cpuinitdata cpucache_cpu_notifier = {
 	&cpuup_callback, NULL, 0
 };
 
+#ifdef CONFIG_NO_HZ
+static int nohz_callback(struct notifier_block *nfb,
+			 unsigned long action, void *arg)
+{
+	struct delayed_work *work = &__get_cpu_var(slab_reap_work, cpu);
+	int err = 0;
+
+	switch (action) {
+	case NOHZ_ENTER_PREPARE:
+		unsigned long nohz_exit = (unsigned long)arg;
+
+		if (!delayed_work_pending(work))
+			goto end;
+
+		/*
+		 * Nothing to flush. Cancel the work and force
+		 * tick_nohz_stop_sched_tick to re-find
+		 * the next timer interrupt.
+		 *
+		 * need_cache_reap() has yet to be implemented :-)
+		 */
+		if (!need_cache_reap(smp_processor_id())) {
+			cancel_delayed_work(work);
+			err = -EAGAIN;
+			goto end;
+		}
+
+		/*
+		 * Something to flush and the next timer is _likely_ us.
+		 * Cancel nohz and schedule the work right now to avoid
+		 * breaking long idle cycles.
+		 */
+		if (time_before_eq(work->timer.expires, nohz_exit)) {
+			cancel_delayed_work(work);
+			schedule_delayed_work(work, 0);
+			err = -EBUSY;
+		}
+		break;
+
+	case NOHZ_EXIT:
+		if (!delayed_work_pending(work))
+			schedule_delayed_work(work, round_jiffies_relative(REAPTIMEOUT_CPUC));
+		break;
+	}
+
+end:
+	return notifier_from_errno(0);
+}
+
+static struct notifier_block cpucache_nohz_notifier = {
+	&nohz_callback, NULL, 0
+};
+#endif
+
 #if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
 /*
  * Drains freelist for a node on each slab cache, used for memory hot-remove.
@@ -1638,7 +1693,11 @@ void __init kmem_cache_init_late(void)
 	 * Register a cpu startup notifier callback that initializes
 	 * cpu_cache_get for all new cpus
 	 */
-	register_cpu_notifier(&cpucache_notifier);
+	register_cpu_notifier(&cpucache_cpu_notifier);
+
+#ifdef CONFIG_NO_HZ
+	register_nohz_notifier(&cpucache_nohz_notifier);
+#endif
 
 #ifdef CONFIG_NUMA
 	/*
@@ -4118,6 +4177,9 @@ static void cache_reap(struct work_struct *w)
 	int node = numa_mem_id();
 	struct delayed_work *work = to_delayed_work(w);
 
+	if (!need_cache_reap(smp_processor_id()))
+		return;
+
 	if (!mutex_trylock(&cache_chain_mutex))
 		/* Give up. Setup the next iteration. */
 		goto out;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/