Message-Id: <1477638642-17428-3-git-send-email-markivx@codeaurora.org>
Date: Fri, 28 Oct 2016 00:10:41 -0700
From: Vikram Mulukutla <markivx@...eaurora.org>
To: linux-kernel@...r.kernel.org
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Srivatsa Vaddagiri <vatsa@...eaurora.org>,
Steve Muckle <steve.muckle@...aro.org>,
Olav Haugan <ohaugan@...eaurora.org>,
Syed Rameez Mustafa <rameezmustafa@...eaurora.org>,
Joonwoo Park <joonwoop@...eaurora.org>,
Pavankumar Kondeti <pkondeti@...eaurora.org>,
Saravana Kannan <skannan@...eaurora.org>,
Bryan Huntsman <bryanh@...eaurora.org>,
Juri Lelli <juri.lelli@....com>,
Morten Rasmussen <morten.rasmussen@....com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Chris Redpath <chris.redpath@....com>,
Robin Randhawa <robin.randhawa@....com>,
Patrick Bellasi <patrick.bellasi@....com>,
Todd Kjos <tkjos@...gle.com>,
Srinath Sridharan <srinathsr@...gle.com>,
Andres Oportus <andresoportus@...gle.com>,
Leo Yan <leo.yan@...aro.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Vikram Mulukutla <markivx@...eauorora.org>,
Vikram Mulukutla <markivx@...eaurora.org>
Subject: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
From: Srivatsa Vaddagiri <vatsa@...eaurora.org>
This patch implements an alternative window-based CPU utilization
tracking mechanism in the scheduler. Per-task and per-CPU counters are
updated with utilization statistics using a time source that is
synchronized across CPUs, and a single statistic (prev_runnable_sum) is
fed to schedutil in an upcoming patch. A windowed view of time (window
size determined by walt_ravg_window) is used to compute CPU utilization.
There are two per-CPU rq quantities maintained by Window Assisted Load
Tracking (WALT), both normalized to the max possible frequency and the max
efficiency (IPC) of that CPU (provided the arch supports the invariance
APIs):
curr_runnable_sum: aggregate utilization of all tasks that executed
during the current (not yet completed) window
prev_runnable_sum: aggregate utilization of all tasks that executed
during the most recent completed window
prev_runnable_sum is the primary statistic used to guide CPU frequency in
lieu of PELT's cfs_rq->avg.util_avg. No additional policy is imposed on
this statistic; the assumption is that the consumer (e.g., schedutil) will
apply appropriate policy (e.g., headroom) before deciding on the next
P-state.
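As one illustration of how a consumer can use this statistic, the helper
added to walt.h by this patch converts prev_runnable_sum into a
utilization value on the same scale as util_avg (falling back to PELT when
the sysctl is cleared):

  static inline unsigned long cpu_walt_util(struct rq *rq)
  {
          if (!sysctl_sched_use_walt_metrics)
                  return rq->cfs.avg.util_avg;

          /* scale the window sum up to this CPU's original capacity */
          return (rq->prev_runnable_sum * rq->cpu_capacity_orig) /
                  walt_ravg_window;
  }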
Corresponding to the aggregate statistics, WALT also maintains the
following stats per task:
curr_window - represents the CPU utilization of the task in its most
recently tracked window
prev_window - represents the CPU utilization of the task in the window
prior to the one being tracked by curr_window
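For example, with the default 20ms window, a task that ran (in scaled
time) for 5ms during the window that just completed and for 2ms so far in
the current window has prev_window = 5ms and curr_window = 2ms, with the
same amounts reflected in the prev_runnable_sum and curr_runnable_sum of
the CPU(s) it ran on.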
WALT statistic updates are event driven, with updates occurring in
scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch),
at task wakeup and during task migration. Migration simply involves
subtracting a task's curr_window and prev_window from the source CPU's
curr_runnable_sum and prev_runnable_sum, and adding the per-task counters
to the destination CPU's aggregate counters. Execution time in an IRQ
handler is accounted in a CPU's curr_runnable_sum statistic, provided that
the CPU was also executing the idle task for the duration of the interrupt
handler.
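In simplified form (ignoring locking, window rollover and the underflow
checks; see walt_prepare_migrate() and walt_finish_migrate() below), the
migration fixup amounts to:

  /* on the source CPU, before the task is moved */
  src_rq->curr_runnable_sum -= p->ravg.curr_window;
  src_rq->prev_runnable_sum -= p->ravg.prev_window;

  /* on the destination CPU, once the task has moved */
  dest_rq->curr_runnable_sum += p->ravg.curr_window;
  dest_rq->prev_runnable_sum += p->ravg.prev_window;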
Idle task handling is modified by walt_io_is_busy: when it is set to 1 and
a CPU's rq has tasks blocked on IO, idle-task execution is accounted in
the per-task and per-CPU counters. Setting walt_io_is_busy also causes
interrupt handlers that run in the idle task to update the counters as if
the idle task itself were executing (instead of accounting just the
interrupt handler's execution time).
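(In the implementation below, this check is cpu_is_waiting_on_io(), which
simply reads rq->nr_iowait when walt_io_is_busy is set.)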
The major tunable provided by WALT is walt_ravg_window, which represents
the window size (in nanoseconds) and is set to 20ms by default.
walt_io_is_busy (described above) is set to 0 by default.
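The window size can also be set on the kernel command line via the
walt_ravg_window= early parameter added below; for example,
walt_ravg_window=10000000 selects a 10ms window. Values outside the
[5ms, 1s] range cause WALT to be disabled.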
Potential upcoming changes/improvements include: the use of sched_clock
instead of ktime_get as a time source and support for an unsynchronized
(across CPUs) time source. The event codes (PUT_PREV_TASK etc.) may also
potentially be replaced by combinations of existing flags, provided that
doesn't make the implementation too fragile.
Signed-off-by: Srivatsa Vaddagiri <vatsa@...eaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@...eaurora.org>
---
include/linux/sched/sysctl.h | 1 +
init/Kconfig | 9 +
kernel/sched/Makefile | 1 +
kernel/sched/cputime.c | 10 +-
kernel/sched/walt.c | 428 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/walt.h | 69 +++++++
kernel/sysctl.c | 11 ++
7 files changed, 528 insertions(+), 1 deletion(-)
create mode 100644 kernel/sched/walt.c
create mode 100644 kernel/sched/walt.h
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 22db1e6..7007815 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -31,6 +31,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
extern unsigned int sysctl_numa_balancing_scan_period_min;
extern unsigned int sysctl_numa_balancing_scan_period_max;
extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_sched_use_walt_metrics;
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..1e629ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -390,6 +390,15 @@ config IRQ_TIME_ACCOUNTING
If in doubt, say N here.
+config SCHED_WALT
+ bool "Support window based load tracking"
+ depends on SMP
+ help
+ This feature will allow the scheduler to maintain a tunable window
+ based set of metrics for tasks and runqueues. These metrics can be
+ used to guide task placement as well as task frequency requirements
+ for cpufreq governors.
+
config BSD_PROCESS_ACCT
bool "BSD Process Accounting"
depends on MULTIUSER
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5e59b83..41ada04 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,6 +19,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
obj-y += wait.o swait.o completion.o idle.o
obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
+obj-$(CONFIG_SCHED_WALT) += walt.o
obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a846cf8..704a6f6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -51,12 +51,15 @@ void irqtime_account_irq(struct task_struct *curr)
{
s64 delta;
int cpu;
+ u64 wallclock;
+ bool account = true;
if (!sched_clock_irqtime)
return;
cpu = smp_processor_id();
- delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
+ wallclock = sched_clock_cpu(cpu);
+ delta = wallclock - __this_cpu_read(irq_start_time);
__this_cpu_add(irq_start_time, delta);
irq_time_write_begin();
@@ -70,6 +73,11 @@ void irqtime_account_irq(struct task_struct *curr)
__this_cpu_add(cpu_hardirq_time, delta);
else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
__this_cpu_add(cpu_softirq_time, delta);
+ else
+ account = false;
+
+ if (account && is_idle_task(curr))
+ walt_account_irqtime(cpu, curr, delta, wallclock);
irq_time_write_end();
}
diff --git a/kernel/sched/walt.c b/kernel/sched/walt.c
new file mode 100644
index 0000000..ed9fde1
--- /dev/null
+++ b/kernel/sched/walt.c
@@ -0,0 +1,428 @@
+/*
+ * Copyright (c) 2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ *
+ * Window Assisted Load Tracking (WALT) implementation credits:
+ * Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
+ * Pavan Kumar Kondeti, Olav Haugan
+ *
+ * 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
+ * and Todd Kjos
+ * 2016-08-31: Integration with mainline by Srivatsa Vaddagiri
+ * and Vikram Mulukutla
+ */
+
+#include <linux/syscore_ops.h>
+#include <linux/cpufreq.h>
+#include "sched.h"
+#include "walt.h"
+
+
+char *task_event_names[] = {"PUT_PREV_TASK", "PICK_NEXT_TASK",
+ "TASK_WAKE", "TASK_MIGRATE", "TASK_UPDATE",
+ "IRQ_UPDATE"};
+
+__read_mostly unsigned int sysctl_sched_use_walt_metrics = 1;
+
+static __read_mostly unsigned int walt_freq_account_wait_time;
+static __read_mostly unsigned int walt_io_is_busy;
+
+/* 1 -> use PELT based load stats, 0 -> use window-based load stats */
+static unsigned int __read_mostly walt_disabled;
+
+/* Window size (in ns) */
+__read_mostly unsigned int walt_ravg_window = 20000000;
+
+/* Min window size (in ns) = 5ms */
+#define MIN_SCHED_RAVG_WINDOW 5000000
+
+/* Max window size (in ns) = 1s */
+#define MAX_SCHED_RAVG_WINDOW 1000000000
+
+static unsigned int sync_cpu;
+static ktime_t ktime_last;
+static bool walt_ktime_suspended;
+
+u64 walt_ktime_clock(void)
+{
+ if (unlikely(walt_ktime_suspended))
+ return ktime_to_ns(ktime_last);
+ return ktime_get_ns();
+}
+
+static void walt_resume(void)
+{
+ walt_ktime_suspended = false;
+}
+
+static int walt_suspend(void)
+{
+ ktime_last = ktime_get();
+ walt_ktime_suspended = true;
+ return 0;
+}
+
+static struct syscore_ops walt_syscore_ops = {
+ .resume = walt_resume,
+ .suspend = walt_suspend
+};
+
+static int __init walt_init_ops(void)
+{
+ register_syscore_ops(&walt_syscore_ops);
+ return 0;
+}
+late_initcall(walt_init_ops);
+
+static int __init set_walt_ravg_window(char *str)
+{
+ get_option(&str, &walt_ravg_window);
+
+ walt_disabled = (walt_ravg_window < MIN_SCHED_RAVG_WINDOW ||
+ walt_ravg_window > MAX_SCHED_RAVG_WINDOW);
+ return 0;
+}
+
+early_param("walt_ravg_window", set_walt_ravg_window);
+
+static void
+update_window_start(struct rq *rq, u64 wallclock)
+{
+ s64 delta;
+ int nr_windows;
+ u64 prev_sum = 0;
+
+ delta = wallclock - rq->window_start;
+ BUG_ON(delta < 0);
+ if (delta < walt_ravg_window)
+ return;
+
+ nr_windows = div64_u64(delta, walt_ravg_window);
+ if (nr_windows == 1)
+ prev_sum = rq->curr_runnable_sum;
+
+ rq->prev_runnable_sum = prev_sum;
+ rq->curr_runnable_sum = 0;
+
+ rq->window_start += (u64)nr_windows * (u64)walt_ravg_window;
+}
+
+static u64 scale_exec_time(u64 delta, struct rq *rq)
+{
+ unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu_of(rq));
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu_of(rq));
+ u64 scaled_delta = cap_scale(delta, scale_freq);
+
+ return cap_scale(scaled_delta, scale_cpu);
+}
+
+static int cpu_is_waiting_on_io(struct rq *rq)
+{
+ if (!walt_io_is_busy)
+ return 0;
+
+ return atomic_read(&rq->nr_iowait);
+}
+
+static int account_cpu_busy_time(struct rq *rq, struct task_struct *p,
+ u64 irqtime, int event)
+{
+ if (is_idle_task(p)) {
+ /* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
+ if (event == PICK_NEXT_TASK)
+ return 0;
+
+ /* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
+ return irqtime || cpu_is_waiting_on_io(rq);
+ }
+
+ if (event == TASK_WAKE)
+ return 0;
+
+ if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
+ event == TASK_UPDATE)
+ return 1;
+
+ /* Only TASK_MIGRATE && PICK_NEXT_TASK left */
+ return walt_freq_account_wait_time;
+}
+
+/*
+ * Account cpu activity in its busy time counters (rq->curr/prev_runnable_sum)
+ */
+static void update_cpu_busy_time(struct task_struct *p, struct rq *rq,
+ int event, u64 wallclock, u64 irqtime)
+{
+ int new_window, nr_full_windows = 0;
+ u64 mark_start = p->ravg.mark_start;
+ u64 window_start = rq->window_start;
+ u32 window_size = walt_ravg_window;
+ u64 delta;
+
+ new_window = mark_start < window_start;
+ if (new_window)
+ nr_full_windows = div64_u64((window_start - mark_start),
+ window_size);
+
+ /* Handle window rollover */
+ if (new_window) {
+ if (!is_idle_task(p)) {
+ u32 curr_window = 0;
+
+ if (!nr_full_windows)
+ curr_window = p->ravg.curr_window;
+
+ p->ravg.prev_window = curr_window;
+ p->ravg.curr_window = 0;
+ }
+ }
+
+ if (!account_cpu_busy_time(rq, p, irqtime, event))
+ return;
+
+ if (!new_window) {
+ if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq))
+ delta = wallclock - mark_start;
+ else
+ delta = irqtime;
+ delta = scale_exec_time(delta, rq);
+ rq->curr_runnable_sum += delta;
+ if (!is_idle_task(p))
+ p->ravg.curr_window += delta;
+
+ return;
+ }
+
+ if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) {
+ if (!nr_full_windows) {
+ /*
+ * A full window hasn't elapsed, account partial
+ * contribution to previous completed window.
+ */
+ delta = scale_exec_time(window_start - mark_start, rq);
+ p->ravg.prev_window += delta;
+ } else {
+ /*
+ * Since at least one full window has elapsed,
+ * the contribution to the previous window is the
+ * full window (window_size).
+ */
+ delta = scale_exec_time(window_size, rq);
+ p->ravg.prev_window = delta;
+ }
+ rq->prev_runnable_sum += delta;
+
+ /* Account piece of busy time in the current window. */
+ delta = scale_exec_time(wallclock - window_start, rq);
+ rq->curr_runnable_sum += delta;
+ p->ravg.curr_window = delta;
+
+ return;
+ }
+
+ if (irqtime) {
+ /* IRQ busy time start = wallclock - irqtime */
+ mark_start = wallclock - irqtime;
+
+ if (mark_start > window_start) {
+ rq->curr_runnable_sum += scale_exec_time(irqtime, rq);
+ return;
+ }
+
+ /*
+ * IRQ busy time spanned multiple windows. Process the
+ * busy time preceding the current window first
+ */
+ delta = window_start - mark_start;
+ if (delta > window_size)
+ delta = window_size;
+ delta = scale_exec_time(delta, rq);
+ rq->prev_runnable_sum += delta;
+
+ /* Process the remaining IRQ busy time in the current window. */
+ delta = wallclock - window_start;
+ rq->curr_runnable_sum += scale_exec_time(delta, rq);
+
+ return;
+ }
+
+ BUG();
+}
+
+/* Reflect task activity on its demand and cpu's busy time statistics */
+void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
+ enum task_event event, u64 wallclock, u64 irqtime)
+{
+ if (walt_disabled || !rq->window_start)
+ return;
+
+ lockdep_assert_held(&rq->lock);
+
+ update_window_start(rq, wallclock);
+
+ if (!p->ravg.mark_start)
+ goto done;
+
+ update_cpu_busy_time(p, rq, event, wallclock, irqtime);
+
+done:
+ p->ravg.mark_start = wallclock;
+}
+
+void walt_mark_task_starting(struct task_struct *p)
+{
+ u64 wallclock;
+ struct rq *rq = task_rq(p);
+
+ if (!rq->window_start)
+ return;
+
+ wallclock = walt_ktime_clock();
+ p->ravg.mark_start = wallclock;
+}
+
+void walt_set_window_start(struct rq *rq)
+{
+ int cpu = cpu_of(rq);
+ struct rq *sync_rq = cpu_rq(sync_cpu);
+ unsigned long flags;
+
+ if (rq->window_start || walt_ktime_clock() < walt_ravg_window)
+ return;
+
+ if (cpu == sync_cpu) {
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ rq->window_start = walt_ktime_clock();
+ rq->curr_runnable_sum = rq->prev_runnable_sum = 0;
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+ } else {
+ local_irq_save(flags);
+ double_rq_lock(rq, sync_rq);
+ rq->window_start = cpu_rq(sync_cpu)->window_start;
+ rq->curr_runnable_sum = rq->prev_runnable_sum = 0;
+ double_rq_unlock(rq, sync_rq);
+ local_irq_restore(flags);
+ }
+}
+
+void walt_migrate_sync_cpu(int cpu)
+{
+ if (cpu == sync_cpu)
+ sync_cpu = smp_processor_id();
+}
+
+void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq, bool locked)
+{
+ u64 wallclock;
+ unsigned long flags;
+
+ if (!p->on_rq && p->state != TASK_WAKING)
+ return;
+
+ if (locked == false)
+ raw_spin_lock_irqsave(&dest_rq->lock, flags);
+
+ lockdep_assert_held(&dest_rq->lock);
+
+ wallclock = walt_ktime_clock();
+
+ /* Update counters on destination CPU */
+ walt_update_task_ravg(dest_rq->curr, dest_rq,
+ TASK_UPDATE, wallclock, 0);
+
+ /* We may be in a new window. Update task counters */
+ walt_update_task_ravg(p, dest_rq, TASK_MIGRATE, wallclock, 0);
+
+ if (p->ravg.curr_window) {
+ if (!dest_rq->window_start) {
+ p->ravg.curr_window = 0;
+ p->ravg.mark_start = 0;
+ }
+ dest_rq->curr_runnable_sum += p->ravg.curr_window;
+ }
+
+ if (p->ravg.prev_window) {
+ if (!dest_rq->window_start)
+ p->ravg.prev_window = 0;
+ dest_rq->prev_runnable_sum += p->ravg.prev_window;
+ }
+
+ if (locked == false)
+ raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
+}
+
+void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq, bool locked)
+{
+ u64 wallclock;
+ unsigned long flags;
+
+ if (!p->on_rq && p->state != TASK_WAKING)
+ return;
+
+ if (locked == false)
+ raw_spin_lock_irqsave(&src_rq->lock, flags);
+
+ lockdep_assert_held(&src_rq->lock);
+
+ /* Note that the same wallclock reference is used for both events below */
+ wallclock = walt_ktime_clock();
+
+ /* Update counters on source CPU */
+ walt_update_task_ravg(task_rq(p)->curr, task_rq(p),
+ TASK_UPDATE, wallclock, 0);
+
+ /* Update task's counters */
+ walt_update_task_ravg(p, task_rq(p), TASK_MIGRATE, wallclock, 0);
+
+ /* Fixup busy time */
+ if (p->ravg.curr_window)
+ src_rq->curr_runnable_sum -= p->ravg.curr_window;
+
+ if (p->ravg.prev_window)
+ src_rq->prev_runnable_sum -= p->ravg.prev_window;
+
+ if ((s64)src_rq->prev_runnable_sum < 0) {
+ src_rq->prev_runnable_sum = 0;
+ WARN_ON(1);
+ }
+ if ((s64)src_rq->curr_runnable_sum < 0) {
+ src_rq->curr_runnable_sum = 0;
+ WARN_ON(1);
+ }
+
+ if (locked == false)
+ raw_spin_unlock_irqrestore(&src_rq->lock, flags);
+}
+
+void walt_account_irqtime(int cpu, struct task_struct *curr,
+ u64 delta, u64 wallclock)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+
+ /*
+ * cputime (wallclock) uses sched_clock so use the same here for
+ * consistency.
+ */
+ delta += sched_clock_cpu(cpu) - wallclock;
+
+ walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(), delta);
+
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+void walt_init_new_task_load(struct task_struct *p)
+{
+ memset(&p->ravg, 0, sizeof(struct ravg));
+}
diff --git a/kernel/sched/walt.h b/kernel/sched/walt.h
new file mode 100644
index 0000000..6dad2e1
--- /dev/null
+++ b/kernel/sched/walt.h
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __WALT_H
+#define __WALT_H
+
+#ifdef CONFIG_SCHED_WALT
+
+void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
+ enum task_event event, u64 wallclock, u64 irqtime);
+void walt_prepare_migrate(struct task_struct *p, struct rq *rq, bool locked);
+void walt_finish_migrate(struct task_struct *p, struct rq *rq, bool locked);
+void walt_init_new_task_load(struct task_struct *p);
+void walt_mark_task_starting(struct task_struct *p);
+void walt_set_window_start(struct rq *rq);
+void walt_migrate_sync_cpu(int cpu);
+u64 walt_ktime_clock(void);
+void walt_account_irqtime(int cpu, struct task_struct *curr, u64 delta,
+ u64 wallclock);
+
+extern unsigned int sysctl_sched_use_walt_metrics;
+extern unsigned int walt_ravg_window;
+
+/* Fold into cpu_util */
+static inline unsigned long cpu_walt_util(struct rq *rq)
+{
+ if (!sysctl_sched_use_walt_metrics)
+ return rq->cfs.avg.util_avg;
+
+ return (rq->prev_runnable_sum * rq->cpu_capacity_orig) /
+ walt_ravg_window;
+}
+
+#else /* CONFIG_SCHED_WALT */
+
+static inline void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
+ int event, u64 wallclock, u64 irqtime) { }
+static inline void walt_prepare_migrate(struct task_struct *p, struct rq *rq,
+ bool locked) { }
+static inline void walt_finish_migrate(struct task_struct *p, struct rq *rq,
+ bool locked) { }
+static inline void walt_init_new_task_load(struct task_struct *p) { }
+static inline void walt_mark_task_starting(struct task_struct *p) { }
+static inline void walt_set_window_start(struct rq *rq) { }
+static inline void walt_migrate_sync_cpu(int cpu) { }
+static inline u64 walt_ktime_clock(void) { return 0; }
+static inline void walt_account_irqtime(int cpu, struct task_struct *curr,
+ u64 delta, u64 wallclock) { }
+
+static inline unsigned long cpu_walt_util(struct rq *rq)
+{
+ return rq->cfs.avg.util_avg;
+}
+
+#endif /* CONFIG_SCHED_WALT */
+
+extern unsigned int walt_ravg_window;
+
+#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a13bbda..e0f61db 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -439,6 +439,17 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
#endif
+#ifdef CONFIG_SCHED_WALT
+ {
+ .procname = "sched_use_walt_metrics",
+ .data = &sysctl_sched_use_walt_metrics,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif /* CONFIG_SCHED_WALT */
#ifdef CONFIG_CFS_BANDWIDTH
{
.procname = "sched_cfs_bandwidth_slice_us",
--
TheMan