linux-kernel - [PATCH] sched/cputime: do not account thread group tasks pending runtime to improve performance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160817093043.GA25206@redhat.com>
Date:	Wed, 17 Aug 2016 11:30:44 +0200
From:	Stanislaw Gruszka <sgruszka@...hat.com>
To:	linux-kernel@...r.kernel.org
Cc:	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Mike Galbraith <mgalbraith@...e.de>,
	Giovanni Gherdovich <ggherdovich@...e.cz>,
	Mel Gorman <mgorman@...hsingularity.net>
Subject: [PATCH] sched/cputime: do not account thread group tasks pending
 runtime to improve performance

Commit d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles") makes we
account thread group tasks pending runtime in thread_group_cputime().
Another commit 6e998916dfe32 ("sched/cputime:
Fix clock_nanosleep()/clock_gettime() inconsistency") makes we update
scheduler runtime statistics (call update_curr()) when read task pending
runtime. Those changes cause bad performance of times() and
clock_gettimes(CLOCK_PROCESS_CPUTIME_ID) syscalls.

While we would like to have cpuclock monotonicity kept i.e. have
problems fixed by above commits stay fixed, we also would like to have
good performance.

However when we notice that change from commit d670ec13178d0 is not
longer needed to solve problem addressed by that commit, because of
change from the second commit 6e998916dfe32, we can get room for
optimization. Since we update task while reading it's pending runtime
in task_sched_runtime(), clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will
see updated values and on testcase from d670ec13178d0 process cpuclock
will not be smaller than thread cpuclock.

I tested the patch on testcases from commits d670ec13178d0,
6e998916dfe32 and some other cpuclock/cputimers testcases and
did not found cpuclock monotonicity problems or other mallfunction.

Patch has drawback that we will not provide thread group cputime
up-to-date to the last moment. For example when arming cputime timer,
we will arm it with possibly a bit outdated values and that timer will
trigger earlier compared to behaviour without the patch. However that
was the behaviour before d670ec13178d0 commit (kernel v3.1).

Patch improves related syscalls performance what Giovanni measured
by benchmarks he described in -tip commit 6075620b0590e ("sched/cputime: 
Mitigate performance regression in times()/clock_gettime()").
Giovanni's benchmark results are below:
 
clock_gettime():

threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                       (pre-6e998916dfe3)
2          3.48        2.23 ( 35.68%)        3.06 ( 11.83%)        1.08 ( 68.81%)
5          3.33        2.83 ( 14.84%)        3.25 (  2.40%)        0.71 ( 78.55%)
8          3.37        2.84 ( 15.80%)        3.26 (  3.30%)        0.56 ( 83.49%)
12         3.32        3.09 (  6.69%)        3.37 ( -1.60%)        0.42 ( 87.28%)
21         4.01        3.14 ( 21.70%)        3.90 (  2.74%)        0.35 ( 91.35%)
30         3.63        3.28 (  9.75%)        3.36 (  7.41%)        0.28 ( 92.23%)
48         3.71        3.02 ( 18.69%)        3.11 ( 16.27%)        0.39 ( 89.39%)
79         3.75        2.88 ( 23.23%)        3.16 ( 15.74%)        0.46 ( 87.76%)
110        3.81        2.95 ( 22.62%)        3.25 ( 14.80%)        0.56 ( 85.41%)
128        3.88        3.05 ( 21.28%)        3.31 ( 14.76%)        0.62 ( 84.10%)

times():

threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                       (pre-6e998916dfe3)
2          3.65        2.27 ( 37.94%)        3.25 ( 11.03%)        1.62 ( 55.71%)
5          3.45        2.78 ( 19.34%)        3.17 (  7.92%)        2.33 ( 32.28%)
8          3.52        2.79 ( 20.66%)        3.22 (  8.69%)        2.06 ( 41.44%)
12         3.29        3.02 (  8.33%)        3.36 ( -2.04%)        2.00 ( 39.18%)
21         4.07        3.10 ( 23.86%)        3.92 (  3.78%)        2.07 ( 49.18%)
30         3.87        3.33 ( 13.80%)        3.40 ( 12.17%)        1.89 ( 51.12%)
48         3.79        2.96 ( 21.94%)        3.16 ( 16.61%)        1.69 ( 55.46%)
79         3.88        2.88 ( 25.82%)        3.28 ( 15.42%)        1.60 ( 58.81%)
110        3.90        2.98 ( 23.73%)        3.38 ( 13.35%)        1.73 ( 55.61%)
128        4.00        3.10 ( 22.40%)        3.38 ( 15.45%)        1.66 ( 58.52%)

Reported-and-tested-by: Giovanni Gherdovich <ggherdovich@...e.cz>
Signed-off-by: Stanislaw Gruszka <sgruszka@...hat.com>
---
 kernel/sched/cputime.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 1934f65..4fca604 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -301,6 +301,26 @@ static inline cputime_t account_other_time(cputime_t max)
 	return accounted;
 }
 
+#ifdef CONFIG_64BIT
+static inline u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	return t->se.sum_exec_runtime;
+}
+#else
+static u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	u64 ns;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(t, &rf);
+	ns = t->se.sum_exec_runtime;
+	task_rq_unlock(rq, t, &rf);
+
+	return ns;
+}
+#endif
+
 /*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
@@ -313,6 +333,17 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;
 
+	/*
+	 * Update current task runtime to account pending time since last
+	 * scheduler action or thread_group_cputime() call. This thread group
+	 * might have other running tasks on different CPUs, but updating
+	 * their runtime can affect syscall performance, so we skip account
+	 * those pending times and rely only on values updated on tick or
+	 * other scheduler action.
+	 */
+	if (same_thread_group(current, tsk))
+		(void) task_sched_runtime(current);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@@ -327,7 +358,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
-			times->sum_exec_runtime += task_sched_runtime(t);
+			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
 		/* If lockless access failed, take the lock. */
 		nextseq = 1;
-- 
1.8.3.1