Message-ID: <20160812121010.GA30199@redhat.com>
Date: Fri, 12 Aug 2016 14:10:11 +0200
From: Stanislaw Gruszka <sgruszka@...hat.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: Giovanni Gherdovich <ggherdovich@...e.cz>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Mike Galbraith <mgalbraith@...e.de>,
linux-kernel@...r.kernel.org,
Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in
times()/clock_gettime()
Hi
On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> Nice detective work! I'm wondering, where do we stand if compared with a
> pre-6e998916dfe3 kernel?
>
> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
> make the result correct, just see what the performance gap is, roughly.
>
> If there's still a significant gap then it might make sense to optimize this some
> more.
I measured the (partial) revert performance on 4.7 using the mmtests
instructions from Giovanni, and also tested another possible fix (draft version):
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..54fdf6d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;
 
+	(void) task_sched_runtime(tsk);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
-			times->sum_exec_runtime += task_sched_runtime(t);
+			times->sum_exec_runtime += t->se.sum_exec_runtime;
 		}
 		/* If lockless access failed, take the lock. */
 		nextseq = 1;
---
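
For context, the per-thread task_sched_runtime() call that the second hunk
removes looks roughly like this (paraphrased and simplified from
kernel/sched/core.c around v4.7, not verbatim); the task_rq_lock() it takes
for every thread in the group is where the times()/clock_gettime() cost
comes from:

unsigned long long task_sched_runtime(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq;
	u64 ns;

	/*
	 * Per-thread rq lock: cheap for a single thread, expensive when
	 * thread_group_cputime() takes it once for every thread.
	 */
	rq = task_rq_lock(p, &rf);
	/*
	 * Only account pending runtime for a task that is current and
	 * queued, otherwise we could charge cycles that will never be
	 * accounted to this thread.
	 */
	if (task_current(rq, p) && task_on_rq_queued(p)) {
		update_rq_clock(rq);
		p->sched_class->update_curr(rq);
	}
	ns = p->se.sum_exec_runtime;
	task_rq_unlock(rq, p, &rf);

	return ns;
}

With the patch, that locked update happens once, for the current thread,
and the loop then does plain reads of t->se.sum_exec_runtime for the rest.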
mmtests benchmark results are below (full compare-kernels.sh output is in the attachment):
 vanilla-4.7          revert           prefetch           patch
4.74 (  0.00%)    3.04 ( 35.93%)    4.09 ( 13.81%)    1.30 ( 72.59%)
5.49 (  0.00%)    5.00 (  8.97%)    5.34 (  2.72%)    1.03 ( 81.16%)
6.12 (  0.00%)    4.91 ( 19.73%)    5.97 (  2.40%)    0.90 ( 85.27%)
6.68 (  0.00%)    4.90 ( 26.66%)    6.02 (  9.75%)    0.88 ( 86.89%)
7.21 (  0.00%)    5.13 ( 28.85%)    6.70 (  7.09%)    0.87 ( 87.91%)
7.66 (  0.00%)    5.22 ( 31.80%)    7.17 (  6.39%)    0.92 ( 88.01%)
7.91 (  0.00%)    5.36 ( 32.22%)    7.30 (  7.72%)    0.95 ( 87.97%)
7.95 (  0.00%)    5.35 ( 32.73%)    7.34 (  7.66%)    1.06 ( 86.66%)
8.00 (  0.00%)    5.33 ( 33.31%)    7.38 (  7.73%)    1.13 ( 85.82%)

5.61 (  0.00%)    3.55 ( 36.76%)    4.53 ( 19.23%)    2.29 ( 59.28%)
5.66 (  0.00%)    4.32 ( 23.79%)    4.75 ( 16.18%)    3.65 ( 35.46%)
5.98 (  0.00%)    4.97 ( 16.87%)    5.96 (  0.35%)    3.62 ( 39.40%)
6.58 (  0.00%)    4.94 ( 24.93%)    6.04 (  8.32%)    3.63 ( 44.89%)
7.19 (  0.00%)    5.18 ( 28.01%)    6.68 (  7.13%)    3.65 ( 49.22%)
7.67 (  0.00%)    5.27 ( 31.29%)    7.16 (  6.63%)    3.62 ( 52.76%)
7.88 (  0.00%)    5.36 ( 31.98%)    7.28 (  7.58%)    3.65 ( 53.71%)
7.99 (  0.00%)    5.39 ( 32.52%)    7.40 (  7.42%)    3.65 ( 54.25%)
The patch works because we update sum_exec_runtime on the current thread,
which assures that we see a proper sum_exec_runtime value even when the
group's threads run on different CPUs. I tested it with the reproducers
from commits 6e998916dfe32 and d670ec13178d0, and the patch did not break
them. I'm going to run some other tests.
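
For reference, those reproducers check along these lines (a minimal sketch
of the idea, not the exact programs from the commit messages): spawn busy
threads and verify that the process-wide CPU clock never goes backwards
between consecutive reads.

/*
 * Minimal sketch of the kind of monotonicity check the reproducers in
 * 6e998916dfe32 and d670ec13178d0 perform; not the exact code from those
 * commit messages.  Build with: gcc -O2 -pthread repro.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void *burn(void *arg)
{
	(void)arg;
	for (;;)	/* keep threads running so sum_exec_runtime advances */
		;
	return NULL;
}

int main(void)
{
	unsigned long long now, prev = 0;
	struct timespec ts;
	pthread_t tid;
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&tid, NULL, burn, NULL);

	for (;;) {
		clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
		now = ts.tv_sec * 1000000000ULL + ts.tv_nsec;
		if (now < prev) {
			printf("process cputime went backwards: %llu < %llu\n",
			       now, prev);
			exit(1);
		}
		prev = now;
	}
}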
The patch is a draft version for early review; task_sched_runtime() will be
simplified (since it is now called only for the current thread) and possibly
split into two functions: one that calls update_curr() and another that
returns sum_exec_runtime (assuring it is consistent on 32-bit arches).
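
Roughly, that split could have the following shape (function names here are
hypothetical, just to illustrate the idea, not from an actual patch):

/* Hypothetical helper: charge the pending runtime of a running thread. */
void task_update_curr_runtime(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq = task_rq_lock(p, &rf);

	if (task_current(rq, p) && task_on_rq_queued(p)) {
		update_rq_clock(rq);
		p->sched_class->update_curr(rq);
	}
	task_rq_unlock(rq, p, &rf);
}

/*
 * Hypothetical helper: read sum_exec_runtime.  On 64-bit a plain load of
 * a u64 is atomic; on 32-bit it has to be made consistent, e.g. by taking
 * the rq lock as sketched here.
 */
u64 task_sum_exec_runtime(struct task_struct *p)
{
#ifdef CONFIG_64BIT
	return p->se.sum_exec_runtime;
#else
	struct rq_flags rf;
	struct rq *rq = task_rq_lock(p, &rf);
	u64 ns = p->se.sum_exec_runtime;

	task_rq_unlock(rq, p, &rf);
	return ns;
#endif
}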
Stanislaw
View attachment "compare.txt" of type "text/plain" (27654 bytes)