Message-ID: <20160812121010.GA30199@redhat.com>
Date: Fri, 12 Aug 2016 14:10:11 +0200
From: Stanislaw Gruszka <sgruszka@...hat.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: Giovanni Gherdovich <ggherdovich@...e.cz>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Mike Galbraith <mgalbraith@...e.de>,
linux-kernel@...r.kernel.org,
Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in
times()/clock_gettime()
Hi
On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> Nice detective work! I'm wondering, where do we stand if compared with a
> pre-6e998916dfe3 kernel?
>
> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
> make the result correct, just see what the performance gap is, roughly.
>
> If there's still a significant gap then it might make sense to optimize this some
> more.
I measured the (partial) revert performance on 4.7 using the mmtests
instructions from Giovanni, and also tested another possible fix (draft version):
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..54fdf6d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;
 
+	(void) task_sched_runtime(tsk);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
-			times->sum_exec_runtime += task_sched_runtime(t);
+			times->sum_exec_runtime += t->se.sum_exec_runtime;
 		}
 		/* If lockless access failed, take the lock. */
 		nextseq = 1;
---
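
For context, the per-thread task_sched_runtime() call that the second hunk
removes looks roughly like this (paraphrased and simplified from
kernel/sched/core.c around v4.7, not verbatim); the task_rq_lock() it takes
for every thread in the group is where the times()/clock_gettime() cost
comes from:

unsigned long long task_sched_runtime(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq;
	u64 ns;

	/*
	 * Per-thread rq lock: cheap for a single thread, expensive when
	 * thread_group_cputime() takes it once for every thread.
	 */
	rq = task_rq_lock(p, &rf);
	/*
	 * Only account pending runtime for a task that is current and
	 * queued, otherwise we could charge cycles that will never be
	 * accounted to this thread.
	 */
	if (task_current(rq, p) && task_on_rq_queued(p)) {
		update_rq_clock(rq);
		p->sched_class->update_curr(rq);
	}
	ns = p->se.sum_exec_runtime;
	task_rq_unlock(rq, p, &rf);

	return ns;
}

With the patch, that locked update happens once, for the current thread,
and the loop then does plain reads of t->se.sum_exec_runtime for the rest.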
mmtests benchmark results are below (full compare-kernels.sh output is in the attachment):
 vanilla-4.7          revert           prefetch           patch
4.74 (  0.00%)    3.04 ( 35.93%)    4.09 ( 13.81%)    1.30 ( 72.59%)
5.49 (  0.00%)    5.00 (  8.97%)    5.34 (  2.72%)    1.03 ( 81.16%)
6.12 (  0.00%)    4.91 ( 19.73%)    5.97 (  2.40%)    0.90 ( 85.27%)
6.68 (  0.00%)    4.90 ( 26.66%)    6.02 (  9.75%)    0.88 ( 86.89%)
7.21 (  0.00%)    5.13 ( 28.85%)    6.70 (  7.09%)    0.87 ( 87.91%)
7.66 (  0.00%)    5.22 ( 31.80%)    7.17 (  6.39%)    0.92 ( 88.01%)
7.91 (  0.00%)    5.36 ( 32.22%)    7.30 (  7.72%)    0.95 ( 87.97%)
7.95 (  0.00%)    5.35 ( 32.73%)    7.34 (  7.66%)    1.06 ( 86.66%)
8.00 (  0.00%)    5.33 ( 33.31%)    7.38 (  7.73%)    1.13 ( 85.82%)

5.61 (  0.00%)    3.55 ( 36.76%)    4.53 ( 19.23%)    2.29 ( 59.28%)
5.66 (  0.00%)    4.32 ( 23.79%)    4.75 ( 16.18%)    3.65 ( 35.46%)
5.98 (  0.00%)    4.97 ( 16.87%)    5.96 (  0.35%)    3.62 ( 39.40%)
6.58 (  0.00%)    4.94 ( 24.93%)    6.04 (  8.32%)    3.63 ( 44.89%)
7.19 (  0.00%)    5.18 ( 28.01%)    6.68 (  7.13%)    3.65 ( 49.22%)
7.67 (  0.00%)    5.27 ( 31.29%)    7.16 (  6.63%)    3.62 ( 52.76%)
7.88 (  0.00%)    5.36 ( 31.98%)    7.28 (  7.58%)    3.65 ( 53.71%)
7.99 (  0.00%)    5.39 ( 32.52%)    7.40 (  7.42%)    3.65 ( 54.25%)
The patch works because we update sum_exec_runtime on the current thread,
which assures that we see a proper sum_exec_runtime value even when the
group's threads run on different CPUs. I tested it with the reproducers
from commits 6e998916dfe32 and d670ec13178d0, and the patch did not break
them. I'm going to run some other tests.
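
For reference, those reproducers check along these lines (a minimal sketch
of the idea, not the exact programs from the commit messages): spawn busy
threads and verify that the process-wide CPU clock never goes backwards
between consecutive reads.

/*
 * Minimal sketch of the kind of monotonicity check the reproducers in
 * 6e998916dfe32 and d670ec13178d0 perform; not the exact code from those
 * commit messages.  Build with: gcc -O2 -pthread repro.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void *burn(void *arg)
{
	(void)arg;
	for (;;)	/* keep threads running so sum_exec_runtime advances */
		;
	return NULL;
}

int main(void)
{
	unsigned long long now, prev = 0;
	struct timespec ts;
	pthread_t tid;
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&tid, NULL, burn, NULL);

	for (;;) {
		clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
		now = ts.tv_sec * 1000000000ULL + ts.tv_nsec;
		if (now < prev) {
			printf("process cputime went backwards: %llu < %llu\n",
			       now, prev);
			exit(1);
		}
		prev = now;
	}
}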
The patch is a draft version for early review; task_sched_runtime() will be
simplified (since it is now called only for the current thread) and possibly
split into two functions: one that calls update_curr() and another that
returns sum_exec_runtime (assuring it is consistent on 32-bit arches).
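
Roughly, that split could have the following shape (function names here are
hypothetical, just to illustrate the idea, not from an actual patch):

/* Hypothetical helper: charge the pending runtime of a running thread. */
void task_update_curr_runtime(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq = task_rq_lock(p, &rf);

	if (task_current(rq, p) && task_on_rq_queued(p)) {
		update_rq_clock(rq);
		p->sched_class->update_curr(rq);
	}
	task_rq_unlock(rq, p, &rf);
}

/*
 * Hypothetical helper: read sum_exec_runtime.  On 64-bit a plain load of
 * a u64 is atomic; on 32-bit it has to be made consistent, e.g. by taking
 * the rq lock as sketched here.
 */
u64 task_sum_exec_runtime(struct task_struct *p)
{
#ifdef CONFIG_64BIT
	return p->se.sum_exec_runtime;
#else
	struct rq_flags rf;
	struct rq *rq = task_rq_lock(p, &rf);
	u64 ns = p->se.sum_exec_runtime;

	task_rq_unlock(rq, p, &rf);
	return ns;
#endif
}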
Stanislaw
View attachment "compare.txt" of type "text/plain" (27654 bytes)