linux-kernel - Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove

Date:	Sat, 3 Aug 2013 15:37:46 +0800
From:	ethan <ethan.kernel@...il.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>, johlstei@...eaurora.org,
	Yinghai Lu <yinghai@...nel.org>, Jin Feng <joe.jin@...cle.com>
Subject: Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer

Peter and tglx,
  Here are the results of some further crude hacking and testing, FYI.
With the default kernel 2.6.32-279.19.1.el6.x86_64 in CentOS 6.3, running on my ASUS 4-core Intel i5 server, I got close to the best performance from the
tool http://people.redhat.com/mingo/cfs-scheduler/tools/pipe-test-1m.c

[root@...alhost ~]# time ./pipe-test-1m

real	0m7.704s
user	0m0.047s
sys	0m4.815s
[root@...alhost ~]# time ./pipe-test-1m

real	0m8.000s
user	0m0.071s
sys	0m5.035s
[root@...alhost ~]# time ./pipe-test-1m

real	0m7.386s
user	0m0.086s
sys	0m4.591s
[root@...alhost ~]# time ./pipe-test-1m

real	0m7.919s
user	0m0.064s
sys	0m4.912s
[root@...alhost ~]# time ./pipe-test-1m

real	0m7.949s
user	0m0.083s
sys	0m4.917s

[root@...alhost ~]# time ./pipe-test-1m

real	0m7.913s
user	0m0.070s
sys	0m4.903s
[root@...alhost ~]# time ./pipe-test-1m

real	0m7.953s
user	0m0.092s
sys	0m4.881s
[root@...alhost ~]# time ./pipe-test-1m

real	0m8.059s
user	0m0.108s
sys	0m5.037s
[root@...alhost ~]# 

Then I compiled and booted stable 3.11.0-rc3 with the default configuration and redid the same test. Performance was much worse:
[root@...alhost ~]# uname -a
Linux localhost 3.11.0-rc3 #4 SMP Wed Jul 31 16:10:56 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@...alhost ~]# time ./pipe-test-1m


real	0m10.730s
user	0m0.245s
sys	0m6.596s
[root@...alhost ~]# time ./pipe-test-1m

real	0m10.661s
user	0m0.218s
sys	0m6.520s
[root@...alhost ~]# time ./pipe-test-1m

real	0m10.699s
user	0m0.233s
sys	0m6.534s
[root@...alhost ~]# time ./pipe-test-1m

real	0m10.616s
user	0m0.191s
sys	0m6.505s
[root@...alhost ~]# time ./pipe-test-1m

real	0m10.546s
user	0m0.214s
sys	0m6.441s

[root@...alhost ~]# time ./pipe-test-1m

real	0m10.631s
user	0m0.204s
sys	0m6.509s

The first crude hack was to disable the reprogramming in __remove_hrtimer() in the 3.11-rc3 code and redo the test.
The result is much better:

[root@...alhost ~]# time ./pipe-test-1m

real	0m9.447s
user	0m0.227s
sys	0m5.900s
[root@...alhost ~]# time ./pipe-test-1m

real	0m9.507s
user	0m0.226s
sys	0m5.922s
[root@...alhost ~]# time ./pipe-test-1m

real	0m9.495s
user	0m0.228s
sys	0m5.916s
[root@...alhost ~]# time ./pipe-test-1m

real	0m9.470s
user	0m0.229s
sys	0m5.938s
[root@...alhost ~]# time ./pipe-test-1m

real	0m9.484s
user	0m0.269s
sys	0m5.875s
[root@...alhost ~]# time ./pipe-test-1m

real	0m9.328s
user	0m0.242s
sys	0m5.767s

While monitoring the wakeups with powertop, I got:
Top causes for wakeups:
  98.5% (  inf)      <kernel IPI> : Rescheduling interrupts
   0.5% (  inf)         swapper/3 : hrtimer_start_range_ns (tick_sched_timer)
   0.3% (  inf)         swapper/2 : hrtimer_start_range_ns (tick_sched_timer)
   0.2% (  inf)         swapper/1 : hrtimer_start_range_ns (tick_sched_timer)
   0.2% (  inf)         swapper/0 : hrtimer_start_range_ns (tick_sched_timer)

So I did the second crude hack: I commented out the sending of the rescheduling IPI in the following function and redid the test.

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 4137890..c27f04f 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -137,7 +137,7 @@ static inline void play_dead(void)
 
 static inline void smp_send_reschedule(int cpu)
 {
-       smp_ops.smp_send_reschedule(cpu);
+       /* smp_ops.smp_send_reschedule(cpu); */
 }

This gave performance as good as the 2.6.32 kernel, and scheduling also seems OK.

[root@...alhost ~]# time ./pipe-test-1m

real    0m7.661s
user    0m0.179s
sys     0m4.880s
[root@...alhost ~]# time ./pipe-test-1m

real    0m7.473s
user    0m0.189s
sys     0m4.782s
[root@...alhost ~]# time ./pipe-test-1m

real    0m7.658s
user    0m0.195s
sys     0m4.899s
[root@...alhost ~]# time ./pipe-test-1m

real    0m7.644s
user    0m0.194s
sys     0m4.941s
[root@...alhost ~]# time ./pipe-test-1m

real    0m7.694s
user    0m0.189s
sys     0m4.925s
[root@...alhost ~]# time ./pipe-test-1m

real    0m7.694s
user    0m0.197s
sys     0m4.915s
[root@...alhost ~]# time ./pipe-test-1m

real    0m7.597s
user    0m0.190s
sys     0m4.886s

The two processes (pipe-test-1m and its child) still seem to be balanced well across cpu0 to cpu3:
#top   
f  J
14888 root      20   0   68    0 R 73.2  0.0   0:03.22 2 pip1m
14887 root      20   0  284  224 S 63.4  0.0   0:03.23 0 pip1m

So the crude hacks and tests above basically show that the most expensive thing is the rescheduling IPI, and
the second most expensive thing is the extra hrtimer reprogramming/tick in the Linux 3.11-rc3 code.
We need to send as few rescheduling IPIs and do as little reprogramming as possible to get better performance.
Do the hacks and the tests make sense, and are the results reasonable?


Thanks,
Ethan



On 2013-7-30, at 7:59 PM, Peter Zijlstra <peterz@...radead.org> wrote:

> On Tue, Jul 30, 2013 at 07:44:03PM +0800, Ethan Zhao wrote:
>> Got it.
>> what tglx and you mean
>> 
>> 
>> So the expensive thing maybe not inside the schedule(), but could
>> outside the scheduler(), the more bigger forever loop.
>> 
>> This is one part of what I am facing.
> 
> Right, so it would be good if you could further diagnose the problem so
> we can come up with a solution that cures the problem while retaining
> the current 'desired' properties.
> 
> The patch you pinpointed caused a regression in that it would wake from
> NOHZ mode far too often. Could it be that the now longer idle sections
> cause your CPU to go into deeper idle modes and you're suffering from
> idle-exit latencies?
