Message-ID: <20061223105937.GA21172@elte.hu>
Date: Sat, 23 Dec 2006 11:59:37 +0100
From: Ingo Molnar <mingo@...e.hu>
To: "Chen, Tim C" <tim.c.chen@...el.com>
Cc: linux-kernel@...r.kernel.org,
"Siddha, Suresh B" <suresh.b.siddha@...el.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: 2.6.19-rt14 slowdown compared to 2.6.19
* Chen, Tim C <tim.c.chen@...el.com> wrote:
> Ingo,
>
> We did some benchmarking on 2.6.19-rt14, compared it with 2.6.19
> kernel and noticed several slowdowns. The test machine is a 2 socket
> woodcrest machine with your default configuration.
cool - thanks for the feedback! Running the 64-bit kernel, right?
> Netperf TCP Streaming was slower by 40% ( 1 server and 1 client each
> bound to separate cpu cores on different socket, network loopback mode
> was used).
>
> Volanomark was slower by 80% (Server and Clients communicate with
> network loopback mode. Idle time goes from 1% to 60%)
>
> Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)
>
> Wonder if you have any suggestions on what could cause the slowdown.
> We've tried disabling CONFIG_NO_HZ and it didn't help much.
while some slowdown is to be expected, did idle time increase
significantly in each case? If yes then this is the effect of lock
contention. Lock contention effects are 'magnified' by PREEMPT_RT. For
example if you run a 128-thread workload where every thread uses the
same lock then the -rt kernel can act as if it were a 128-way box (!).
This way by running -rt you'll see scalability problems a lot sooner
than on real hardware. In other words: PREEMPT_RT in essence simulates
the scalability behavior of up to an infinite number of CPUs. (with the
exception of cachemiss emulation ;) [the effect is not this precise, but
that's the rough trend]
the good news: any fix for such problems in -rt is useful for the
upstream kernel too, because it improves scalability on real hardware as
well. The bad news: this work hasn't been finished yet ;-)
We've fixed a couple of contention points discovered by -rt this way. In
particular, Peter's excellent scalability patches at:
http://programming.kicks-ass.net/kernel-patches/
were done based on lock-contention profiling feedback. I've included the
lockless pagecache and the s_files scalability patches which were very
useful already. The lru_cache_add one is the next one for me to try,
plus on i686 -rt kernels highmem locking hurts, hence the
highmem_rewrite.patch (not in -rt yet either). Plus there's the really
exciting concurrent-pagecache patch-queue from Peter, which will be
interesting to try too.
If you'd like to profile this yourself then the lowest-cost way of
profiling lock contention on -rt is to use the yum kernel and run the
attached trace-it-lock-prof.c code on the box while your workload is in
'steady state' (and is showing those extended idle times):

    ./trace-it-lock-prof > trace.txt

this captures up to 1 second's worth of system activity on the current
CPU. Then you can construct the histogram via:

    grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
        uniq -c | sort -n > prof.txt

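For anyone who would rather script this than use the shell pipeline above, the same histogram can be sketched in Python. This is only an illustration: the function names are made up here, and it assumes (as the cut step implies) that each trace line has a prefix ending in ':' before the backtrace text. The '--' group separators that grep -A prints are not reproduced, and the order of entries with equal counts may differ from sort -n.

```python
from collections import Counter

# Same pattern the grep step looks for (note the leading space).
MARKER = " __schedule()<-"


def cut_after_first_colon(line):
    # Mimics `cut -d: -f2-`: drop everything up to and including the first ':'.
    # Lines without a ':' pass through unchanged, as cut does.
    head, sep, tail = line.partition(":")
    return tail if sep else line


def schedule_histogram(lines):
    """Count each matching line and the line after it (like grep -A 1),
    and return (count, text) pairs sorted by ascending count."""
    lines = [line.rstrip("\n") for line in lines]
    keep = set()
    for i, line in enumerate(lines):
        if MARKER in line:
            keep.add(i)
            if i + 1 < len(lines):
                keep.add(i + 1)
    counts = Counter(cut_after_first_colon(lines[i]) for i in sorted(keep))
    return sorted((n, text) for text, n in counts.items())
```

It could be driven with something like `for n, text in schedule_histogram(open("trace.txt")): print(n, text)`.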
I just did this for "hackbench 20", and it gave me:

      38 kfree()<-skb_release_data()<-kfree_skbmem()<-(-1)()
      40 kmem_cache_free()<-kfree_skbmem()<-__kfree_skb()<-(-1)()
      47 __schedule()<-preempt_schedule_irq()<-retint_kernel()<-__spin_unlock_irqrestore()
     173 rt_read_lock()<-sock_def_readable()<-unix_stream_sendmsg()<-(-1)()
     173 __schedule()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
     274 __schedule()<-schedule()<-posix_cpu_timers_thread()<-kthread()
     277 __schedule()<-schedule()<-ksoftirqd()<-kthread()
     365 kmem_cache_alloc_node()<-__alloc_skb()<-sock_alloc_send_skb()<-(-1)()
     366 kmem_cache_alloc_node()<-__kmalloc_node()<-__alloc_skb()<-(-1)()
     814 __schedule()<-schedule()<-rt_spin_lock_slowlock()<-rt_spin_lock()
    1819 __schedule()<-schedule()<-int_careful()<-<ffffe405>()
    2346 __schedule()<-schedule()<-schedule_timeout()<-unix_stream_recvmsg()
    2346 sock_aio_read()<-do_sync_read()<-vfs_read()<-(-1)()

(note the output mixes the two lines of symbolic stack backtrace entries
that are generated by the tracer, so the last two lines for example are
for the same contention/preemption point.)
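Since each entry spans two backtrace lines, a variant that counts the two lines together as one key avoids this mixing. What follows is a rough sketch, not part of the attached tool: the function names are hypothetical, and it assumes the tracer always emits exactly two lines per entry, the first containing ' __schedule()<-', each with a prefix ending in ':'.

```python
from collections import Counter

# Pattern that marks the first line of a two-line backtrace entry.
MARKER = " __schedule()<-"


def strip_prefix(line):
    # Drop everything up to and including the first ':' (if any),
    # then trim surrounding whitespace.
    head, sep, tail = line.rstrip("\n").partition(":")
    return (tail if sep else head).strip()


def paired_histogram(lines):
    """Count (first line, second line) backtrace pairs and return
    (count, first, second) tuples sorted by ascending count."""
    lines = list(lines)
    counts = Counter()
    i = 0
    while i < len(lines):
        if MARKER in lines[i]:
            first = strip_prefix(lines[i])
            second = strip_prefix(lines[i + 1]) if i + 1 < len(lines) else ""
            counts[(first, second)] += 1
            i += 2  # the following line belongs to this entry
        else:
            i += 1
    return sorted((n, first, second) for (first, second), n in counts.items())
```

With this, the two 2346 lines above would show up as a single entry, and entries whose second line varies are split out by caller.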
this means that the biggest contention points (or natural rescheduling
points) in this workload are sock_aio_read()->unix_stream_recvmsg() -
not unexpected. But there's also ~10% of contention in the kmalloc code,
meaning that for this workload the per-CPU slabs are probably set a bit
too low. There's also ~5% of contention from sock_def_readable()'s use
of an rwlock.
with more CPUs and more tasks you'll see these effects in a more drastic
way.
Ingo
View attachment "trace-it-lock-prof.c" of type "text/plain" (2531 bytes)