Date:	Sat, 23 Dec 2006 11:59:37 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	"Chen, Tim C" <tim.c.chen@...el.com>
Cc:	linux-kernel@...r.kernel.org,
	"Siddha, Suresh B" <suresh.b.siddha@...el.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: 2.6.19-rt14 slowdown compared to 2.6.19


* Chen, Tim C <tim.c.chen@...el.com> wrote:

> Ingo,
>  
> We did some benchmarking on 2.6.19-rt14, compared it with 2.6.19 
> kernel and noticed several slowdowns.  The test machine is a 2 socket 
> woodcrest machine with your default configuration.

cool - thanks for the feedback! Running the 64-bit kernel, right?

> Netperf TCP Streaming was slower by 40% ( 1 server and 1 client each 
> bound to separate cpu cores on different socket, network loopback mode 
> was used).
> 
> Volanomark was slower by 80% (Server and Clients communicate with 
> network loopback mode. Idle time goes from 1% to 60%)
> 
> Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)
> 
> Wonder if you have any suggestions on what could cause the slowdown.  
> We've tried disabling CONFIG_NO_HZ and it didn't help much.

while some slowdown is to be expected, did idle time increase 
significantly in each case? If yes, then this is the effect of lock 
contention. Lock contention effects are 'magnified' by PREEMPT_RT. For 
example, if you run a 128-thread workload where all threads use the same 
lock, the -rt kernel can act as if it were a 128-way box (!). This way, 
by running -rt you'll see scalability problems a lot sooner than on real 
hardware. In other words: PREEMPT_RT in essence simulates the 
scalability behavior of up to an infinite number of CPUs. (with the 
exception of cachemiss emulation ;) [the effect is not this precise, but 
that's the rough trend]

the good news: any fix for such problems in -rt is useful for the 
upstream kernel too, because it improves scalability on real hardware as 
well. The bad news: this work hasn't been finished yet ;-)

We've fixed a couple of contention points discovered by -rt this way, in 
particular Peter's excellent scalability patches at:

  http://programming.kicks-ass.net/kernel-patches/

were done based on lock-contention profiling feedback. I've included the 
lockless pagecache and the s_files scalability patches which were very 
useful already. The lru_cache_add one is the next one for me to try, 
plus on i686 -rt kernels highmem locking hurts, hence the 
highmem_rewrite.patch (not in -rt yet either). Plus there's the really 
exciting concurrent-pagecache patch-queue from Peter, which will be 
interesting to try too.

If you'd like to profile this yourself then the lowest-cost way of 
profiling lock contention on -rt is to use the yum kernel and run the 
attached trace-it-lock-prof.c code on the box while your workload is in 
'steady state' (and is showing those extended idle times):

  ./trace-it-lock-prof > trace.txt

this captures up to 1 second's worth of system activity, on the current 
CPU. Then you can construct the histogram via:

  grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
                                  uniq -c | sort -n > prof.txt

I just did this for "hackbench 20", and it gave me:

     38  kfree()<-skb_release_data()<-kfree_skbmem()<-(-1)()
     40  kmem_cache_free()<-kfree_skbmem()<-__kfree_skb()<-(-1)()
     47  __schedule()<-preempt_schedule_irq()<-retint_kernel()<-__spin_unlock_irqrestore()
    173  rt_read_lock()<-sock_def_readable()<-unix_stream_sendmsg()<-(-1)()
    173  __schedule()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
    274  __schedule()<-schedule()<-posix_cpu_timers_thread()<-kthread()
    277  __schedule()<-schedule()<-ksoftirqd()<-kthread()
    365  kmem_cache_alloc_node()<-__alloc_skb()<-sock_alloc_send_skb()<-(-1)()
    366  kmem_cache_alloc_node()<-__kmalloc_node()<-__alloc_skb()<-(-1)()
    814  __schedule()<-schedule()<-rt_spin_lock_slowlock()<-rt_spin_lock()
   1819  __schedule()<-schedule()<-int_careful()<-<ffffe405>()
   2346  __schedule()<-schedule()<-schedule_timeout()<-unix_stream_recvmsg()
   2346  sock_aio_read()<-do_sync_read()<-vfs_read()<-(-1)()

(note the output mixes the two lines of symbolic stack backtrace entries 
that are generated by the tracer, so the last two lines for example are 
for the same contention/preemption point.)

this means that the biggest contention points (or natural rescheduling 
points) in this workload are sock_aio_read()->unix_stream_recvmsg() - 
not unexpected. But there's also ~10% of contention in the kmalloc code, 
meaning that for this workload the per-CPU slabs are probably set a bit 
too low. There's also ~5% of contention from sock_def_readable()'s use 
of an rwlock.

with more CPUs and more tasks you'll see these effects in a more drastic 
way.

	Ingo

View attachment "trace-it-lock-prof.c" of type "text/plain" (2531 bytes)
