Message-ID: <20090120124500.GE7790@elte.hu>
Date: Tue, 20 Jan 2009 13:45:00 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Nick Piggin <npiggin@...e.de>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>, hpa@...or.com,
jeremy@...source.com, chrisw@...s-sol.org, zach@...are.com,
rusty@...tcorp.com.au
Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
* Nick Piggin <npiggin@...e.de> wrote:
> On Tue, Jan 20, 2009 at 12:26:34PM +0100, Ingo Molnar wrote:
> >
> > * Nick Piggin <npiggin@...e.de> wrote:
> >
> > > Hi,
> > >
> > > I'm looking at regressions since 2.6.16, and one is that lat_mmap has
> > > slowed down. On further investigation, a large part of this turns out
> > > to be due not to a _regression_ as such, but to the introduction of
> > > CONFIG_PARAVIRT=y.
> > >
> > > Now, it is true that lat_mmap is basically a microbenchmark; however,
> > > it exercises the memory-mapping and page-fault handler paths, which
> > > are pretty important ones, so I think it should be of interest.
> > >
> > > I've run the tests on a 2s8c AMD Barcelona system, binding the test to
> > > CPU0 and running it 100 times (stddev is a bit hard to bring down, and
> > > my scripts needed 100 runs to pick up much smaller changes in the
> > > results -- for CONFIG_PARAVIRT, just a couple of runs should show up
> > > the problem).
> > >
> > > Times are, I believe, in nanoseconds for lmbench; in any case, lower
> > > is better.
> > >
> > > non-PV     AVG=464.22  STD=5.56
> > > paravirt   AVG=502.87  STD=7.36
> > >
> > > Nearly 10% performance drop here, which is quite a bit... hopefully
> > > people are testing the speed of their PV implementations against non-PV
> > > bare metal :)
> >
> > Ouch, that looks unacceptably expensive. All the major distros turn
> > CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> > promise to have no measurable runtime overhead.
> >
> > ( And I suspect the real-life mmap cost is probably even higher, as on
> > a Barcelona all of lmbench fits into the cache, hence we don't see any
> > real $cache overhead. )
>
> The PV kernel's text is over 100K larger, nearly 40K of that in mm/ and
> kernel/ alone. We definitely don't see the worst of the icache or
> branch-target-buffer overhead in this microbenchmark. (wow, that's a
> nasty amount of bloat :( )
>
>
> > Jeremy, any ideas where this slowdown comes from and how it could be
> > fixed?
>
> I had a bit of a poke around the profiles, but nothing stood out.
> However, oprofile counted 50% more cycles in the kernel with PV than
> with non-PV. I'll have to take a look at the user/system times, because
> 50% seems ludicrous... hopefully it's just oprofile noise.
If you have a Core2 test-system, could you please try tip/master, which
also has your do_page_fault de-bloating patch applied?
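( If you don't have the tip tree handy, it should -- if memory serves
  about the URL, so treat it as an assumption -- be clonable via:

    $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git

  with master there corresponding to tip/master. )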
<plug>
The other advantage of tip/master would be that you could try precise
performance counter measurements via:
http://redhat.com/~mingo/perfcounters/timec.c
and split out the lmbench test-case into a standalone .c file loop.
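Something like the snippet below would do as the standalone loop (an
untested sketch, of course -- the file name, mapping size and iteration
count are arbitrary placeholders, not lat_mmap's exact parameters):

  /* mmap-test.c: standalone approximation of a map/touch/unmap loop.
   * Untested sketch; all parameters below are made up.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define SIZE  (64UL * 1024 * 1024)    /* mapping size: arbitrary */
  #define LOOPS 1000                    /* iterations: arbitrary */

  int main(void)
  {
          volatile char sum = 0;
          size_t off;
          int i, fd;

          fd = open("mmap-test.dat", O_RDWR | O_CREAT, 0644);
          if (fd < 0 || ftruncate(fd, SIZE) < 0) {
                  perror("setup");
                  return 1;
          }
          for (i = 0; i < LOOPS; i++) {
                  char *p = mmap(NULL, SIZE, PROT_READ, MAP_SHARED, fd, 0);

                  if (p == MAP_FAILED) {
                          perror("mmap");
                          return 1;
                  }
                  /* touch one byte per page to exercise the fault path */
                  for (off = 0; off < SIZE; off += 4096)
                          sum += p[off];
                  munmap(p, SIZE);
          }
          close(fd);
          return 0;
  }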
Running it as:
$ taskset -c 0 ./timec -e -5,-4,-3,0,1,2,3 ./mmap-test
will give you very precise information about what's going on in that
workload:
 Performance counter stats for 'mmap-test':

     628315.871980  task clock ticks   (msecs)
             42330  CPU migrations     (events)
            124980  context switches   (events)
          18698292  pagefaults         (events)
     1351875946010  CPU cycles         (events)
     1121901478363  instructions       (events)
       10654788968  cache references   (events)
         633581867  cache misses       (events)
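( Quick arithmetic on the sample numbers above: instructions per cycle
  is 1121901478363/1351875946010 ~= 0.83, and the cache-miss rate is
  633581867/10654788968 ~= 5.9% -- handy for telling frontend/icache
  bloat apart from real cache-miss pain. )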
You might also want to try an NMI profile via kerneltop:
http://redhat.com/~mingo/perfcounters/kerneltop.c
Just run it with no arguments on a perfcounters kernel and it will give
you something like:
------------------------------------------------------------------------------
KernelTop: 20297 irqs/sec [NMI, 10000 cache-misses], (all, 8 CPUs)
------------------------------------------------------------------------------
     events    RIP                  kernel function
     ______    ________________     ________________________

   12816.00 -  ffffffff803d5760  :  copy_user_generic_string!
   11751.00 -  ffffffff80647a2c  :  unix_stream_recvmsg
   10215.00 -  ffffffff805eda5f  :  sock_alloc_send_skb
    9738.00 -  ffffffff80284821  :  flush_free_list
    6749.00 -  ffffffff802854a1  :  __kmalloc_track_caller
    3663.00 -  ffffffff805f09fa  :  skb_dequeue
    3591.00 -  ffffffff80284be2  :  kmem_cache_alloc   [qla2xxx]
    3501.00 -  ffffffff805f15f5  :  __alloc_skb
    1296.00 -  ffffffff803d8eb4  :  list_del           [qla2xxx]
    1110.00 -  ffffffff805f0ed2  :  kfree_skb
Ingo