netdev - Re: [PATCH] poll: Avoid extra wakeups in select/poll

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <49F9B0F0.40306@cosmosbay.com>
Date:	Thu, 30 Apr 2009 16:08:48 +0200
From:	Eric Dumazet <dada1@...mosbay.com>
To:	Ingo Molnar <mingo@...e.hu>
CC:	Christoph Lameter <cl@...ux.com>,
	linux kernel <linux-kernel@...r.kernel.org>,
	Andi Kleen <andi@...stfloor.org>,
	David Miller <davem@...emloft.net>, jesse.brandeburg@...el.com,
	netdev@...r.kernel.org, haoki@...hat.com, mchan@...adcom.com,
	davidel@...ilserver.org
Subject: Re: [PATCH] poll: Avoid extra wakeups in select/poll

Ingo Molnar a écrit :
> * Eric Dumazet <dada1@...mosbay.com> wrote:
> 
>> Ingo Molnar a écrit :
>>> * Eric Dumazet <dada1@...mosbay.com> wrote:
>>>
>>>> On uddpping, I had prior to the patch about 49000 wakeups per 
>>>> second, and after patch about 26000 wakeups per second (matches 
>>>> number of incoming udp messages per second)
>>> very nice. It might not show up as a real performance difference if 
>>> the CPUs are not fully saturated during the test - but it could show 
>>> up as a decrease in CPU utilization.
>>>
>>> Also, if you run the test via 'perf stat -a ./test.sh' you should 
>>> see a reduction in instructions executed:
>>>
>>> aldebaran:~/linux/linux> perf stat -a sleep 1
>>>
>>>  Performance counter stats for 'sleep':
>>>
>>>    16128.045994  task clock ticks     (msecs)
>>>           12876  context switches     (events)
>>>             219  CPU migrations       (events)
>>>          186144  pagefaults           (events)
>>>     20911802763  CPU cycles           (events)
>>>     19309416815  instructions         (events)
>>>       199608554  cache references     (events)
>>>        19990754  cache misses         (events)
>>>
>>>  Wall-clock time elapsed:  1008.882282 msecs
>>>
>>> With -a it's measured system-wide, from start of test to end of test 
>>> - the results will be a lot more stable (and relevant) statistically 
>>> than wall-clock time or CPU usage measurements. (both of which are 
>>> rather imprecise in general)
>> I tried this perf stuff and got strange results on a cpu burning 
>> bench, saturating my 8 cpus with a "while (1) ;" loop
>>
>>
>> # perf stat -a sleep 10
>>
>>  Performance counter stats for 'sleep':
>>
>>    80334.709038  task clock ticks     (msecs)
>>           80638  context switches     (events)
>>               4  CPU migrations       (events)
>>             468  pagefaults           (events)
>>    160694681969  CPU cycles           (events)
>>    160127154810  instructions         (events)
>>          686393  cache references     (events)
>>          230117  cache misses         (events)
>>
>>  Wall-clock time elapsed: 10041.531644 msecs
>>
>> So its about 16069468196 cycles per second for 8 cpus
>> Divide by 8 to get 2008683524 cycles per second per cpu,
>> which is not       3000000000  (E5450  @ 3.00GHz)
> 
> What does "perf stat -l -a sleep 10" show? I suspect your counters 
> are scaled by about 67%, due to counter over-commit. -l will show 
> the scaling factor (and will scale up the results).

Only difference I see with '-l' is cache misses not counted.

(tbench 8 running, so not one instruction per cycle)

# perf stat -l -a sleep 10

 Performance counter stats for 'sleep':

   80007.128844  task clock ticks     (msecs)
        6754642  context switches     (events)
              2  CPU migrations       (events)
            474  pagefaults           (events)
   160925719143  CPU cycles           (events)
   108482003620  instructions         (events)
     7584035056  cache references     (events)
  <not counted>  cache misses

 Wall-clock time elapsed: 10000.595448 msecs

# perf stat -a sleep 10

 Performance counter stats for 'sleep':

   80702.908287  task clock ticks     (msecs)
        6792588  context switches     (events)
             24  CPU migrations       (events)
           4867  pagefaults           (events)
   161957342744  CPU cycles           (events)
   109147553984  instructions         (events)
     7633190481  cache references     (events)
       22996234  cache misses         (events)

 Wall-clock time elapsed: 10087.502391 msecs



> 
> If so then i think this behavior is confusing, and i'll make -l 
> default-enabled. (in fact i just committed this change to latest 
> -tip and pushed it out)
> 
> To get only instructions and cycles, do:
> 
>    perf stat -e instructions -e cycles
> 

# perf stat -e instructions -e cycles -a sleep 10

 Performance counter stats for 'sleep':

   109469842392  instructions         (events)
   162012922122  CPU cycles           (events)

 Wall-clock time elapsed: 10124.943544 msecs

I am wondering if cpus are not running at 2 GHz ;)


>> It seems strange a "jmp myself" uses one unhalted cycle per 
>> instruction and 0.5 halted cycle ...
>>
>> Also, after using "perf stat", tbench results are 1778 MB/S 
>> instead of 2610 MB/s. Even if no perf stat running.
> 
> Hm, that would be a bug. Could you send the dmesg output of:
> 
>    echo p > /proc/sysrq-trigger 
>    echo p > /proc/sysrq-trigger 
> 
> with counters running it will show something like:
> 
> [  868.105712] SysRq : Show Regs
> [  868.106544] 
> [  868.106544] CPU#1: ctrl:       ffffffffffffffff
> [  868.106544] CPU#1: status:     0000000000000000
> [  868.106544] CPU#1: overflow:   0000000000000000
> [  868.106544] CPU#1: fixed:      0000000000000000
> [  868.106544] CPU#1: used:       0000000000000000
> [  868.106544] CPU#1:   gen-PMC0 ctrl:  00000000001300c0
> [  868.106544] CPU#1:   gen-PMC0 count: 000000ffee889194
> [  868.106544] CPU#1:   gen-PMC0 left:  0000000011e1791a
> [  868.106544] CPU#1:   gen-PMC1 ctrl:  000000000013003c
> [  868.106544] CPU#1:   gen-PMC1 count: 000000ffd2542438
> [  868.106544] CPU#1:   gen-PMC1 left:  000000002dd17a8e

They stay fix (but only CPU#0 is displayed)

Is perf able to display per cpu counters, and not aggregated values ?

[ 7894.426787] CPU#0: ctrl:       ffffffffffffffff
[ 7894.426788] CPU#0: status:     0000000000000000
[ 7894.426790] CPU#0: overflow:   0000000000000000
[ 7894.426792] CPU#0: fixed:      0000000000000000
[ 7894.426793] CPU#0: used:       0000000000000000
[ 7894.426796] CPU#0:   gen-PMC0 ctrl:  0000000000134f2e
[ 7894.426798] CPU#0:   gen-PMC0 count: 000000ffb91e31e1
[ 7894.426799] CPU#0:   gen-PMC0 left:  000000007fffffff
[ 7894.426802] CPU#0:   gen-PMC1 ctrl:  000000000013412e
[ 7894.426804] CPU#0:   gen-PMC1 count: 000000ff80312b23
[ 7894.426805] CPU#0:   gen-PMC1 left:  000000007fffffff
[ 7894.426807] CPU#0: fixed-PMC0 count: 000000ffacf54a68
[ 7894.426809] CPU#0: fixed-PMC1 count: 000000ffb71cfe02
[ 7894.426811] CPU#0: fixed-PMC2 count: 0000000000000000
[ 7905.522262] SysRq : Show Regs
[ 7905.522277]
[ 7905.522279] CPU#0: ctrl:       ffffffffffffffff
[ 7905.522281] CPU#0: status:     0000000000000000
[ 7905.522283] CPU#0: overflow:   0000000000000000
[ 7905.522284] CPU#0: fixed:      0000000000000000
[ 7905.522286] CPU#0: used:       0000000000000000
[ 7905.522288] CPU#0:   gen-PMC0 ctrl:  0000000000134f2e
[ 7905.522290] CPU#0:   gen-PMC0 count: 000000ffb91e31e1
[ 7905.522292] CPU#0:   gen-PMC0 left:  000000007fffffff
[ 7905.522294] CPU#0:   gen-PMC1 ctrl:  000000000013412e
[ 7905.522296] CPU#0:   gen-PMC1 count: 000000ff80312b23
[ 7905.522298] CPU#0:   gen-PMC1 left:  000000007fffffff
[ 7905.522299] CPU#0: fixed-PMC0 count: 000000ffacf54a68
[ 7905.522301] CPU#0: fixed-PMC1 count: 000000ffb71cfe02
[ 7905.522303] CPU#0: fixed-PMC2 count: 0000000000000000


> 
> the counts should stay put (i.e. all counters should be disabled). 
> If they move around - despite there being no 'perf stat -a' session 
> running, that would be a bug.

I rebooted my machine then got good results

# perf stat -e instructions -e cycles -a sleep 10

 Performance counter stats for 'sleep':

   240021659058  instructions         (events)
   240997984836  CPU cycles           (events) << OK >>

 Wall-clock time elapsed: 10041.499326 msecs

But if I use plain "perf stat -a sleep 10"
it seems I get wrong values again (16 G cycles/sec) for all next perf sessions


# perf stat -a sleep 10

 Performance counter stats for 'sleep':

   80332.718661  task clock ticks     (msecs)
          80602  context switches     (events)
              4  CPU migrations       (events)
            473  pagefaults           (events)
   160665397757  CPU cycles           (events)  << bad >>
   160079277974  instructions         (events)
         793665  cache references     (events)
         226829  cache misses         (events)

 Wall-clock time elapsed: 10041.302969 msecs

# perf stat -e cycles -a sleep 10

 Performance counter stats for 'sleep':

   160665176575  CPU cycles           (events)  << bad >>

 Wall-clock time elapsed: 10041.491503 msecs


> 
> Also, the overhead might be profile-able, via:
> 
> 	perf record -m 1024 sleep 10
> 
> (this records the profile into output.perf.)
> 
> followed by:
> 
> 	./perf-report | tail -20
> 
> to display a histogram, with kernel-space and user-space symbols 
> mixed into a single profile.
> 
> (Pick up latest -tip to get perf-report built by default.)

Thanks, this is what I use currently

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html