Date:	Wed, 11 May 2011 08:59:10 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Nikhil Rao <ncrao@...gle.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Mike Galbraith <efault@....de>, linux-kernel@...r.kernel.org,
	"Nikunj A. Dadhania" <nikunj@...ux.vnet.ibm.com>,
	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>,
	Stephan Barwolf <stephan.baerwolf@...ilmenau.de>
Subject: Re: [PATCH v1 00/19] Increase resolution of load weights


* Nikhil Rao <ncrao@...gle.com> wrote:

> I think we need the branch for 64-bit kernels. I don't like the branch but I 
> can't think of a better way to avoid it. Do you have any suggestion?

It was just a quick stab in the dark; I was fishing for more 
micro-optimizations on 64-bit (32-bit, as long as we leave its resolution 
alone, should not matter much): clearly there is *some* new overhead on 64-bit 
kernels too, so it would be nice to reduce that to the absolute minimum.

> For 32-bit systems, the compiler should ideally optimize this branch away. 
> Unfortunately gcc-4.4.3 doesn't do that (and I'm not sure if a later version 
> does it either). We could add a macro around this check to avoid the branch 
> for 32-bit and do the check for 64-bit kernels?

I'd rather keep it easy to read. If we keep the unit of load a 32-bit word on 
32-bit kernels then 32-bit will see basically no extra overhead, right? 
(modulo the compiler failing to notice such optimizations.)
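
If a macro turns out to be necessary after all, a compile-time switch along 
these lines would at least keep the hot path readable (a sketch only - the 
names are illustrative, not necessarily what the patchset ends up using):

	/*
	 * Carry load in a higher resolution on 64-bit kernels only.
	 * On 32-bit kernels the macros are identity ops, so there is
	 * no branch and no extra shift at all.
	 */
	#if BITS_PER_LONG > 32
	# define SCHED_LOAD_RESOLUTION	10
	# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
	# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
	#else
	# define SCHED_LOAD_RESOLUTION	0
	# define scale_load(w)		(w)
	# define scale_load_down(w)	(w)
	#endif

Since the decision is made by the preprocessor there is nothing left for the 
compiler to optimize away on 32-bit, regardless of gcc version.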

Also, it's a good idea to do performance measurements with the newest gcc 
(4.6) if possible: by the time such a change hits the distros it will be the 
established stock distro compiler that kernels get built with. Maybe your 
figures will improve, and maybe it can optimize this branch away as well.
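
For example, building with a specific compiler is as simple as (assuming gcc 
4.6 is installed as 'gcc-4.6' on your box):

 aldebaran:~/linux> make CC=gcc-4.6 bzImage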

> > Also, the above (and the other scale-adjustment changes) probably explains 
> > why the instruction count went up on 64-bit.
> 
> Yes, that makes sense. We see an increase in instruction count of about 2% 
> with the new version of the patchset, down from 5.8% (will post the new 
> patchset soon). Assuming 30% of the cost of pipe test is scheduling, that is 
> an effective increase of approx. 6.7%. I'll post the data and some analysis 
> along with the new version.

An instruction count increase does not necessarily mean a linear slowdown: if 
those instructions are cheap or are scheduled well by the CPU then the 
slowdown will often be smaller.

Conversely, a 1% increase in the instruction count can sometimes slow down a 
workload by 5%, if that 1% does divisions, has complex data-path dependencies 
or misses branch prediction a lot.
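
As a toy illustration (a standalone demo, nothing from the patchset): the two 
loops below run the same number of iterations and retire a similar number of 
instructions, but the second does an integer division per iteration, so perf 
stat will show a much higher cycle count for it:

	/* Build with a low optimization level so the loops survive:
	 *
	 *   gcc -O1 -o div-demo div-demo.c
	 *   taskset 1 perf stat ./div-demo
	 */
	#include <stdio.h>

	#define N 100000000L

	int main(void)
	{
		long sum = 0, quot = 0, i;

		/* cheap: one add per iteration, no long-latency ops */
		for (i = 1; i <= N; i++)
			sum += i;

		/* expensive: one divide per iteration; the high latency
		 * of the divider dominates the loop */
		for (i = 1; i <= N; i++)
			quot += N / i;

		/* print the results so gcc cannot discard the loops */
		printf("%ld %ld\n", sum, quot);
		return 0;
	}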

So you should keep an eye on the cycle count as well. Latest -tip's perf stat 
can also measure 'stalled cycles':

aldebaran:~/sched-tests> taskset 1 perf stat --repeat 3 ./pipe-test-1m

 Performance counter stats for './pipe-test-1m' (3 runs):

       6499.787926 task-clock               #    0.437 CPUs utilized            ( +-  0.41% )
         2,000,108 context-switches         #    0.308 M/sec                    ( +-  0.00% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-100.00% )
               147 page-faults              #    0.000 M/sec                    ( +-  0.00% )
    14,226,565,939 cycles                   #    2.189 GHz                      ( +-  0.49% )
     6,897,331,129 stalled-cycles-frontend  #   48.48% frontend cycles idle     ( +-  0.90% )
     4,230,895,459 stalled-cycles-backend   #   29.74% backend  cycles idle     ( +-  1.31% )
    14,002,256,109 instructions             #    0.98  insns per cycle        
                                            #    0.49  stalled cycles per insn  ( +-  0.02% )
     2,703,891,945 branches                 #  415.997 M/sec                    ( +-  0.02% )
        44,994,805 branch-misses            #    1.66% of all branches          ( +-  0.27% )

       14.859234036  seconds time elapsed  ( +-  0.19% )

The stalled-cycles frontend/backend metrics indicate whether a workload 
utilizes the CPU's resources optimally. Looking at a 'perf record -e 
stalled-cycles-frontend' profile via 'perf report' will show you the problem 
areas.
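
For example (commands only - the actual profile will of course depend on the 
machine and the kernel):

 aldebaran:~/sched-tests> taskset 1 perf record -e stalled-cycles-frontend ./pipe-test-1m
 aldebaran:~/sched-tests> perf report --stdio | head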
 
Most of the 'problem areas' will be unrelated to your code.

A 'near perfectly utilized' CPU looks like this:

aldebaran:~/opt> taskset 1 perf stat --repeat 10 ./fill_1b

 Performance counter stats for './fill_1b' (10 runs):

       1880.489837 task-clock               #    0.998 CPUs utilized            ( +-  0.15% )
                36 context-switches         #    0.000 M/sec                    ( +- 19.87% )
                 1 CPU-migrations           #    0.000 M/sec                    ( +- 59.63% )
                99 page-faults              #    0.000 M/sec                    ( +-  0.10% )
     6,027,432,226 cycles                   #    3.205 GHz                      ( +-  0.15% )
        22,138,455 stalled-cycles-frontend  #    0.37% frontend cycles idle     ( +- 36.56% )
        16,400,224 stalled-cycles-backend   #    0.27% backend  cycles idle     ( +- 38.12% )
    18,008,803,113 instructions             #    2.99  insns per cycle        
                                            #    0.00  stalled cycles per insn  ( +-  0.00% )
     1,001,802,536 branches                 #  532.735 M/sec                    ( +-  0.01% )
            22,842 branch-misses            #    0.00% of all branches          ( +-  9.07% )

        1.884595529  seconds time elapsed  ( +-  0.15% )

Both stall counts are very low. Stall counts this low are pretty hard to 
achieve in general, so the practical approach is a before/after comparison: 
'perf diff' compares two recorded profiles directly:

 aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
 [ perf record: Woken up 2 times to write data ]
 [ perf record: Captured and wrote 0.427 MB perf.data (~18677 samples) ]
 aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
 [ perf record: Woken up 2 times to write data ]
 [ perf record: Captured and wrote 0.428 MB perf.data (~18685 samples) ]
 aldebaran:~/sched-tests> perf diff | head -10
 # Baseline  Delta          Shared Object                         Symbol
 # ........ ..........  .................  .............................
 #
     2.68%     +0.84%  [kernel.kallsyms]  [k] select_task_rq_fair
     3.28%     -0.17%  [kernel.kallsyms]  [k] fsnotify
     2.67%     +0.13%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     2.46%     +0.11%  [kernel.kallsyms]  [k] pipe_read
     2.42%             [kernel.kallsyms]  [k] schedule
     2.11%     +0.28%  [kernel.kallsyms]  [k] copy_user_generic_string
     2.13%     +0.18%  [kernel.kallsyms]  [k] mutex_lock

 ( Note: these were two short runs on the same kernel so the diff shows the 
   natural noise of the profile of this workload. Longer runs are needed to 
   measure effects smaller than 1%. )

So there is a wide range of tools you can use to understand the precise 
performance impact of your patch, and in turn to present to us what you have 
learned about it.

Such analysis saves quite a bit of time for us scheduler maintainers and 
makes performance-impacting patches a lot easier to apply :-)

Thanks,

	Ingo