Message-Id: <20110427193419.D17F.A69D9226@jp.fujitsu.com>
Date: Wed, 27 Apr 2011 19:32:21 +0900 (JST)
From: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
To: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: kosaki.motohiro@...fujitsu.com,
LKML <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Galbraith <efault@....de>, Ingo Molnar <mingo@...e.hu>
Subject: Re: [PATCH] cpumask: convert cpumask_of_cpu() with cpumask_of()
> > But why? Are we going to get rid of cpumask_t (which is a fixed-size
> > struct, so direct assignment is perfectly fine)?
> >
> > Also, do we want to convert cpus_allowed to cpumask_var_t? It would save
> > quite a lot of memory on distro configs that set NR_CPUS silly high.
> > Currently NR_CPUS=4096 configs allocate 512 bytes per task for this
> > bitmap, 511 of which will never be used on most machines (510 in the
> > near future).
> >
> > The cost is of course an extra memory dereference in scheduler hot
> > paths... also not nice.
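
To put concrete numbers on the size argument, here is a minimal userland
sketch (the struct names are simplified stand-ins for the kernel's types,
not actual kernel code):

	#include <stdio.h>

	#define NR_CPUS 4096
	#define BITS_PER_LONG (8 * sizeof(unsigned long))

	typedef struct { unsigned long bits[NR_CPUS / BITS_PER_LONG]; } cpumask_t;

	struct task_embedded {		/* current layout: bitmap inline */
		cpumask_t cpus_allowed;
		/* ... other task_struct fields ... */
	};

	struct task_pointer {		/* proposed layout: bitmap behind a pointer */
		cpumask_t *cpus_allowed;
		/* ... other task_struct fields ... */
	};

	int main(void)
	{
		/* prints 512 vs 8 on a 64-bit NR_CPUS=4096 config */
		printf("embedded mask per task: %zu bytes\n", sizeof(cpumask_t));
		printf("pointer per task:       %zu bytes\n", sizeof(cpumask_t *));
		return 0;
	}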
Measurement data probably speaks better than my poor English...
I made a proof-of-concept patch today. The result is better than I expected:
roughly 21% fewer cache misses and about 2.5% less elapsed time on hackbench.
<before>

Performance counter stats for 'hackbench 10 thread 1000' (10 runs):

      1603777813  cache-references          #    56.987 M/sec   ( +- 1.824% )  (scaled from 25.36%)
        13780381  cache-misses              #     0.490 M/sec   ( +- 1.360% )  (scaled from 25.55%)
     24872032348  L1-dcache-loads           #   883.770 M/sec   ( +- 0.666% )  (scaled from 25.51%)
       640394580  L1-dcache-load-misses     #    22.755 M/sec   ( +- 0.796% )  (scaled from 25.47%)

    14.162411769  seconds time elapsed   ( +- 0.675% )
<after>

Performance counter stats for 'hackbench 10 thread 1000' (10 runs):

      1416147603  cache-references          #    51.566 M/sec   ( +- 4.407% )  (scaled from 25.40%)
        10920284  cache-misses              #     0.398 M/sec   ( +- 5.454% )  (scaled from 25.56%)
     24666962632  L1-dcache-loads           #   898.196 M/sec   ( +- 1.747% )  (scaled from 25.54%)
       598640329  L1-dcache-load-misses     #    21.798 M/sec   ( +- 2.504% )  (scaled from 25.50%)

    13.812193312  seconds time elapsed   ( +- 1.696% )
* detailed data is in the attached result.txt
The trick is:

- Typical Linux userland applications don't use the mempolicy and/or
  cpusets APIs at all.
- Therefore, 99.99% of threads have tsk->cpus_allowed == cpu_all_mask.
- In the cpu_all_mask case, every thread can share the same bitmap, which
  may help reduce L1 cache misses in the scheduler (see the sketch after
  this list).
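
To make the sharing concrete, here is a minimal userland sketch of the
copy-on-write scheme (task_init(), task_set_affinity(), and
shared_all_mask are illustrative stand-ins, not the actual code in the
attached patches):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define NR_CPUS 4096
	#define BITS_PER_LONG (8 * sizeof(unsigned long))

	typedef struct { unsigned long bits[NR_CPUS / BITS_PER_LONG]; } cpumask_t;

	/* One shared "all CPUs" bitmap.  Every task that never restricts
	 * its affinity keeps pointing here, so the scheduler touches the
	 * same cache-hot lines for all of them. */
	static cpumask_t shared_all_mask;

	struct task {
		cpumask_t *cpus_allowed;	/* pointer instead of a 512-byte field */
	};

	static void task_init(struct task *t)
	{
		t->cpus_allowed = &shared_all_mask;	/* no per-task allocation */
	}

	/* Copy-on-write: only a task whose affinity actually changes pays
	 * for a private bitmap. */
	static int task_set_affinity(struct task *t, const cpumask_t *newmask)
	{
		if (t->cpus_allowed == &shared_all_mask) {
			t->cpus_allowed = malloc(sizeof(cpumask_t));
			if (!t->cpus_allowed)
				return -1;
		}
		memcpy(t->cpus_allowed, newmask, sizeof(cpumask_t));
		return 0;
	}

	int main(void)
	{
		struct task a, b;
		cpumask_t one_cpu = { { 1UL } };	/* only CPU 0 set */

		memset(&shared_all_mask, 0xff, sizeof(shared_all_mask));
		task_init(&a);
		task_init(&b);
		printf("share: %d\n", a.cpus_allowed == b.cpus_allowed);	/* 1 */

		task_set_affinity(&a, &one_cpu);	/* a takes a private copy */
		printf("share: %d\n", a.cpus_allowed == b.cpus_allowed);	/* 0 */
		return 0;
	}

The extra dereference Peter mentioned remains, but when almost every task
shares one bitmap the pointed-to cache line stays hot, which is what the
numbers above suggest.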
What do you think?
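
For anyone skimming without downloading the attachments, the split is
presumably: 0001 mechanically routes every direct field access through the
kernel's existing tsk_cpus_allowed() accessor, so that 0002 can change the
field and the accessor in one place. Roughly (a sketch, not the literal
patch contents):

	/* before (bitmap embedded in task_struct): */
	#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

	/* after 0002 (field becomes a pointer): */
	#undef tsk_cpus_allowed
	#define tsk_cpus_allowed(tsk) ((tsk)->cpus_allowed)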
Download attachment "result.txt" of type "application/octet-stream" (3177 bytes)
Download attachment "result.txt" of type "application/octet-stream" (3177 bytes)
Download attachment "0001-s-task-cpus_allowed-tsk_cpus_allowed.patch" of type "application/octet-stream" (19412 bytes)
Download attachment "0002-change-task-cpus_allowed-to-pointer.patch" of type "application/octet-stream" (8123 bytes)