Message-Id: <20110427193419.D17F.A69D9226@jp.fujitsu.com>
Date: Wed, 27 Apr 2011 19:32:21 +0900 (JST)
From: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
To: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: kosaki.motohiro@...fujitsu.com,
LKML <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Galbraith <efault@....de>, Ingo Molnar <mingo@...e.hu>
Subject: Re: [PATCH] cpumask: convert cpumask_of_cpu() with cpumask_of()
> > But why? Are we going to get rid of cpumask_t (which is a fixed-size
> > struct, so direct assignment is perfectly fine)?
> >
> > Also, do we want to convert cpus_allowed to cpumask_var_t? It would save
> > quite a lot of memory on distro configs that set NR_CPUS silly high.
> > Currently NR_CPUS=4096 configs allocate 512 bytes per task for this
> > bitmap, 511 of which will never be used on most machines (510 in the
> > near future).
> >
> > The cost is of course an extra memory dereference in scheduler hot
> > paths... also not nice.
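
To put concrete numbers on the size argument, here is a minimal userland
sketch (the struct names are simplified stand-ins for the kernel's types,
not actual kernel code):

	#include <stdio.h>

	#define NR_CPUS 4096
	#define BITS_PER_LONG (8 * sizeof(unsigned long))

	typedef struct { unsigned long bits[NR_CPUS / BITS_PER_LONG]; } cpumask_t;

	struct task_embedded {		/* current layout: bitmap inline */
		cpumask_t cpus_allowed;
		/* ... other task_struct fields ... */
	};

	struct task_pointer {		/* proposed layout: bitmap behind a pointer */
		cpumask_t *cpus_allowed;
		/* ... other task_struct fields ... */
	};

	int main(void)
	{
		/* prints 512 vs 8 on a 64-bit NR_CPUS=4096 config */
		printf("embedded mask per task: %zu bytes\n", sizeof(cpumask_t));
		printf("pointer per task:       %zu bytes\n", sizeof(cpumask_t *));
		return 0;
	}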
Measurement data probably speaks better than my poor English...
I made a proof-of-concept patch today. The result is better than I expected:
roughly 21% fewer cache misses and about 2.5% less elapsed time on hackbench.
<before>

Performance counter stats for 'hackbench 10 thread 1000' (10 runs):

      1603777813  cache-references          #    56.987 M/sec   ( +- 1.824% )  (scaled from 25.36%)
        13780381  cache-misses              #     0.490 M/sec   ( +- 1.360% )  (scaled from 25.55%)
     24872032348  L1-dcache-loads           #   883.770 M/sec   ( +- 0.666% )  (scaled from 25.51%)
       640394580  L1-dcache-load-misses     #    22.755 M/sec   ( +- 0.796% )  (scaled from 25.47%)

    14.162411769  seconds time elapsed   ( +- 0.675% )
<after>

Performance counter stats for 'hackbench 10 thread 1000' (10 runs):

      1416147603  cache-references          #    51.566 M/sec   ( +- 4.407% )  (scaled from 25.40%)
        10920284  cache-misses              #     0.398 M/sec   ( +- 5.454% )  (scaled from 25.56%)
     24666962632  L1-dcache-loads           #   898.196 M/sec   ( +- 1.747% )  (scaled from 25.54%)
       598640329  L1-dcache-load-misses     #    21.798 M/sec   ( +- 2.504% )  (scaled from 25.50%)

    13.812193312  seconds time elapsed   ( +- 1.696% )
* detailed data is in the attached result.txt
The trick is:

- Typical Linux userland applications don't use the mempolicy and/or
  cpusets APIs at all.
- Therefore, 99.99% of threads have tsk->cpus_allowed == cpu_all_mask.
- In the cpu_all_mask case, every thread can share the same bitmap, which
  may help reduce L1 cache misses in the scheduler (see the sketch after
  this list).
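
To make the sharing concrete, here is a minimal userland sketch of the
copy-on-write scheme (task_init(), task_set_affinity(), and
shared_all_mask are illustrative stand-ins, not the actual code in the
attached patches):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define NR_CPUS 4096
	#define BITS_PER_LONG (8 * sizeof(unsigned long))

	typedef struct { unsigned long bits[NR_CPUS / BITS_PER_LONG]; } cpumask_t;

	/* One shared "all CPUs" bitmap.  Every task that never restricts
	 * its affinity keeps pointing here, so the scheduler touches the
	 * same cache-hot lines for all of them. */
	static cpumask_t shared_all_mask;

	struct task {
		cpumask_t *cpus_allowed;	/* pointer instead of a 512-byte field */
	};

	static void task_init(struct task *t)
	{
		t->cpus_allowed = &shared_all_mask;	/* no per-task allocation */
	}

	/* Copy-on-write: only a task whose affinity actually changes pays
	 * for a private bitmap. */
	static int task_set_affinity(struct task *t, const cpumask_t *newmask)
	{
		if (t->cpus_allowed == &shared_all_mask) {
			t->cpus_allowed = malloc(sizeof(cpumask_t));
			if (!t->cpus_allowed)
				return -1;
		}
		memcpy(t->cpus_allowed, newmask, sizeof(cpumask_t));
		return 0;
	}

	int main(void)
	{
		struct task a, b;
		cpumask_t one_cpu = { { 1UL } };	/* only CPU 0 set */

		memset(&shared_all_mask, 0xff, sizeof(shared_all_mask));
		task_init(&a);
		task_init(&b);
		printf("share: %d\n", a.cpus_allowed == b.cpus_allowed);	/* 1 */

		task_set_affinity(&a, &one_cpu);	/* a takes a private copy */
		printf("share: %d\n", a.cpus_allowed == b.cpus_allowed);	/* 0 */
		return 0;
	}

The extra dereference Peter mentioned remains, but when almost every task
shares one bitmap the pointed-to cache line stays hot, which is what the
numbers above suggest.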
What do you think?
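
For anyone skimming without downloading the attachments, the split is
presumably: 0001 mechanically routes every direct field access through the
kernel's existing tsk_cpus_allowed() accessor, so that 0002 can change the
field and the accessor in one place. Roughly (a sketch, not the literal
patch contents):

	/* before (bitmap embedded in task_struct): */
	#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

	/* after 0002 (field becomes a pointer): */
	#undef tsk_cpus_allowed
	#define tsk_cpus_allowed(tsk) ((tsk)->cpus_allowed)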
Download attachment "result.txt" of type "application/octet-stream" (3177 bytes)
Download attachment "result.txt" of type "application/octet-stream" (3177 bytes)
Download attachment "0001-s-task-cpus_allowed-tsk_cpus_allowed.patch" of type "application/octet-stream" (19412 bytes)
Download attachment "0002-change-task-cpus_allowed-to-pointer.patch" of type "application/octet-stream" (8123 bytes)