Message-ID: <5356916F.4000205@kernel.dk>
Date: Tue, 22 Apr 2014 09:57:35 -0600
From: Jens Axboe <axboe@...nel.dk>
To: Alexander Gordeev <agordeev@...hat.com>,
linux-kernel@...r.kernel.org
CC: Kent Overstreet <kmo@...erainc.com>, Shaohua Li <shli@...nel.org>,
Nicholas Bellinger <nab@...ux-iscsi.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when
stealing tags
On 04/22/2014 08:03 AM, Jens Axboe wrote:
> On 2014-04-22 01:10, Alexander Gordeev wrote:
>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>>> But other systems (more dense?) showed increased cache-hit rate
>>> up to 20%, i.e. this one:
>>
>> Hello Gentlemen,
>>
>> Any feedback on this?
>
> Sorry for dropping the ball on this. Improvements wrt when to steal, how
> much, and from whom are sorely needed in percpu_ida. I'll do a bench
> with this on a system that currently falls apart with it.
Ran some quick numbers with three kernels:

stock     3.15-rc2
limit     3.15-rc2 + steal limit patch (attached)
limit+ag  3.15-rc2 + steal limit + your topology patch
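
For reference, the idea behind the steal limit, as a standalone toy
model. This is NOT the attached patch; MAX_STEAL, the data layout, and
the sizes are all invented for illustration. On a local cache miss,
visit at most a fixed number of remote CPU caches instead of walking
all of them:

/* Toy model of capped tag stealing. Illustration only. */
#include <stdio.h>

#define NR_CPUS   64
#define MAX_STEAL  4	/* visit at most this many remote caches per miss */

struct cpu_cache {
	int nr_free;		/* tags currently cached on this CPU */
	int tags[32];
};

static struct cpu_cache caches[NR_CPUS];

/* On a local miss, scan at most MAX_STEAL remote caches, bounding how
 * hard we hammer other CPUs' locks when the pool runs dry. */
static int steal_tag(int self)
{
	int visited, cpu = self;

	for (visited = 0; visited < MAX_STEAL; visited++) {
		cpu = (cpu + 1) % NR_CPUS;
		if (cpu == self)
			break;
		if (caches[cpu].nr_free)	/* real code takes caches[cpu].lock */
			return caches[cpu].tags[--caches[cpu].nr_free];
	}
	return -1;	/* give up; fall back to the global freelist or sleep */
}

int main(void)
{
	caches[5].nr_free = 1;
	caches[5].tags[0] = 42;
	printf("stole tag %d\n", steal_tag(3));	/* cpu 5 is 2 hops away */
	return 0;
}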
Two tests were run. The device has an effective queue depth limit of
255, so one test ran at QD=248 (low) and one at QD=512 (high), to
exercise both near-limit and over-limit depth. Eight processes were
used, split into two groups of four. One group always ran on the
device's local node; the other ran on the adjacent node (near) or on
the far node (far).
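
The jobs were driven with fio. A rough equivalent job file for the
near+low case, with the device path and CPU masks as placeholders
rather than the actual rig, would look something like:

; sketch only; /dev/mtip0 and the cpus_allowed masks are placeholders
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=31          ; 8 jobs total -> 248 aggregate; 64 -> 512 for "high"
runtime=30
time_based
group_reporting
filename=/dev/mtip0

[local]
numjobs=4
cpus_allowed=0-7    ; node local to the device

[near]
numjobs=4
cpus_allowed=8-15   ; adjacent node; use the far node's CPUs for "far"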
Near + low
----------
          IOPS     sys time
stock     1009.5K  55.78%
limit     1084.4K  54.47%
limit+ag  1058.1K  52.42%

Near + high
-----------
          IOPS     sys time
stock      949.1K  75.12%
limit      980.7K  64.74%
limit+ag  1010.1K  70.27%

Far + low
---------
          IOPS     sys time
stock      600.0K  72.28%
limit      761.7K  71.17%
limit+ag   762.5K  74.48%

Far + high
----------
          IOPS     sys time
stock      465.9K  91.66%
limit      716.2K  88.68%
limit+ag   758.0K  91.00%
One huge issue on this box is that it's a 4 socket/node machine with 32
cores (64 threads). Combined with a 255 queue depth limit, that leaves
only ~4 tags per CPU, so the percpu caching does not work well. I did
not include stock+ag results, as they didn't change things very much
for me. We simply have to limit the stealing first, or we're still
going to be hammering on the percpu locks. Comparing the top profiles
from stock-far-high and limit+ag-far-high looks pretty scary. Here's
the stock one:
-  50.84%  fio  [kernel.kallsyms]
      _raw_spin_lock
      + 89.83% percpu_ida_alloc
      +  6.03% mtip_queue_rq
      +  2.90% percpu_ida_free
So ~50% of system time is spent acquiring a spinlock, with 90% of that
in percpu_ida. The limit+ag variant looks like this:
-  32.93%  fio  [kernel.kallsyms]
      _raw_spin_lock
      + 78.35% percpu_ida_alloc
      + 19.49% mtip_queue_rq
      +  1.21% __blk_mq_run_hw_queue
which is still pretty horrid and has plenty of room for improvement. I
think we need to make better decisions on the granularity of the tag
caching. If we let thread siblings share a cache, that'll double our
effective caching. If that's still not enough, I bet per-node/socket
caching would be a huge improvement. A strawman for that is sketched
below.
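
As a strawman, using the 3.15-era topology helpers from
<linux/topology.h> (tag_cache_index() is a made-up name, not anything
in the tree, and this assumes kernel context):

/* Sketch only: map a CPU to a tag cache. Collapsing SMT siblings
 * halves the number of caches (doubling tags per cache); per-node
 * granularity goes further still. */
static unsigned int tag_cache_index(unsigned int cpu, bool per_node)
{
	if (per_node)
		return cpu_to_node(cpu);	/* one shared cache per node */

	/* Both hyperthreads of a core map to the first CPU in the
	 * sibling mask, so they end up sharing one cache. */
	return cpumask_first(topology_thread_cpumask(cpu));
}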
--
Jens Axboe
[Attachment: limit-steal.patch (text/x-patch, 691 bytes)]