linux-kernel - Re: [PATCH] block: per-cpu counters for in-flight IO accounting

Message-ID: <538F2D33.2070106@kernel.dk>
Date:	Wed, 04 Jun 2014 08:29:07 -0600
From:	Jens Axboe <axboe@...nel.dk>
To:	Shaohua Li <shli@...nel.org>
CC:	Matias Bjørling <m@...rling.me>,
	sbradshaw@...ron.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] block: per-cpu counters for in-flight IO accounting

On 2014-06-04 04:39, Shaohua Li wrote:
> On Fri, May 30, 2014 at 07:49:52AM -0600, Jens Axboe wrote:
>> On 2014-05-30 06:11, Shaohua Li wrote:
>>> On Fri, May 09, 2014 at 10:41:27AM -0600, Jens Axboe wrote:
>>>> On 05/09/2014 08:12 AM, Jens Axboe wrote:
>>>>> On 05/09/2014 03:17 AM, Matias Bjørling wrote:
>>>>>> With multi-million IOPS and multi-node workloads, the atomic_t in_flight
>>>>>> tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
>>>>>> counters to alleviate this.
>>>>>
>>>>> The part stats are a pain in the butt, I've tried to come up with a
>>>>> great fix for them too. But I don't think the percpu conversion is
>>>>> necessarily the right one. The summing is part of the hotpath, so percpu
>>>>> counters aren't necessarily the right way to go. I don't have a better
>>>>> answer right now, otherwise it would have been fixed :-)
>>>>
>>>> Actual data point - this slows my test down ~14% compared to the stock
>>>> kernel. Also, if you experiment with this, you need to watch for the
>>>> out-of-core users of the part stats (like DM).
>>>
>>> I tried Matias's patch. Performance actually improved significantly
>>> (there are other cache line issues though, e.g. hd_struct_get). Jens, what did
>>> you run? part_in_flight() has 3 usages. 2 are for status output, which are cold
>>> paths. part_round_stats_single() uses it too, but it's a cold path too, as we
>>> sample data every jiffy. Are you using HZ=1000? Maybe we should sample the data
>>> every 10ms instead of every jiffy?
>>
>> I ran peak and normal benchmarks on a p320, on a 4 socket box (64
>> cores). The problem is the one hot path of part_in_flight(); summing
>> percpu for that is too expensive. On bigger systems than mine, it'd
>> be even worse.
>
> I ran a null_blk test on a 4-socket machine, and Matias's patch showed an
> improvement. I also didn't find part_in_flight() being called in any hot path.

It's done for every IO completion, which is (by definition) a hot path. I
tested on two devices here, and it was definitely slower. And my system
had NR_CPUS set to just the right number, so I suspect it'd be much worse
on bigger systems.

-- 
Jens Axboe
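
The trade-off argued over in this thread can be sketched in user-space C.
Per-CPU counters make increment and decrement cheap, because each CPU touches
only its own cache line, but reading the total means walking every slot. This
is only an illustration: NCPUS, CACHELINE, struct inflight_pcpu and the
inflight_*() helpers are hypothetical names, not the kernel's API and not the
code from the patch under discussion.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS     64   /* hypothetical CPU count, assumed for illustration */
#define CACHELINE 64   /* assumed cache line size in bytes */

/* One slot per CPU, padded so two CPUs never share a cache line. */
struct pcpu_slot {
	long count;
	char pad[CACHELINE - sizeof(long)];
};

struct inflight_pcpu {
	struct pcpu_slot slot[NCPUS];
};

/* Write side: touches only the calling CPU's slot, no shared atomic. */
static inline void inflight_inc(struct inflight_pcpu *c, unsigned int cpu)
{
	assert(cpu < NCPUS);
	c->slot[cpu].count++;
}

static inline void inflight_dec(struct inflight_pcpu *c, unsigned int cpu)
{
	assert(cpu < NCPUS);
	c->slot[cpu].count--;
}

/*
 * Read side: must visit every slot, so the cost grows with NCPUS.
 * If 'visited' is non-NULL it receives the number of slots read.
 */
static inline long inflight_sum(const struct inflight_pcpu *c,
				unsigned int *visited)
{
	long sum = 0;
	unsigned int n = 0;

	for (unsigned int cpu = 0; cpu < NCPUS; cpu++) {
		sum += c->slot[cpu].count;
		n++;
	}
	if (visited)
		*visited = n;
	return sum;
}
```

An IO submitted on one CPU and completed on another leaves one slot at +1 and
another at -1, so only the sum is meaningful. If that sum is needed on every
completion, each completion reads NCPUS cache lines instead of one, which is
the cost Jens is pointing at above.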
