Message-ID: <1e9b8ff5-76ba-a8de-e8b9-bbdd07ebede8@nvidia.com>
Date: Wed, 9 Nov 2022 03:35:08 +0000
From: Chaitanya Kulkarni <chaitanyak@...dia.com>
To: Gabriel Krisman Bertazi <krisman@...e.de>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
Hugh Dickins <hughd@...gle.com>,
Keith Busch <kbusch@...nel.org>,
"axboe@...nel.dk" <axboe@...nel.dk>,
Liu Song <liusong@...ux.alibaba.com>, Jan Kara <jack@...e.cz>
Subject: Re: [PATCH] sbitmap: Use single per-bitmap counting to wake up queued
tags
On 11/8/22 19:03, Gabriel Krisman Bertazi wrote:
> Chaitanya Kulkarni <chaitanyak@...dia.com> writes:
>
>>> For more interesting cases, where there is queueing, we need to take
>>> into account the cross-communication of the atomic operations. I've
>>> been benchmarking by running parallel fio jobs against a single hctx
>>> nullb in different hardware queue depth scenarios, and verifying both
>>> IOPS and queueing.
>>>
>>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
>>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
>>> varying only the hardware queue length per test.
>>>
>>> queue size    2                 4                 8                 16                32                64
>>> 6.1-rc2       1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
>>> patched       1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
>>
>>>
>
> Hi Chaitanya,
>
> Thanks for the feedback.
>
>> So if I understand correctly,
>> QD 2, 4, 8 show a clear performance benefit from this patch, whereas
>> QD 16, 32, 64 show a drop in performance, is that correct?
>>
>> If my observation is correct, then applications with a high QD will
>> observe a drop in performance?
>
> To be honest, I'm not sure. Given the overlap of the standard deviations
> (in parentheses) with the means, I'm not sure the observed drop is
> statistically significant. In my prior analysis, I thought it wasn't.
>
> I don't see where a significant difference would come from, to be honest,
> because the higher the QD, the more likely it is to go through the
> not-contended path, where sbq->ws_active == 0. This hot path is
> identical to the existing implementation.
>
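Just to make sure we are talking about the same path -- the uncontended
case I read you as referring to is roughly the early return on ws_active,
paraphrased below from my memory of lib/sbitmap.c (not copied from your
patch, so the exact signature and details may differ):

void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
{
	/* hot path: nobody is queued on the wait queues, nothing to do */
	if (!atomic_read(&sbq->ws_active))
		return;

	/* slow path: completion accounting and batched wake-ups (elided) */
}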
The numbers were taken on null_blk, so the drop I can see here may end
up looking different on real H/W, and I cannot comment on that since we
don't have that data ...
Did you repeat the experiment on real H/W such as an NVMe SSD?
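For reference, the kind of job I have in mind for others to repeat this
is something along these lines -- a guess based on your description
(20 jobs, QD=64, 512b randwrites against nullb0), not your actual job
file, so please do share the real one:

[global]
filename=/dev/nullb0
direct=1
ioengine=libaio
rw=randwrite
bs=512
iodepth=64
numjobs=20
runtime=30
time_based=1
group_reporting=1

[randwrite-qd64]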
>> Also, please share a table with block size/IOPS/BW/CPU (system/user)
>> /LAT/SLAT with % increase/decrease, and document the raw numbers at the
>> end of the cover-letter for completeness, along with the fio job so
>> that others can repeat the experiment...
>
> This was issued against the nullb and the IO size is fixed, matching the
> device's block size (512b), which is why I am not tracking BW, only
> IOPS. I'm not sure the BW is still relevant in this scenario.
>
> I'll definitely follow up with CPU time and latencies, and share the
> fio job. I'll also take another look on the significance of the
> measured values for high QD.
>
Yes, please. If the CPU usage is much higher, then we need to know that
the above numbers come at the cost of higher CPU; in that case an
IOPS-per-core / BW-per-core metric can be very useful.
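As a rough illustration with the numbers above (and assuming all 20
CPUs are fully busy): ~8.4M IOPS / 20 CPUs is ~420K IOPS per core; if
the patched kernel needs noticeably more CPU time to reach the same
IOPS, it is that per-core figure that will show it.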
-ck