Message-ID: <e1294bbb-008f-a789-99bf-33de63089beb@huaweicloud.com>
Date: Thu, 15 Dec 2022 17:21:23 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, tytso@....edu,
adilger.kernel@...ger.ca, yukuai3@...wei.com
Subject: Re: [RFC PATCH] ext4: dio take shared inode lock when overwriting
preallocated blocks
On 2022/12/15 17:00, Jan Kara wrote:
> On Thu 15-12-22 16:24:49, Zhang Yi wrote:
>> On 2022/12/15 1:01, Jan Kara wrote:
>>> On Sat 03-12-22 18:39:56, Zhang Yi wrote:
>>>> In the dio write path, we only take the shared inode lock for the
>>>> case of an aligned overwrite of initialized blocks inside EOF. But
>>>> overwriting preallocated blocks may only require splitting unwritten
>>>> extents, and that procedure is already protected by the i_data_sem
>>>> lock, so it is safe to release the exclusive inode lock and take the
>>>> shared inode lock instead.
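>>>>
>>>> Conceptually, the check looks something like this (a simplified
>>>> sketch only; the details differ from the actual patch):
>>>>
>>>> 	/*
>>>> 	 * Can this dio write run under the shared inode lock? The
>>>> 	 * range must lie inside EOF and be fully allocated; with this
>>>> 	 * patch a fully *unwritten* (preallocated) range qualifies
>>>> 	 * too, and *unwritten tells the caller which case it is.
>>>> 	 */
>>>> 	static bool ext4_overwrite_io(struct inode *inode, loff_t pos,
>>>> 				      loff_t len, bool *unwritten)
>>>> 	{
>>>> 		struct ext4_map_blocks map;
>>>> 		unsigned int blkbits = inode->i_blkbits;
>>>> 		int err, blklen;
>>>>
>>>> 		if (pos + len > i_size_read(inode))
>>>> 			return false;	/* extending write */
>>>>
>>>> 		map.m_lblk = pos >> blkbits;
>>>> 		map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
>>>> 		blklen = map.m_len;
>>>>
>>>> 		/* lookup only, no allocation */
>>>> 		err = ext4_map_blocks(NULL, inode, &map, 0);
>>>> 		*unwritten = err > 0 &&
>>>> 			     (map.m_flags & EXT4_MAP_UNWRITTEN);
>>>> 		return err == blklen;	/* whole range allocated */
>>>> 	}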
>>>>
>>>> This could give a significant speedup for multi-threaded writes.
>>>> Tested on an Intel Xeon Gold 6140 with an NVMe SSD, using the fio
>>>> parameters below.
>>>>
>>>> direct=1
>>>> ioengine=libaio
>>>> iodepth=10
>>>> numjobs=10
>>>> runtime=60
>>>> rw=randwrite
>>>> size=100G
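>>>>
>>>> (For reference, an equivalent command line would be roughly the
>>>> following; the file path is just an example and bs varies per run:)
>>>>
>>>> 	fio --name=overwrite --filename=/mnt/ext4/testfile --direct=1 \
>>>> 	    --ioengine=libaio --iodepth=10 --numjobs=10 --runtime=60 \
>>>> 	    --rw=randwrite --size=100G --bs=4k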
>>>>
>>>> And the test results are:
>>>> Before:
>>>> bs=4k IOPS=11.1k, BW=43.2MiB/s
>>>> bs=16k IOPS=11.1k, BW=173MiB/s
>>>> bs=64k IOPS=11.2k, BW=697MiB/s
>>>>
>>>> After:
>>>> bs=4k IOPS=41.4k, BW=162MiB/s
>>>> bs=16k IOPS=41.3k, BW=646MiB/s
>>>> bs=64k IOPS=13.5k, BW=843MiB/s
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>>>> ---
>>>> It passed the xfstests auto mode with 1k and 4k block sizes.
>>>
>>> Besides some naming nits (see below), I think this should work. But I
>>> have to say I'm a bit uneasy about this, because we will now be
>>> changing the block mapping from unwritten to written with only a
>>> shared i_rwsem. OTOH that happens during writeback as well, so we
>>> should be fine, and the gain is very nice.
>>>
>> Thanks for the advice, I will change the argument name to make it more
>> reasonable.
>>
>>> Out of curiosity do you have a real usecase for this?
>>
>> No, I was just analysing the performance gap in our benchmark tests, and
>> had some questions and ideas while reading the code. Maybe it could speed
>> up the first-time write in some databases. :)
>>
>> Besides, I want to discuss it a bit more. I originally changed this code
>> to take the shared inode lock and also use ext4_iomap_overwrite_ops for
>> the case of overwriting preallocated blocks. That postpones the extent
>> splitting procedure to endio and gives a much bigger gain than this
>> patch (+~27%).
>>
>> This patch:
>> bs=4k IOPS=41.4k, BW=162MiB/s
>> Postpone split:
>> bs=4k IOPS=52.9k, BW=207MiB/s
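>>
>> Roughly, the two approaches differ as follows (pseudo-code only, not
>> the actual call chains):
>>
>> 	/* Current behaviour: split the unwritten extent *before*
>> 	 * submitting the bio (EXT4_GET_BLOCKS_PRE_IO), so endio only
>> 	 * has to flip the already-split extent to written. */
>> 	ext4_map_blocks(handle, inode, &map,
>> 			EXT4_GET_BLOCKS_IO_CREATE_EXT);
>> 	submit_bio(bio);
>>
>> 	/* Postponed variant: submit against the untouched unwritten
>> 	 * extent; endio then splits and converts in one go, which may
>> 	 * need to allocate metadata blocks and so can hit ENOSPC. */
>> 	submit_bio(bio);
>> 	ext4_convert_unwritten_extents(NULL, inode, offset, size);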
>>
>> But maybe that is too radical. I looked at the history and noticed that
>> commit 0031462b5b39 ("ext4: Split uninitialized extents for direct I/O")
>> introduced the flag EXT4_GET_BLOCKS_DIO (since renamed to
>> EXT4_GET_BLOCKS_PRE_IO) to make sure that preallocated unwritten
>> extents are split before submitting the I/O, which prevents an ENOSPC
>> problem if the filesystem runs out of space in the endio procedure. Four
>> years later, commit 27dd43854227 ("ext4: introduce reserved space")
>> reserved some blocks for metadata allocation. It looks like that commit
>> could also solve the ENOSPC problem for most cases if we postpone extent
>> splitting to the endio procedure. I haven't found any other side
>> effects, so I think it may be fine if we do that. Do you have some
>> advice, or am I missing something?
>
> So you are right that these days splitting of extents could be done only
> on IO completion, because we have a pool of blocks reserved for these
> cases. OTOH this will put higher pressure on the reserved pool, and if we
> are running out of space and there are enough operations running in
> parallel, we *could* run out of reserved blocks. So I wouldn't always
> defer extent splitting to IO completion unless we have a practical and
> sufficiently widespread usecase that would benefit from this
> optimization.
>
Sure, I think so. Thanks for the advice.
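
For reference, my understanding of the reserved pool from commit
27dd43854227 is roughly the following (simplified from
ext4_has_free_clusters(); the real code also handles root-reserved
blocks and percpu counter slack):

	/*
	 * Ordinary allocations must leave s_resv_clusters untouched;
	 * only callers flagged EXT4_MB_USE_RESERVED (e.g. unwritten
	 * extent conversion at endio) may dip into the reserved pool.
	 */
	if (free_clusters - dirty_clusters >= nclusters + resv_clusters)
		return 1;
	if (flags & EXT4_MB_USE_RESERVED) {
		if (free_clusters - dirty_clusters >= nclusters)
			return 1;
	}
	return 0;
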
Yi.