linux-ext4 - Re: [RFC PATCH] ext4: dio take shared inode lock when overwriting preallocated blocks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <e1294bbb-008f-a789-99bf-33de63089beb@huaweicloud.com>
Date:   Thu, 15 Dec 2022 17:21:23 +0800
From:   Zhang Yi <yi.zhang@...weicloud.com>
To:     Jan Kara <jack@...e.cz>
Cc:     linux-ext4@...r.kernel.org, tytso@....edu,
        adilger.kernel@...ger.ca, yukuai3@...wei.com
Subject: Re: [RFC PATCH] ext4: dio take shared inode lock when overwriting
 preallocated blocks

On 2022/12/15 17:00, Jan Kara wrote:
> On Thu 15-12-22 16:24:49, Zhang Yi wrote:
>> On 2022/12/15 1:01, Jan Kara wrote:
>>> On Sat 03-12-22 18:39:56, Zhang Yi wrote:
>>>> In the dio write path, we only take shared inode lock for the case of
>>>> aligned overwriting initialized blocks inside EOF. But for overwriting
>>>> preallocated blocks, it may only need to split unwritten extents, this
>>>> procedure has been protected under i_data_sem lock, it's safe to
>>>> release the exclusive inode lock and take shared inode lock.
>>>>
>>>> This could give a significant speed up for multi-threaded writes. Test
>>>> on Intel Xeon Gold 6140 and nvme SSD with below fio parameters.
>>>>
>>>>  direct=1
>>>>  ioengine=libaio
>>>>  iodepth=10
>>>>  numjobs=10
>>>>  runtime=60
>>>>  rw=randwrite
>>>>  size=100G
>>>>
>>>> And the test result are:
>>>> Before:
>>>>  bs=4k       IOPS=11.1k, BW=43.2MiB/s
>>>>  bs=16k      IOPS=11.1k, BW=173MiB/s
>>>>  bs=64k      IOPS=11.2k, BW=697MiB/s
>>>>
>>>> After:
>>>>  bs=4k       IOPS=41.4k, BW=162MiB/s
>>>>  bs=16k      IOPS=41.3k, BW=646MiB/s
>>>>  bs=64k      IOPS=13.5k, BW=843MiB/s
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>>>> ---
>>>>  It passed xfstests auto mode with 1k and 4k blocksize.
>>>
>>> Besides some naming nits (see below) I think this should work. But I have
>>> to say I'm a bit uneasy about this because we will now be changing block
>>> mapping from unwritten to written only with shared i_rwsem. OTOH that
>>> happens during writeback as well so we should be fine and the gain is very
>>> nice.
>>>
>> Thanks for advice, I will change the argument name to make it more reasonable.
>>
>>> Out of curiosity do you have a real usecase for this?
>>
>> No, I was just analyse the performance gap in our benchmark tests, and have
>> some question and idea while reading the code. Maybe it could speed up the
>> first time write in some database. :)
>>
>> Besides, I want to discuss it a bit more. I originally changed this code to
>> switch to take the shared inode and also use ext4_iomap_overwrite_ops for
>> the overwriting preallocated blocks case. It will postpone the splitting extent
>> procedure to endio and could give a much more gain than this patch (+~27%).
>>
>> This patch:
>>   bs=4k       IOPS=41.4k, BW=162MiB/s
>> Postpone split:
>>   bs=4k       IOPS=52.9k, BW=207MiB/s
>>
>> But I think it's maybe too radical. I looked at the history and notice in
>> commit 0031462b5b39 ("ext4: Split uninitialized extents for direct I/O"),
>> it introduce the flag EXT4_GET_BLOCKS_DIO(now it had been renamed to
>> EXT4_GET_BLOCKS_PRE_IO) to make sure that the preallocated unwritten
>> extent have been splitted before submitting the I/O, which is used to
>> prevent the ENOSPC problem if the filesystem run out-of-space in the
>> endio procedure. And 4 years later, commit 27dd43854227 ("ext4: introduce
>> reserved space") reserve some blocks for metedata allocation.  It looks
>> like this commit could also slove the ENOSPC problem for most cases if we
>> postpone extent splitting into the endio procedure. I don't find other
>> side effect, so I think it may also fine if we do that. Do you have some
>> advice or am I missing something?
> 
> So you are right these days splitting of extents could be done only on IO
> completion because we have a pool of blocks reserved for these cases. OTOH
> this will make the pressure on the reserved pool higher and if we are
> running out of space and there are enough operations running in parallel we
> *could* run out of reserved blocks. So I wouldn't always defer extent
> splitting to IO completion unless we have a practical and sufficiently
> widespread usecase that would benefit from this optimization.
> 

Sure, I think so, thanks for advice.

Yi.