lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Thu, 15 Dec 2022 17:21:23 +0800
From:   Zhang Yi <>
To:     Jan Kara <>
Subject: Re: [RFC PATCH] ext4: dio take shared inode lock when overwriting
 preallocated blocks

On 2022/12/15 17:00, Jan Kara wrote:
> On Thu 15-12-22 16:24:49, Zhang Yi wrote:
>> On 2022/12/15 1:01, Jan Kara wrote:
>>> On Sat 03-12-22 18:39:56, Zhang Yi wrote:
>>>> In the dio write path, we only take shared inode lock for the case of
>>>> aligned overwriting initialized blocks inside EOF. But for overwriting
>>>> preallocated blocks, it may only need to split unwritten extents, this
>>>> procedure has been protected under i_data_sem lock, it's safe to
>>>> release the exclusive inode lock and take shared inode lock.
>>>> This could give a significant speed up for multi-threaded writes. Test
>>>> on Intel Xeon Gold 6140 and nvme SSD with below fio parameters.
>>>>  direct=1
>>>>  ioengine=libaio
>>>>  iodepth=10
>>>>  numjobs=10
>>>>  runtime=60
>>>>  rw=randwrite
>>>>  size=100G
>>>> And the test result are:
>>>> Before:
>>>>  bs=4k       IOPS=11.1k, BW=43.2MiB/s
>>>>  bs=16k      IOPS=11.1k, BW=173MiB/s
>>>>  bs=64k      IOPS=11.2k, BW=697MiB/s
>>>> After:
>>>>  bs=4k       IOPS=41.4k, BW=162MiB/s
>>>>  bs=16k      IOPS=41.3k, BW=646MiB/s
>>>>  bs=64k      IOPS=13.5k, BW=843MiB/s
>>>> Signed-off-by: Zhang Yi <>
>>>> ---
>>>>  It passed xfstests auto mode with 1k and 4k blocksize.
>>> Besides some naming nits (see below) I think this should work. But I have
>>> to say I'm a bit uneasy about this because we will now be changing block
>>> mapping from unwritten to written only with shared i_rwsem. OTOH that
>>> happens during writeback as well so we should be fine and the gain is very
>>> nice.
>> Thanks for advice, I will change the argument name to make it more reasonable.
>>> Out of curiosity do you have a real usecase for this?
>> No, I was just analyse the performance gap in our benchmark tests, and have
>> some question and idea while reading the code. Maybe it could speed up the
>> first time write in some database. :)
>> Besides, I want to discuss it a bit more. I originally changed this code to
>> switch to take the shared inode and also use ext4_iomap_overwrite_ops for
>> the overwriting preallocated blocks case. It will postpone the splitting extent
>> procedure to endio and could give a much more gain than this patch (+~27%).
>> This patch:
>>   bs=4k       IOPS=41.4k, BW=162MiB/s
>> Postpone split:
>>   bs=4k       IOPS=52.9k, BW=207MiB/s
>> But I think it's maybe too radical. I looked at the history and notice in
>> commit 0031462b5b39 ("ext4: Split uninitialized extents for direct I/O"),
>> it introduce the flag EXT4_GET_BLOCKS_DIO(now it had been renamed to
>> EXT4_GET_BLOCKS_PRE_IO) to make sure that the preallocated unwritten
>> extent have been splitted before submitting the I/O, which is used to
>> prevent the ENOSPC problem if the filesystem run out-of-space in the
>> endio procedure. And 4 years later, commit 27dd43854227 ("ext4: introduce
>> reserved space") reserve some blocks for metedata allocation.  It looks
>> like this commit could also slove the ENOSPC problem for most cases if we
>> postpone extent splitting into the endio procedure. I don't find other
>> side effect, so I think it may also fine if we do that. Do you have some
>> advice or am I missing something?
> So you are right these days splitting of extents could be done only on IO
> completion because we have a pool of blocks reserved for these cases. OTOH
> this will make the pressure on the reserved pool higher and if we are
> running out of space and there are enough operations running in parallel we
> *could* run out of reserved blocks. So I wouldn't always defer extent
> splitting to IO completion unless we have a practical and sufficiently
> widespread usecase that would benefit from this optimization.

Sure, I think so, thanks for advice.


Powered by blists - more mailing lists