Message-ID: <3947869f-90d4-4912-a42f-197147fe64f0@amd.com>
Date: Wed, 27 Nov 2024 17:48:31 +0530
From: Bharata B Rao <bharata@....com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, nikunj@....com,
willy@...radead.org, vbabka@...e.cz, david@...hat.com,
akpm@...ux-foundation.org, yuzhao@...gle.com, axboe@...nel.dk,
viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz,
joshdon@...gle.com, clm@...a.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@...il.com> wrote:
>>
>> On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@....com> wrote:
>>>
>>> Recently we discussed the scalability issues seen while running large
>>> instances of FIO with the buffered IO option on NVMe block devices here:
>>>
>>> https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>>>
>>> One of the suggestions Chris Mason gave (during private discussions) was
>>> to enable large folios in the block buffered IO path, as that could
>>> ease the scalability problems and reduce the lock contention.
>>>
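For context, such a change boils down to opting the bdev page cache into
large folios. Roughly (a sketch only; bdev_alloc() in block/bdev.c, where
the bdev inode is set up, is assumed here as the hook point):

diff --git a/block/bdev.c b/block/bdev.c
--- a/block/bdev.c
+++ b/block/bdev.c
@@ ... @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 	inode->i_mode = S_IFBLK;
 	inode->i_rdev = 0;
+	/* let the buffered IO path cache this bdev in large folios */
+	mapping_set_large_folios(inode->i_mapping);

mapping_set_large_folios() is the stock include/linux/pagemap.h helper
that filesystems such as XFS already use to enable large folios on a
mapping.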
>>
>> I have no basis to comment on the idea.
>>
>> However, it is pretty apparent that whatever the situation is, it is
>> being heavily disfigured by lock contention in blkdev_llseek:
>>
>>> perf-lock contention output
>>> ---------------------------
>>> The lock contention data doesn't look all that conclusive, but for the
>>> 30% rwmixwrite mix it looks like this:
>>>
>>> perf-lock contention default
>>> contended total wait max wait avg wait type caller
>>>
>>> 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
>>> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>>> 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
>>> 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
>>> 0xffffffff8f39e88f up_write+0x4f
>>> 0xffffffff8f9d598e blkdev_llseek+0x4e
>>> 0xffffffff8f703322 ksys_lseek+0x72
>>> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>>> 0xffffffff8f20b983 x64_sys_call+0x1fb3
>>> 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
>>> 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
>>> 0xffffffff903f18fb down_write+0x5b
>>> 0xffffffff8f9d5971 blkdev_llseek+0x31
>>> 0xffffffff8f703322 ksys_lseek+0x72
>>> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>>> 0xffffffff8f20b983 x64_sys_call+0x1fb3
>>> 0xffffffff903dce5e do_syscall_64+0x7e
>>> 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
>>
>> Admittedly I'm not familiar with this code, but at a quick glance the
>> lock can be just straight up removed here?
>>
>> 534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>> 535 {
>> 536         struct inode *bd_inode = bdev_file_inode(file);
>> 537         loff_t retval;
>> 538
>> 539         inode_lock(bd_inode);
>> 540         retval = fixed_size_llseek(file, offset, whence,
>>                                        i_size_read(bd_inode));
>> 541         inode_unlock(bd_inode);
>> 542         return retval;
>> 543 }
>>
>> At best it stabilizes the size for the duration of the call. That
>> sounds like it helps nothing: if the size can change anyway, the file
>> offset will still be altered as if there were no locking?
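For reference, the offset update in question bottoms out in vfs_setpos()
in fs/read_write.c, which ends in a plain store to f_pos that the bdev
inode lock does nothing to serialize. Abridged from the kernel source:

loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize)
{
	if (offset < 0 && !unsigned_offsets(file))
		return -EINVAL;
	if (offset > maxsize)
		return -EINVAL;

	if (offset != file->f_pos)
		file->f_pos = offset;
	return offset;
}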
>>
>> Suppose grabbing the size cannot be avoided for whatever reason.
>>
>> While the above fio invocation did not work for me, I ran some crapper
>> which I had in my shell history, and according to strace:
>> [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> [pid 271830] lseek(7, 0, SEEK_SET) = 0
>>
>> ... the lseeks just rewind to the beginning, *definitely* not needing
>> to know the size. One would have to check, but this is most likely the
>> case in your test as well.
>>
>> And for that there is 0 need to grab the size, and consequently the inode lock.
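To illustrate, a hypothetical fast path (a sketch, not a posted patch)
that satisfies SEEK_SET without touching the size or the inode lock;
vfs_setpos() is the stock helper that validates and stores the new
offset, and whether a seek past the device end must still be refused is
left open here:

static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
{
	struct inode *bd_inode = bdev_file_inode(file);

	/* SEEK_SET does not depend on the device size at all */
	if (whence == SEEK_SET)
		return vfs_setpos(file, offset, LLONG_MAX);

	/* i_size_read() needs no inode lock to be read safely */
	return fixed_size_llseek(file, offset, whence,
				 i_size_read(bd_inode));
}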
Here is the complete FIO cmdline I am using:
fio -filename=/dev/nvme1n1p1 -direct=0 -thread -size=800G -rw=rw
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=64k
-numjobs=1 -runtime=3600 --time_based -group_reporting -name=mytest
And that results in lseek patterns like these:
lseek(6, 0, SEEK_SET) = 0
lseek(6, 131072, SEEK_SET) = 131072
lseek(6, 65536, SEEK_SET) = 65536
lseek(6, 196608, SEEK_SET) = 196608
lseek(6, 131072, SEEK_SET) = 131072
lseek(6, 393216, SEEK_SET) = 393216
lseek(6, 196608, SEEK_SET) = 196608
lseek(6, 458752, SEEK_SET) = 458752
lseek(6, 262144, SEEK_SET) = 262144
lseek(6, 1114112, SEEK_SET) = 1114112
The lseeks are interspersed with read and write calls.
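For reference, fio's ioengine=sync presumably reduces to a
position-then-IO step per IO unit, something along these lines (a
sketch, not fio's actual source):

#include <sys/types.h>
#include <unistd.h>

/* one IO unit: position the file, then read or write */
static ssize_t sync_io(int fd, void *buf, size_t len, off_t off, int is_write)
{
	if (lseek(fd, off, SEEK_SET) == (off_t)-1)
		return -1;
	return is_write ? write(fd, buf, len) : read(fd, buf, len);
}

With -rw=rw the sequential read and write streams advance independently,
so the file position rarely matches the next IO's offset and nearly
every IO pays an lseek, which matches the interleaved pattern above.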
>
> That is to say, at a bare minimum, this needs to be benchmarked
> before/after with the lock removed from the picture, like so:
>
> diff --git a/block/fops.c b/block/fops.c
> index 2d01c9007681..7f9e9e2f9081 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
> static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> {
> struct inode *bd_inode = bdev_file_inode(file);
> - loff_t retval;
>
> - inode_lock(bd_inode);
> - retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> - inode_unlock(bd_inode);
> - return retval;
> + return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> }
>
> static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
>
> To be aborted if it blows up (but I don't see why it would).
Thanks for this fix, will try and get back with results.
Regards,
Bharata.