Message-ID: <5a517b3a-51b2-45d6-bea3-4a64b75dfd30@amd.com>
Date: Thu, 28 Nov 2024 09:31:50 +0530
From: Bharata B Rao <bharata@....com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, nikunj@....com,
willy@...radead.org, vbabka@...e.cz, david@...hat.com,
akpm@...ux-foundation.org, yuzhao@...gle.com, axboe@...nel.dk,
viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz,
joshdon@...gle.com, clm@...a.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
On 27-Nov-24 5:58 PM, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 1:18 PM Bharata B Rao <bharata@....com> wrote:
>>
>> On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
>>> That is to say bare minimum this needs to be benchmarked before/after
>>> with the lock removed from the picture, like so:
>>>
>>> diff --git a/block/fops.c b/block/fops.c
>>> index 2d01c9007681..7f9e9e2f9081 100644
>>> --- a/block/fops.c
>>> +++ b/block/fops.c
>>> @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
>>> static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>>> {
>>> struct inode *bd_inode = bdev_file_inode(file);
>>> - loff_t retval;
>>>
>>> - inode_lock(bd_inode);
>>> - retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>>> - inode_unlock(bd_inode);
>>> - return retval;
>>> + return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>>> }
>>>
>>> static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
>>>
>>> To be aborted if it blows up (but I don't see why it would).
>>
>> Thanks for this fix, will try and get back with results.
>>
>
> Please make sure to have results just with this change, no messing
> with folio sizes so that I have something for the patch submission.
The inode_lock contention is gone after the above change. The new
top 10 contention data now looks like this:
 contended   total wait     max wait     avg wait         type   caller

2441494015     172.15 h      1.72 ms    253.83 us     spinlock   folio_wait_bit_common+0xd5
            0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
            0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
            0xffffffffacdd1905  folio_wait_bit_common+0xd5
            0xffffffffacdd2d0a  filemap_get_pages+0x68a
            0xffffffffacdd2e73  filemap_read+0x103
            0xffffffffad1d67ba  blkdev_read_iter+0x6a
            0xffffffffacf06937  vfs_read+0x297
            0xffffffffacf07653  ksys_read+0x73

  25269947       1.58 h      1.72 ms    225.44 us     spinlock   folio_wake_bit+0x62
            0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
            0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
            0xffffffffacdcf322  folio_wake_bit+0x62
            0xffffffffacdd2ca7  filemap_get_pages+0x627
            0xffffffffacdd2e73  filemap_read+0x103
            0xffffffffad1d67ba  blkdev_read_iter+0x6a
            0xffffffffacf06937  vfs_read+0x297
            0xffffffffacf07653  ksys_read+0x73

  44757761       1.05 h      1.55 ms     84.41 us     spinlock   folio_wake_bit+0x62
            0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
            0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
            0xffffffffacdcf322  folio_wake_bit+0x62
            0xffffffffacdcf7bc  folio_end_read+0x2c
            0xffffffffacf6d4cf  mpage_read_end_io+0x6f
            0xffffffffad1d8abb  bio_endio+0x12b
            0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
            0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb

  66419830      53.00 m    685.29 us     47.88 us     spinlock   __filemap_add_folio+0x14c
            0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
            0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
            0xffffffffacdd037c  __filemap_add_folio+0x14c
            0xffffffffacdd072c  filemap_add_folio+0x9c
            0xffffffffacde2a4c  page_cache_ra_unbounded+0x12c
            0xffffffffacde31e0  page_cache_ra_order+0x340
            0xffffffffacde33b8  page_cache_sync_ra+0x148
            0xffffffffacdd27c5  filemap_get_pages+0x145

       532      45.51 m       5.49 s       5.13 s        mutex   bdev_release+0x69
            0xffffffffadbef1de  __mutex_lock.constprop.0+0x17e
            0xffffffffadbef863  __mutex_lock_slowpath+0x13
            0xffffffffadbef8bb  mutex_lock+0x3b
            0xffffffffad1d5249  bdev_release+0x69
            0xffffffffad1d5921  blkdev_release+0x11
            0xffffffffacf089f3  __fput+0xe3
            0xffffffffacf08c9b  __fput_sync+0x1b
            0xffffffffacefe8ed  __x64_sys_close+0x3d

 264989621      45.26 m    486.17 us     10.25 us     spinlock   raw_spin_rq_lock_nested+0x1d
            0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
            0xffffffffadbf5c7f  _raw_spin_lock+0x3f
            0xffffffffacb4f22d  raw_spin_rq_lock_nested+0x1d
            0xffffffffacb5b601  _raw_spin_rq_lock_irqsave+0x21
            0xffffffffacb6f81b  sched_balance_rq+0x60b
            0xffffffffacb70784  sched_balance_newidle+0x1f4
            0xffffffffacb70ae4  pick_next_task_fair+0x54
            0xffffffffacb4e583  __pick_next_task+0x43

  30026842      26.05 m    652.85 us     52.05 us     spinlock   __folio_mark_dirty+0x2b
            0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
            0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
            0xffffffffacde189b  __folio_mark_dirty+0x2b
            0xffffffffacf67503  mark_buffer_dirty+0x53
            0xffffffffacf676e2  __block_commit_write+0x82
            0xffffffffacf685fc  block_write_end+0x3c
            0xffffffffacfb577e  iomap_write_end+0x13e
            0xffffffffacfb674c  iomap_file_buffered_write+0x29c

   5085820       7.07 m      1.24 ms     83.42 us     spinlock   folio_wake_bit+0x62
            0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
            0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
            0xffffffffacdcf322  folio_wake_bit+0x62
            0xffffffffacdcf7bc  folio_end_read+0x2c
            0xffffffffacf6d4cf  mpage_read_end_io+0x6f
            0xffffffffad1d8abb  bio_endio+0x12b
            0xffffffffad1ee8ca  blk_update_request+0x1aa
            0xffffffffad1ef7a4  blk_mq_end_request+0x24

   8027370       1.90 m     76.21 us     14.23 us     spinlock   tick_do_update_jiffies64+0x29
            0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
            0xffffffffadbf5c7f  _raw_spin_lock+0x3f
            0xffffffffacc21c69  tick_do_update_jiffies64+0x29
            0xffffffffacc21e8c  tick_nohz_handler+0x13c
            0xffffffffacc0a0cf  __hrtimer_run_queues+0x10f
            0xffffffffacc0afe8  hrtimer_interrupt+0xf8
            0xffffffffacaa8f36  __sysvec_apic_timer_interrupt+0x56
            0xffffffffadbe3953  sysvec_apic_timer_interrupt+0x93

   4344269       1.08 m    600.67 us     14.90 us     spinlock   __filemap_add_folio+0x14c
            0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
            0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
            0xffffffffacdd037c  __filemap_add_folio+0x14c
            0xffffffffacdd072c  filemap_add_folio+0x9c
            0xffffffffacde2a4c  page_cache_ra_unbounded+0x12c
            0xffffffffacde31e0  page_cache_ra_order+0x340
            0xffffffffacde3615  page_cache_async_ra+0x165
            0xffffffffacdd2b9c  filemap_get_pages+0x51c
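
The table above is in the format emitted by perf lock contention with
callstacks enabled. For anyone wanting to reproduce this kind of per-lock
breakdown, an invocation along the following lines should work; the flags
and the 60-second window shown here are indicative only, not necessarily
the exact command used for the numbers above:

    # System-wide (-a) lock contention profiling via BPF (-b), printing
    # callstacks in verbose mode (-v) and limiting the output to the
    # top 10 contended locks (-E 10), sampled over a 60-second window.
    perf lock contention -a -b -v -E 10 -- sleep 60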
However, a point of concern is that FIO bandwidth drops drastically
after the change:
                     default                      inode_lock-fix
rw=30%
Instance 1    r=55.7GiB/s,w=23.9GiB/s    r=9616MiB/s,w=4121MiB/s
Instance 2    r=38.5GiB/s,w=16.5GiB/s    r=8482MiB/s,w=3635MiB/s
Instance 3    r=37.5GiB/s,w=16.1GiB/s    r=8609MiB/s,w=3690MiB/s
Instance 4    r=37.4GiB/s,w=16.0GiB/s    r=8486MiB/s,w=3637MiB/s
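
The numbers above are from fio instances doing buffered reads and writes
against NVMe block devices; rw=30% denotes a mixed workload with roughly
a 30% write share, consistent with the read/write split in the table.
The exact job file is not reproduced here, so the sketch below is only
indicative: the access pattern, device path, block size, job count and
runtime are placeholders rather than the actual test parameters.

    [global]
    ; buffered I/O through the page cache (this thread is about the
    ; block buffered IO path), not direct IO
    ioengine=psync
    direct=0
    ; mixed read/write with ~30% writes to match the rw=30% row above;
    ; whether the pattern was random or sequential is an assumption
    rw=randrw
    rwmixwrite=30
    ; block size, runtime and reporting settings are placeholders
    bs=64k
    runtime=300
    time_based=1
    group_reporting=1

    [bdev]
    ; placeholder device; each fio instance targets its own disk
    filename=/dev/nvme0n1
    numjobs=32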
Regards,
Bharata.