Message-ID: <20241127054737.33351-1-bharata@amd.com>
Date: Wed, 27 Nov 2024 11:17:36 +0530
From: Bharata B Rao <bharata@....com>
To: <linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-fsdevel@...r.kernel.org>, <linux-mm@...ck.org>
CC: <nikunj@....com>, <willy@...radead.org>, <vbabka@...e.cz>,
<david@...hat.com>, <akpm@...ux-foundation.org>, <yuzhao@...gle.com>,
<mjguzik@...il.com>, <axboe@...nel.dk>, <viro@...iv.linux.org.uk>,
<brauner@...nel.org>, <jack@...e.cz>, <joshdon@...gle.com>, <clm@...a.com>,
Bharata B Rao <bharata@....com>
Subject: [RFC PATCH 0/1] Large folios in block buffered IO path
Recently we discussed the scalability issues seen while running large
instances of FIO with buffered IO on NVMe block devices here:
https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
One of the suggestions Chris Mason gave (during private discussions) was
to enable large folios in the block buffered IO path, as that could
alleviate the scalability problems and reduce the lock contention.
This is an attempt to check the feasibility and potential benefit of doing
so. To keep the changes to a minimum, and to test this non-disruptively
for only the required block device, I have added an ioctl that enables
large folio support on the block device mapping. I understand that this
is not the right way to do this, but it is just an experiment to evaluate
the potential benefit.
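
As a rough illustration (not the actual patch, which follows this cover
letter), the ioctl handling could reduce to something like the sketch
below, assuming it is dispatched from blkdev_common_ioctl() in
block/ioctl.c and that marking the bdev page-cache mapping with
mapping_set_large_folios() is all that is needed:

/*
 * Hypothetical sketch of the BLKSETLFOLIO handling: it only marks the
 * block device's page-cache mapping as supporting large folios, so that
 * subsequent buffered IO through the page cache may allocate folios
 * larger than a single page.
 */
static int blk_ioctl_set_large_folios(struct block_device *bdev)
{
	if (!capable(CAP_SYS_ADMIN))
		return -EACCES;

	mapping_set_large_folios(bdev->bd_mapping);
	return 0;
}
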
Experimental setup
------------------
A 2-node Zen5-based AMD EPYC server with 512G of memory in each node.
Disk layout for FIO:
nvme2n1 259:12 0 3.5T 0 disk
├─nvme2n1p1 259:13 0 894.3G 0 part
├─nvme2n1p2 259:14 0 894.3G 0 part
├─nvme2n1p3 259:15 0 894.3G 0 part
└─nvme2n1p4 259:16 0 894.1G 0 part
Four parallel instances of FIO are run on the above 4 partitions with
the following options:
-filename=/dev/nvme2n1p[1,2,3,4] -direct=0 -thread -size=800G -rw=rw -rwmixwrite=[10,30,50] --norandommap --randrepeat=0 -ioengine=sync -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting
Results
-------
default: Unmodified kernel and FIO.
patched: Kernel with the BLKSETLFOLIO ioctl (introduced in this patchset)
and FIO modified to issue that ioctl.
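
For illustration, the FIO modification amounts to issuing the new ioctl
on the target device before starting IO. A minimal standalone sketch of
that (assuming BLKSETLFOLIO is provided by the patched uapi header and
takes no argument) would be:

/* Hypothetical userspace usage of the BLKSETLFOLIO ioctl added by this
 * patchset; FIO was modified to do the equivalent on the target device. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKSETLFOLIO from the patched uapi header */

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDWR);	/* e.g. /dev/nvme2n1p1 */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Enable large folios on the bdev mapping before doing buffered IO. */
	if (ioctl(fd, BLKSETLFOLIO) < 0) {
		perror("ioctl(BLKSETLFOLIO)");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}
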
In the table below, r is the READ bandwidth and w is the WRITE bandwidth
reported by FIO.
                        default                         patched

ro (w/o -rw=rw option)
Instance 1              r=12.3GiB/s                     r=39.4GiB/s
Instance 2              r=12.2GiB/s                     r=39.1GiB/s
Instance 3              r=16.3GiB/s                     r=37.1GiB/s
Instance 4              r=14.9GiB/s                     r=42.9GiB/s

rwmixwrite=10%
Instance 1              r=27.5GiB/s,w=3125MiB/s         r=75.9GiB/s,w=8636MiB/s
Instance 2              r=25.5GiB/s,w=2898MiB/s         r=87.6GiB/s,w=9967MiB/s
Instance 3              r=25.7GiB/s,w=2922MiB/s         r=78.3GiB/s,w=8904MiB/s
Instance 4              r=27.5GiB/s,w=3134MiB/s         r=73.5GiB/s,w=8365MiB/s

rwmixwrite=30%
Instance 1              r=55.7GiB/s,w=23.9GiB/s         r=59.2GiB/s,w=25.4GiB/s
Instance 2              r=38.5GiB/s,w=16.5GiB/s         r=57.6GiB/s,w=24.7GiB/s
Instance 3              r=37.5GiB/s,w=16.1GiB/s         r=59.5GiB/s,w=25.5GiB/s
Instance 4              r=37.4GiB/s,w=16.0GiB/s         r=63.3GiB/s,w=27.1GiB/s

rwmixwrite=50%
Instance 1              r=37.1GiB/s,w=37.1GiB/s         r=40.7GiB/s,w=40.7GiB/s
Instance 2              r=37.6GiB/s,w=37.6GiB/s         r=45.9GiB/s,w=45.9GiB/s
Instance 3              r=35.1GiB/s,w=35.1GiB/s         r=49.2GiB/s,w=49.2GiB/s
Instance 4              r=43.6GiB/s,w=43.6GiB/s         r=41.2GiB/s,w=41.2GiB/s
Summary of FIO throughput
-------------------------
- Significant increase (~3x) in bandwidth for the read-only case.
- Significant increase (~3x) in bandwidth for the 10% write mix.
- Good gains (~1.15x to 1.5x) for the 30% and 50% write mixes.
perf-lock contention output
---------------------------
The lock contention data doesn't look all that conclusive, but for the
30% rwmixwrite mix it looks like this:
perf-lock contention default
contended total wait max wait avg wait type caller
1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
0xffffffff8f39e88f up_write+0x4f
0xffffffff8f9d598e blkdev_llseek+0x4e
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
0xffffffff903dce5e do_syscall_64+0x7e
0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
134057198 14.27 h 35.93 ms 383.14 us spinlock clear_shadow_entries+0x57
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5e7967 clear_shadow_entries+0x57
0xffffffff8f5e90e3 mapping_try_invalidate+0x163
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
0xffffffff8f9faea1 blkdev_ioctl+0xc1
33351524 1.76 h 35.86 ms 190.43 us spinlock __remove_mapping+0x5d
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5ec71d __remove_mapping+0x5d
0xffffffff8f5f9be6 remove_mapping+0x16
0xffffffff8f5e8f5b mapping_evict_folio+0x7b
0xffffffff8f5e9068 mapping_try_invalidate+0xe8
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
9448820 14.96 m 1.54 ms 95.01 us spinlock folio_lruvec_lock_irqsave+0x64
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
0xffffffff8f5e587c folio_batch_move_lru+0x5c
0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
0xffffffff8f5e7593 deactivate_file_folio+0x43
0xffffffff8f5e90b7 mapping_try_invalidate+0x137
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
1488531 11.07 m 1.07 ms 446.39 us spinlock try_to_free_buffers+0x56
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f768c76 try_to_free_buffers+0x56
0xffffffff8f5cf647 filemap_release_folio+0x87
0xffffffff8f5e8f4c mapping_evict_folio+0x6c
0xffffffff8f5e9068 mapping_try_invalidate+0xe8
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
2556868 6.78 m 474.72 us 159.07 us spinlock blkdev_llseek+0x31
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5d01 _raw_spin_lock_irq+0x51
0xffffffff903f14c4 rwsem_down_write_slowpath+0x274
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
2512627 3.75 m 450.96 us 89.55 us spinlock blkdev_llseek+0x31
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5d01 _raw_spin_lock_irq+0x51
0xffffffff903f12f0 rwsem_down_write_slowpath+0xa0
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
908184 1.52 m 439.58 us 100.58 us spinlock blkdev_llseek+0x31
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5d01 _raw_spin_lock_irq+0x51
0xffffffff903f1367 rwsem_down_write_slowpath+0x117
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
134 1.48 m 1.22 s 663.88 ms mutex bdev_release+0x69
0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
0xffffffff903ef863 __mutex_lock_slowpath+0x13
0xffffffff903ef8bb mutex_lock+0x3b
0xffffffff8f9d5249 bdev_release+0x69
0xffffffff8f9d5921 blkdev_release+0x11
0xffffffff8f7089f3 __fput+0xe3
0xffffffff8f708c9b __fput_sync+0x1b
0xffffffff8f6fe8ed __x64_sys_close+0x3d
perf-lock contention patched
contended total wait max wait avg wait type caller
1153627 40.15 h 48.67 s 125.30 ms rwsem:W blkdev_llseek+0x31
0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
0xffffffff903dce5e do_syscall_64+0x7e
0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
276512439 39.19 h 46.90 ms 510.22 us spinlock clear_shadow_entries+0x57
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5e7967 clear_shadow_entries+0x57
0xffffffff8f5e90e3 mapping_try_invalidate+0x163
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
0xffffffff8f9faea1 blkdev_ioctl+0xc1
763119320 26.37 h 887.44 us 124.38 us spinlock rwsem_wake.isra.0+0x42
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
0xffffffff8f39e88f up_write+0x4f
0xffffffff8f9d598e blkdev_llseek+0x4e
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
33263910 2.87 h 29.43 ms 310.56 us spinlock __remove_mapping+0x5d
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5ec71d __remove_mapping+0x5d
0xffffffff8f5f9be6 remove_mapping+0x16
0xffffffff8f5e8f5b mapping_evict_folio+0x7b
0xffffffff8f5e9068 mapping_try_invalidate+0xe8
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
58671816 2.50 h 519.68 us 153.45 us spinlock folio_lruvec_lock_irqsave+0x64
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
0xffffffff8f5e587c folio_batch_move_lru+0x5c
0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
0xffffffff8f5e7593 deactivate_file_folio+0x43
0xffffffff8f5e90b7 mapping_try_invalidate+0x137
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
284 22.33 m 5.35 s 4.72 s mutex bdev_release+0x69
0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
0xffffffff903ef863 __mutex_lock_slowpath+0x13
0xffffffff903ef8bb mutex_lock+0x3b
0xffffffff8f9d5249 bdev_release+0x69
0xffffffff8f9d5921 blkdev_release+0x11
0xffffffff8f7089f3 __fput+0xe3
0xffffffff8f708c9b __fput_sync+0x1b
0xffffffff8f6fe8ed __x64_sys_close+0x3d
2181469 21.38 m 1.15 ms 587.98 us spinlock try_to_free_buffers+0x56
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f768c76 try_to_free_buffers+0x56
0xffffffff8f5cf647 filemap_release_folio+0x87
0xffffffff8f5e8f4c mapping_evict_folio+0x6c
0xffffffff8f5e9068 mapping_try_invalidate+0xe8
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
454398 4.22 m 37.54 ms 557.13 us spinlock __remove_mapping+0x5d
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5ec71d __remove_mapping+0x5d
0xffffffff8f5f4f04 shrink_folio_list+0xbc4
0xffffffff8f5f5a6b evict_folios+0x34b
0xffffffff8f5f772f try_to_shrink_lruvec+0x20f
0xffffffff8f5f79ef shrink_one+0x10f
0xffffffff8f5fb975 shrink_node+0xb45
773 3.53 m 2.60 s 273.76 ms mutex __lru_add_drain_all+0x3a
0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
0xffffffff903ef863 __mutex_lock_slowpath+0x13
0xffffffff903ef8bb mutex_lock+0x3b
0xffffffff8f5e3d7a __lru_add_drain_all+0x3a
0xffffffff8f5e77a0 lru_add_drain_all+0x10
0xffffffff8f9d3861 invalidate_bdev+0x31
0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
0xffffffff8f9faea1 blkdev_ioctl+0xc1
1997851 3.09 m 651.65 us 92.83 us spinlock folio_lruvec_lock_irqsave+0x64
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
0xffffffff8f5e587c folio_batch_move_lru+0x5c
0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
0xffffffff8f5e5ae4 folio_add_lru+0x54
0xffffffff8f5d075d filemap_add_folio+0xcd
0xffffffff8f5e30c0 page_cache_ra_order+0x220
Observations from perf-lock contention
--------------------------------------
- Significant reduction in contention for the inode lock (inode->i_rwsem)
  in the blkdev_llseek() path.
- Significant increase in contention for inode->i_lock in the invalidate
  and remove_mapping paths.
- Significant increase in contention for the lruvec spinlock in the
  deactivate_file_folio() path.
I request comments on the above and am specifically looking for inputs
on the following:
- The lock contention results and the usefulness of large folios in
  bringing down the contention in this specific case.
- If enabling large folios in the block buffered IO path is a feasible
  approach, inputs on doing this cleanly and correctly.
Bharata B Rao (1):
block/ioctl: Add an ioctl to enable large folios for block buffered IO
path
block/ioctl.c | 8 ++++++++
include/uapi/linux/fs.h | 2 ++
2 files changed, 10 insertions(+)
--
2.34.1