Message-ID: <CAGudoHHo4sLNpoVw-WTGVCc-gL0xguYWfUWfV1CSsQo6-bGnFg@mail.gmail.com>
Date: Thu, 28 Nov 2024 05:31:38 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Bharata B Rao <bharata@....com>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, nikunj@....com,
willy@...radead.org, vbabka@...e.cz, david@...hat.com,
akpm@...ux-foundation.org, yuzhao@...gle.com, axboe@...nel.dk,
viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz, joshdon@...gle.com,
clm@...a.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
On Thu, Nov 28, 2024 at 5:22 AM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@....com> wrote:
> >
> > The contention with inode_lock is gone after your above changes. The new
> > top 10 contention data looks like this now:
> >
> >  contended   total wait     max wait     avg wait         type   caller
> >
> > 2441494015     172.15 h      1.72 ms    253.83 us     spinlock   folio_wait_bit_common+0xd5
> >                         0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
> >                         0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
> >                         0xffffffffacdd1905 folio_wait_bit_common+0xd5
> >                         0xffffffffacdd2d0a filemap_get_pages+0x68a
> >                         0xffffffffacdd2e73 filemap_read+0x103
> >                         0xffffffffad1d67ba blkdev_read_iter+0x6a
> >                         0xffffffffacf06937 vfs_read+0x297
> >                         0xffffffffacf07653 ksys_read+0x73
> >   25269947       1.58 h      1.72 ms    225.44 us     spinlock   folio_wake_bit+0x62
> >                         0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
> >                         0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
> >                         0xffffffffacdcf322 folio_wake_bit+0x62
> >                         0xffffffffacdd2ca7 filemap_get_pages+0x627
> >                         0xffffffffacdd2e73 filemap_read+0x103
> >                         0xffffffffad1d67ba blkdev_read_iter+0x6a
> >                         0xffffffffacf06937 vfs_read+0x297
> >                         0xffffffffacf07653 ksys_read+0x73
> >   44757761       1.05 h      1.55 ms     84.41 us     spinlock   folio_wake_bit+0x62
> >                         0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
> >                         0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
> >                         0xffffffffacdcf322 folio_wake_bit+0x62
> >                         0xffffffffacdcf7bc folio_end_read+0x2c
> >                         0xffffffffacf6d4cf mpage_read_end_io+0x6f
> >                         0xffffffffad1d8abb bio_endio+0x12b
> >                         0xffffffffad1f07bd blk_mq_end_request_batch+0x12d
> >                         0xffffffffc05e4e9b nvme_pci_complete_batch+0xbb
> [snip]
> > However, a point of concern is that FIO bandwidth comes down drastically
> > after the change.
> >
>
> Nicely put :)
>
> >                  default                      inode_lock-fix
> > rw=30%
> > Instance 1       r=55.7GiB/s,w=23.9GiB/s      r=9616MiB/s,w=4121MiB/s
> > Instance 2       r=38.5GiB/s,w=16.5GiB/s      r=8482MiB/s,w=3635MiB/s
> > Instance 3       r=37.5GiB/s,w=16.1GiB/s      r=8609MiB/s,w=3690MiB/s
> > Instance 4       r=37.4GiB/s,w=16.0GiB/s      r=8486MiB/s,w=3637MiB/s
> >
>
> This means that the folio waiting stuff has poor scalability, but
> without digging into it I have no idea what can be done. The easy way
> out would be to speculatively spin before buggering off, but one would
> have to check what happens in real workloads -- presumably the lock
> owner can be off cpu for a long time (I presume there is no way to
> store the owner).
>
> The now-removed lock is an rwsem, which behaves better when contended
> and was pulling contention away from the folios, artificially *helping*
> performance by having the folio bottleneck exercised less.
>
> The right thing to do in the long run is still to whack the llseek
> lock acquire, but in the light of the above it can probably wait for
> better times.
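
To make the speculative-spin idea above concrete, here is a minimal
sketch (not a patch; the helper name and the bound are invented, while
folio_trylock(), need_resched() and cpu_relax() are the real
primitives it would sit in front of):

#include <linux/pagemap.h>	/* folio_trylock() */
#include <linux/sched.h>	/* need_resched() */

/* Illustrative only: spin a bounded number of times before the caller
 * falls back to folio_lock() and sleeps. */
static bool folio_lock_spin(struct folio *folio)
{
	unsigned int spins = 100;	/* arbitrary bound for the sketch */

	while (spins--) {
		if (folio_trylock(folio))
			return true;	/* got the lock without sleeping */
		if (need_resched())
			break;		/* yield instead of burning the cpu */
		cpu_relax();
	}
	return false;			/* caller does the usual folio_lock() */
}

The obvious failure mode is the one noted above: with no owner stored
there is no way to tell whether the holder is off cpu, so the spin can
be pure waste.
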
Willy mentioned the folio wait queue hash table could be grown; you
can find it in mm/filemap.c (around line 1062):
#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}
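
If someone wants to experiment, the cheapest thing is to bump
PAGE_WAIT_TABLE_BITS; a fancier option would be sizing the table at
boot the way the dentry/inode hash tables are. A sketch of the latter
(the function and variable names plus the scale factor are guesses;
alloc_large_system_hash() and init_waitqueue_head() are real):

#include <linux/memblock.h>	/* alloc_large_system_hash() */
#include <linux/wait.h>		/* init_waitqueue_head() */

/* Sketch only: would replace the static 256-entry array above. */
static wait_queue_head_t *folio_wait_table __ro_after_init;
static unsigned int folio_wait_table_shift __ro_after_init;

void __init folio_wait_table_init(void)
{
	unsigned int i;

	folio_wait_table = alloc_large_system_hash("folio-wait",
				sizeof(wait_queue_head_t),
				0,	/* 0: size from available memory */
				18,	/* ~1 bucket per 256KB, a guess */
				0,	/* no special flags */
				&folio_wait_table_shift,
				NULL,	/* mask not needed */
				1 << PAGE_WAIT_TABLE_BITS, /* floor: current size */
				0);	/* no explicit upper limit */
	for (i = 0; i < (1U << folio_wait_table_shift); i++)
		init_waitqueue_head(&folio_wait_table[i]);
}

folio_waitqueue() would then hash with folio_wait_table_shift instead
of the fixed PAGE_WAIT_TABLE_BITS. Whether a bigger table actually
helps depends on how much of the contention is on the same folio
versus hash collisions between unrelated folios.
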
Can you collect off-cpu time? offcputime-bpfcc -K > /tmp/out
On Debian this ships with the bpfcc-tools package.
--
Mateusz Guzik <mjguzik gmail.com>