linux-kernel - Re: [RFC PATCH 0/1] Large folios in block buffered IO path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAGudoHHBu663RSjQUwi14_d+Ln6mw_ESvYCc6dTec-O0Wi1-Eg@mail.gmail.com>
Date: Thu, 28 Nov 2024 05:22:41 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Bharata B Rao <bharata@....com>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, nikunj@....com, 
	willy@...radead.org, vbabka@...e.cz, david@...hat.com, 
	akpm@...ux-foundation.org, yuzhao@...gle.com, axboe@...nel.dk, 
	viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz, joshdon@...gle.com, 
	clm@...a.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path

On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@....com> wrote:
>
> The contention with inode_lock is gone after your above changes. The new
> top 10 contention data looks like this now:
>
>   contended   total wait     max wait     avg wait         type   caller
>
> 2441494015    172.15 h       1.72 ms    253.83 us     spinlock
> folio_wait_bit_common+0xd5
>                          0xffffffffadbf60a3
> native_queued_spin_lock_slowpath+0x1f3
>                          0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
>                          0xffffffffacdd1905  folio_wait_bit_common+0xd5
>                          0xffffffffacdd2d0a  filemap_get_pages+0x68a
>                          0xffffffffacdd2e73  filemap_read+0x103
>                          0xffffffffad1d67ba  blkdev_read_iter+0x6a
>                          0xffffffffacf06937  vfs_read+0x297
>                          0xffffffffacf07653  ksys_read+0x73
>    25269947      1.58 h       1.72 ms    225.44 us     spinlock
> folio_wake_bit+0x62
>                          0xffffffffadbf60a3
> native_queued_spin_lock_slowpath+0x1f3
>                          0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
>                          0xffffffffacdcf322  folio_wake_bit+0x62
>                          0xffffffffacdd2ca7  filemap_get_pages+0x627
>                          0xffffffffacdd2e73  filemap_read+0x103
>                          0xffffffffad1d67ba  blkdev_read_iter+0x6a
>                          0xffffffffacf06937  vfs_read+0x297
>                          0xffffffffacf07653  ksys_read+0x73
>    44757761      1.05 h       1.55 ms     84.41 us     spinlock
> folio_wake_bit+0x62
>                          0xffffffffadbf60a3
> native_queued_spin_lock_slowpath+0x1f3
>                          0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
>                          0xffffffffacdcf322  folio_wake_bit+0x62
>                          0xffffffffacdcf7bc  folio_end_read+0x2c
>                          0xffffffffacf6d4cf  mpage_read_end_io+0x6f
>                          0xffffffffad1d8abb  bio_endio+0x12b
>                          0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
>                          0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb
[snip]
> However a point of concern is that FIO bandwidth comes down drastically
> after the change.
>

Nicely put :)

>                 default                         inode_lock-fix
> rw=30%
> Instance 1      r=55.7GiB/s,w=23.9GiB/s         r=9616MiB/s,w=4121MiB/s
> Instance 2      r=38.5GiB/s,w=16.5GiB/s         r=8482MiB/s,w=3635MiB/s
> Instance 3      r=37.5GiB/s,w=16.1GiB/s         r=8609MiB/s,w=3690MiB/s
> Instance 4      r=37.4GiB/s,w=16.0GiB/s         r=8486MiB/s,w=3637MiB/s
>

This means that the folio waiting stuff has poor scalability, but
without digging into it I have no idea what can be done. The easy way
out would be to speculatively spin before buggering off, but one would
have to check what happens in real workloads -- presumably the lock
owner can be off cpu for a long time (I presume there is no way to
store the owner).

The now-removed lock uses rwsems which behave better when contested
and was pulling contention away from folios, artificially *helping*
performance by having the folio bottleneck be exercised less.

The right thing to do in the long run is still to whack the llseek
lock acquire, but in the light of the above it can probably wait for
better times.
-- 
Mateusz Guzik <mjguzik gmail.com>