Message-ID: <CAGudoHEvrML100XBTT=sBDud5L2zeQ3ja5BmBCL2TTYYoEC55A@mail.gmail.com>
Date: Wed, 27 Nov 2024 07:19:59 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Bharata B Rao <bharata@....com>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, nikunj@....com, 
	willy@...radead.org, vbabka@...e.cz, david@...hat.com, 
	akpm@...ux-foundation.org, yuzhao@...gle.com, axboe@...nel.dk, 
	viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz, joshdon@...gle.com, 
	clm@...a.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path

On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@....com> wrote:
> >
> > Recently we discussed the scalability issues while running large
> > instances of FIO with buffered IO option on NVME block devices here:
> >
> > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
> >
> > One of the suggestions Chris Mason gave (during private discussions) was
> > to enable large folios in block buffered IO path as that could
> > improve the scalability problems and improve the lock contention
> > scenarios.
> >
>
> I have no basis to comment on the idea.
>
> However, it is pretty apparent that, whatever the situation, it is
> being heavily disfigured by lock contention in blkdev_llseek:
>
> > perf-lock contention output
> > ---------------------------
> > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> > mix it looks like this:
> >
> > perf-lock contention default
> >  contended   total wait     max wait     avg wait         type   caller
> >
> > 1337359017     64.69 h     769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
> >                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
> >                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
> >                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
> >                         0xffffffff8f39e88f  up_write+0x4f
> >                         0xffffffff8f9d598e  blkdev_llseek+0x4e
> >                         0xffffffff8f703322  ksys_lseek+0x72
> >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
> >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
> >    2665573     64.38 h       1.98 s      86.95 ms      rwsem:W   blkdev_llseek+0x31
> >                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
> >                         0xffffffff903f18fb  down_write+0x5b
> >                         0xffffffff8f9d5971  blkdev_llseek+0x31
> >                         0xffffffff8f703322  ksys_lseek+0x72
> >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
> >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
> >                         0xffffffff903dce5e  do_syscall_64+0x7e
> >                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
>
> Admittedly I'm not familiar with this code, but at a quick glance the
> lock can be just straight up removed here?
>
>   534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>   535 {
>   536 │       struct inode *bd_inode = bdev_file_inode(file);
>   537 │       loff_t retval;
>   538 │
>   539 │       inode_lock(bd_inode);
>   540 │       retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>   541 │       inode_unlock(bd_inode);
>   542 │       return retval;
>   543 }
>
> At best it stabilizes the size for the duration of the call. That
> sounds like it helps nothing: if the size can change concurrently, the
> file offset will still end up altered as if there was no locking?
>
> Suppose grabbing the size cannot be avoided for whatever reason.
>
> While the above fio invocation did not work for me, I ran some crapper
> which I had in my shell history and according to strace:
> [pid 271829] lseek(7, 0, SEEK_SET)      = 0
> [pid 271829] lseek(7, 0, SEEK_SET)      = 0
> [pid 271830] lseek(7, 0, SEEK_SET)      = 0
>
> ... the lseeks just rewind to the beginning, *definitely* not needing
> to know the size. One would have to check but this is most likely the
> case in your test as well.
>
> And for that there is 0 need to grab the size, and consequently the inode lock.
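
The point above — that a plain SEEK_SET (or SEEK_CUR) lseek never needs the size — can be illustrated with a userspace model of the seek arithmetic. This is a sketch only, not the kernel's actual fixed_size_llseek (which also bounds-checks the result):

```c
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/*
 * Userspace model of the seek arithmetic: only SEEK_END consults the
 * size, so SEEK_SET/SEEK_CUR could in principle skip i_size_read()
 * (and with it the inode lock) entirely.
 */
static long long seek_model(long long cur, long long off, int whence,
			    long long size)
{
	switch (whence) {
	case SEEK_SET:
		return off;		/* size unused */
	case SEEK_CUR:
		return cur + off;	/* size unused */
	case SEEK_END:
		return size + off;	/* the only case needing the size */
	default:
		return -1;
	}
}
```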

That is to say, at bare minimum this needs to be benchmarked
before/after with the lock removed from the picture, like so:

diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..7f9e9e2f9081 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
 {
        struct inode *bd_inode = bdev_file_inode(file);
-       loff_t retval;

-       inode_lock(bd_inode);
-       retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
-       inode_unlock(bd_inode);
-       return retval;
+       return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
 }

 static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,

To be aborted if it blows up (but I don't see why it would).
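
The lseek(fd, 0, SEEK_SET) storm from the strace output can also be reproduced standalone from userspace; the sketch below just hammers SEEK_SET on one shared fd from several threads (the thread count is an arbitrary placeholder, and the device path is whatever gets passed in — neither comes from this thread):

```c
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define NTHREADS 8	/* arbitrary placeholder */

struct seeker {
	int fd;
	long iters;
	long done;
	pthread_t tid;
};

/* Each thread rewinds the shared fd in a tight loop, as in the strace. */
static void *hammer(void *arg)
{
	struct seeker *s = arg;

	for (long i = 0; i < s->iters; i++)
		if (lseek(s->fd, 0, SEEK_SET) == 0)
			s->done++;
	return NULL;
}

/* Returns total successful SEEK_SET calls, or -1 if open fails. */
static long bench_lseek(const char *path, long iters)
{
	struct seeker s[NTHREADS];
	long total = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	for (int i = 0; i < NTHREADS; i++) {
		s[i] = (struct seeker){ .fd = fd, .iters = iters };
		pthread_create(&s[i].tid, NULL, hammer, &s[i]);
	}
	for (int i = 0; i < NTHREADS; i++) {
		pthread_join(s[i].tid, NULL);
		total += s[i].done;
	}
	close(fd);
	return total;
}
```

Timing bench_lseek() against the block device before and after the patch should show directly whether the inode lock is what the threads pile up on.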

-- 
Mateusz Guzik <mjguzik gmail.com>
