linux-kernel - Re: [RFC PATCH 0/1] Large folios in block buffered IO path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87plmf3oh8.fsf@gmail.com>
Date: Thu, 28 Nov 2024 11:10:35 +0530
From: Ritesh Harjani (IBM) <ritesh.list@...il.com>
To: Jan Kara <jack@...e.cz>, Mateusz Guzik <mjguzik@...il.com>
Cc: Bharata B Rao <bharata@....com>, linux-block@...r.kernel.org, linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, nikunj@....com, willy@...radead.org, vbabka@...e.cz, david@...hat.com, akpm@...ux-foundation.org, yuzhao@...gle.com, axboe@...nel.dk, viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz, joshdon@...gle.com, clm@...a.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path

Jan Kara <jack@...e.cz> writes:

> On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
>> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@...il.com> wrote:
>> >
>> > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@....com> wrote:
>> > >
>> > > Recently we discussed the scalability issues while running large
>> > > instances of FIO with buffered IO option on NVME block devices here:
>> > >
>> > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>> > >
>> > > One of the suggestions Chris Mason gave (during private discussions) was
>> > > to enable large folios in block buffered IO path as that could
>> > > improve the scalability problems and improve the lock contention
>> > > scenarios.
>> > >
>> >
>> > I have no basis to comment on the idea.
>> >
>> > However, it is pretty apparent whatever the situation it is being
>> > heavily disfigured by lock contention in blkdev_llseek:
>> >
>> > > perf-lock contention output
>> > > ---------------------------
>> > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
>> > > mix it looks like this:
>> > >
>> > > perf-lock contention default
>> > >  contended   total wait     max wait     avg wait         type   caller
>> > >
>> > > 1337359017     64.69 h     769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
>> > >                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>> > >                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>> > >                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
>> > >                         0xffffffff8f39e88f  up_write+0x4f
>> > >                         0xffffffff8f9d598e  blkdev_llseek+0x4e
>> > >                         0xffffffff8f703322  ksys_lseek+0x72
>> > >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>> > >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>> > >    2665573     64.38 h       1.98 s      86.95 ms      rwsem:W   blkdev_llseek+0x31
>> > >                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
>> > >                         0xffffffff903f18fb  down_write+0x5b
>> > >                         0xffffffff8f9d5971  blkdev_llseek+0x31
>> > >                         0xffffffff8f703322  ksys_lseek+0x72
>> > >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>> > >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>> > >                         0xffffffff903dce5e  do_syscall_64+0x7e
>> > >                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
>> >
>> > Admittedly I'm not familiar with this code, but at a quick glance the
>> > lock can be just straight up removed here?
>> >
>> >   534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>> >   535 {
>> >   536 │       struct inode *bd_inode = bdev_file_inode(file);
>> >   537 │       loff_t retval;
>> >   538 │
>> >   539 │       inode_lock(bd_inode);
>> >   540 │       retval = fixed_size_llseek(file, offset, whence,
>> > i_size_read(bd_inode));
>> >   541 │       inode_unlock(bd_inode);
>> >   542 │       return retval;
>> >   543 }
>> >
>> > At best it stabilizes the size for the duration of the call. Sounds
>> > like it helps nothing since if the size can change, the file offset
>> > will still be altered as if there was no locking?
>> >
>> > Suppose this cannot be avoided to grab the size for whatever reason.
>> >
>> > While the above fio invocation did not work for me, I ran some crapper
>> > which I had in my shell history and according to strace:
>> > [pid 271829] lseek(7, 0, SEEK_SET)      = 0
>> > [pid 271829] lseek(7, 0, SEEK_SET)      = 0
>> > [pid 271830] lseek(7, 0, SEEK_SET)      = 0
>> >
>> > ... the lseeks just rewind to the beginning, *definitely* not needing
>> > to know the size. One would have to check but this is most likely the
>> > case in your test as well.
>> >
>> > And for that there is 0 need to grab the size, and consequently the inode lock.
>> 
>> That is to say bare minimum this needs to be benchmarked before/after
>> with the lock removed from the picture, like so:
>
> Yeah, I've noticed this in the locking profiles as well and I agree
> bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
> get away without using inode lock in their llseek handler...
>

Right, we don't need an inode_lock() for i_size_read(). i_size_write()
still needs locking for serialization, mainly for 32bit SMP case, due
to use of seqcounts.
I guess it would be good to maybe add this in Documentation too rather
than this info just hanging on top of i_size_write()?

References
===========
[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/locking.rst#n557
[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/fs.h#n932
[3]: https://lore.kernel.org/all/20061016162729.176738000@szeredi.hu/

-ritesh