Message-ID: <5a517b3a-51b2-45d6-bea3-4a64b75dfd30@amd.com>
Date: Thu, 28 Nov 2024 09:31:50 +0530
From: Bharata B Rao <bharata@....com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, nikunj@....com,
 willy@...radead.org, vbabka@...e.cz, david@...hat.com,
 akpm@...ux-foundation.org, yuzhao@...gle.com, axboe@...nel.dk,
 viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz,
 joshdon@...gle.com, clm@...a.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path

On 27-Nov-24 5:58 PM, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 1:18 PM Bharata B Rao <bharata@....com> wrote:
>>
>> On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
>>> That is to say bare minimum this needs to be benchmarked before/after
>>> with the lock removed from the picture, like so:
>>>
>>> diff --git a/block/fops.c b/block/fops.c
>>> index 2d01c9007681..7f9e9e2f9081 100644
>>> --- a/block/fops.c
>>> +++ b/block/fops.c
>>> @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
>>>    static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>>>    {
>>>           struct inode *bd_inode = bdev_file_inode(file);
>>> -       loff_t retval;
>>>
>>> -       inode_lock(bd_inode);
>>> -       retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>>> -       inode_unlock(bd_inode);
>>> -       return retval;
>>> +       return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>>>    }
>>>
>>>    static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
>>>
>>> To be aborted if it blows up (but I don't see why it would).
>>
>> Thanks for this fix, will try and get back with results.
>>
> 
> Please make sure to have results just with this change, no messing
> with folio sizes so that I have something for the patch submission.
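
For reference, dropping inode_lock around the seek looks safe because the only shared state the lockless path reads is the device inode size, and i_size_read() is meant to be usable without holding the inode lock. A minimal sketch of the resulting function under that assumption (the comment paraphrases include/linux/fs.h; on 32-bit SMP a seqcount protects the 64-bit load):

static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
{
	struct inode *bd_inode = bdev_file_inode(file);

	/*
	 * i_size_read() returns a consistent snapshot of i_size without
	 * inode_lock (a plain load on 64-bit, seqcount-protected on
	 * 32-bit SMP), so fixed_size_llseek() needs no extra
	 * serialization here.
	 */
	return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
}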

The contention on inode_lock is gone after your above change. The new
top-10 contention data looks like this now:

 contended   total wait     max wait     avg wait         type   caller

2441494015    172.15 h       1.72 ms    253.83 us     spinlock   folio_wait_bit_common+0xd5
                    0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                    0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
                    0xffffffffacdd1905  folio_wait_bit_common+0xd5
                    0xffffffffacdd2d0a  filemap_get_pages+0x68a
                    0xffffffffacdd2e73  filemap_read+0x103
                    0xffffffffad1d67ba  blkdev_read_iter+0x6a
                    0xffffffffacf06937  vfs_read+0x297
                    0xffffffffacf07653  ksys_read+0x73
  25269947      1.58 h       1.72 ms    225.44 us     spinlock   folio_wake_bit+0x62
                    0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                    0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                    0xffffffffacdcf322  folio_wake_bit+0x62
                    0xffffffffacdd2ca7  filemap_get_pages+0x627
                    0xffffffffacdd2e73  filemap_read+0x103
                    0xffffffffad1d67ba  blkdev_read_iter+0x6a
                    0xffffffffacf06937  vfs_read+0x297
                    0xffffffffacf07653  ksys_read+0x73
  44757761      1.05 h       1.55 ms     84.41 us     spinlock   folio_wake_bit+0x62
                    0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                    0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                    0xffffffffacdcf322  folio_wake_bit+0x62
                    0xffffffffacdcf7bc  folio_end_read+0x2c
                    0xffffffffacf6d4cf  mpage_read_end_io+0x6f
                    0xffffffffad1d8abb  bio_endio+0x12b
                    0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
                    0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb
  66419830     53.00 m     685.29 us     47.88 us     spinlock   __filemap_add_folio+0x14c
                    0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                    0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
                    0xffffffffacdd037c  __filemap_add_folio+0x14c
                    0xffffffffacdd072c  filemap_add_folio+0x9c
                    0xffffffffacde2a4c  page_cache_ra_unbounded+0x12c
                    0xffffffffacde31e0  page_cache_ra_order+0x340
                    0xffffffffacde33b8  page_cache_sync_ra+0x148
                    0xffffffffacdd27c5  filemap_get_pages+0x145
       532     45.51 m       5.49 s       5.13 s         mutex   bdev_release+0x69
                    0xffffffffadbef1de  __mutex_lock.constprop.0+0x17e
                    0xffffffffadbef863  __mutex_lock_slowpath+0x13
                    0xffffffffadbef8bb  mutex_lock+0x3b
                    0xffffffffad1d5249  bdev_release+0x69
                    0xffffffffad1d5921  blkdev_release+0x11
                    0xffffffffacf089f3  __fput+0xe3
                    0xffffffffacf08c9b  __fput_sync+0x1b
                    0xffffffffacefe8ed  __x64_sys_close+0x3d
 264989621     45.26 m     486.17 us     10.25 us     spinlock   raw_spin_rq_lock_nested+0x1d
                    0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                    0xffffffffadbf5c7f  _raw_spin_lock+0x3f
                    0xffffffffacb4f22d  raw_spin_rq_lock_nested+0x1d
                    0xffffffffacb5b601  _raw_spin_rq_lock_irqsave+0x21
                    0xffffffffacb6f81b  sched_balance_rq+0x60b
                    0xffffffffacb70784  sched_balance_newidle+0x1f4
                    0xffffffffacb70ae4  pick_next_task_fair+0x54
                    0xffffffffacb4e583  __pick_next_task+0x43
  30026842     26.05 m     652.85 us     52.05 us     spinlock   __folio_mark_dirty+0x2b
                    0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                    0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                    0xffffffffacde189b  __folio_mark_dirty+0x2b
                    0xffffffffacf67503  mark_buffer_dirty+0x53
                    0xffffffffacf676e2  __block_commit_write+0x82
                    0xffffffffacf685fc  block_write_end+0x3c
                    0xffffffffacfb577e  iomap_write_end+0x13e
                    0xffffffffacfb674c  iomap_file_buffered_write+0x29c
   5085820      7.07 m       1.24 ms     83.42 us     spinlock   folio_wake_bit+0x62
                    0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                    0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                    0xffffffffacdcf322  folio_wake_bit+0x62
                    0xffffffffacdcf7bc  folio_end_read+0x2c
                    0xffffffffacf6d4cf  mpage_read_end_io+0x6f
                    0xffffffffad1d8abb  bio_endio+0x12b
                    0xffffffffad1ee8ca  blk_update_request+0x1aa
                    0xffffffffad1ef7a4  blk_mq_end_request+0x24
   8027370      1.90 m      76.21 us     14.23 us     spinlock   tick_do_update_jiffies64+0x29
                    0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                    0xffffffffadbf5c7f  _raw_spin_lock+0x3f
                    0xffffffffacc21c69  tick_do_update_jiffies64+0x29
                    0xffffffffacc21e8c  tick_nohz_handler+0x13c
                    0xffffffffacc0a0cf  __hrtimer_run_queues+0x10f
                    0xffffffffacc0afe8  hrtimer_interrupt+0xf8
                    0xffffffffacaa8f36  __sysvec_apic_timer_interrupt+0x56
                    0xffffffffadbe3953  sysvec_apic_timer_interrupt+0x93
   4344269      1.08 m     600.67 us     14.90 us     spinlock   __filemap_add_folio+0x14c
                    0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                    0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
                    0xffffffffacdd037c  __filemap_add_folio+0x14c
                    0xffffffffacdd072c  filemap_add_folio+0x9c
                    0xffffffffacde2a4c  page_cache_ra_unbounded+0x12c
                    0xffffffffacde31e0  page_cache_ra_order+0x340
                    0xffffffffacde3615  page_cache_async_ra+0x165
                    0xffffffffacdd2b9c  filemap_get_pages+0x51c
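
The dominant entries above are the spinlock taken in folio_wait_bit_common()/folio_wake_bit(). That lock is not per-folio: waiters on a folio's lock/writeback bits and the completion path that wakes them hash into a small global table of wait queues, so many unrelated folios can share one wait_queue_head_t and hence one q->lock. A simplified sketch of the hashing (adapted from mm/filemap.c; assumption: names and constants match recent kernels):

#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)

static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	/*
	 * Many folios map to the same bucket, so heavy read traffic on a
	 * single device can funnel waiters and wakers onto a few bucket locks.
	 */
	return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}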

However, a point of concern is that FIO bandwidth drops drastically
(roughly 4.5-6x lower for both reads and writes) after the change.

		default				inode_lock-fix
rw=30%
Instance 1	r=55.7GiB/s,w=23.9GiB/s		r=9616MiB/s,w=4121MiB/s
Instance 2	r=38.5GiB/s,w=16.5GiB/s		r=8482MiB/s,w=3635MiB/s
Instance 3	r=37.5GiB/s,w=16.1GiB/s		r=8609MiB/s,w=3690MiB/s
Instance 4	r=37.4GiB/s,w=16.0GiB/s		r=8486MiB/s,w=3637MiB/s

Regards,
Bharata.

