lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 12 Jan 2021 21:45:00 +0800 (CST)
From:   Zhongwei Cai <sunrise_l@...u.edu.cn>
To:     Mingkai Dong <mingkaidong@...il.com>,
        Matthew Wilcox <willy@...radead.org>
Cc:     Mikulas Patocka <mpatocka@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Jan Kara <jack@...e.cz>,
        Steven Whitehouse <swhiteho@...hat.com>,
        Eric Sandeen <esandeen@...hat.com>,
        Dave Chinner <dchinner@...hat.com>,
        Theodore Ts'o <tytso@....edu>,
        Wang Jianchao <jianchao.wan9@...il.com>,
        "Tadakamadla, Rajesh" <rajesh.tadakamadla@....com>,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-nvdimm@...ts.01.org
Subject: Re: Expense of read_iter


I'm working with Mingkai on optimizations for Ext4-dax.
We think that optmizing the read-iter method cannot achieve the
same performance as the read method for Ext4-dax. 
We tried Mikulas's benchmark on Ext4-dax. The overall time and perf
results are listed below:

Overall time of 2^26 4KB read.

Method       Time
read         26.782s
read-iter    36.477s

Perf result, using the read_iter method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13379476464
#
# Overhead  Command  Shared Object     Symbol                                 
# ........  .......  ................  .......................................
#
    20.09%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.58%  pread    [kernel.vmlinux]  [k] iomap_apply
     6.01%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     4.85%  pread    libc-2.31.so      [.] __libc_pread
     3.61%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     3.25%  pread    [kernel.vmlinux]  [k] _raw_read_lock
     2.80%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     2.71%  pread    [ext4]            [k] ext4_es_lookup_extent
     2.71%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.63%  pread    [kernel.vmlinux]  [k] __srcu_read_unlock
     2.55%  pread    [kernel.vmlinux]  [k] new_sync_read
     2.39%  pread    [ext4]            [k] ext4_iomap_begin
     2.38%  pread    [kernel.vmlinux]  [k] vfs_read
     2.30%  pread    [kernel.vmlinux]  [k] dax_iomap_actor
     2.30%  pread    [kernel.vmlinux]  [k] __srcu_read_lock
     2.14%  pread    [ext4]            [k] ext4_inode_block_valid
     1.97%  pread    [kernel.vmlinux]  [k] _copy_mc_to_iter
     1.97%  pread    [ext4]            [k] ext4_map_blocks
     1.89%  pread    [kernel.vmlinux]  [k] down_read
     1.89%  pread    [kernel.vmlinux]  [k] up_read
     1.65%  pread    [ext4]            [k] ext4_file_read_iter
     1.48%  pread    [kernel.vmlinux]  [k] dax_iomap_rw
     1.48%  pread    [jbd2]            [k] jbd2_transaction_committed
     1.15%  pread    [nd_pmem]         [k] __pmem_direct_access
     1.15%  pread    [kernel.vmlinux]  [k] ksys_pread64
     1.15%  pread    [kernel.vmlinux]  [k] __fget_light
     1.15%  pread    [ext4]            [k] ext4_set_iomap
     1.07%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.82%  pread    pread             [.] main
     0.82%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.74%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.66%  pread    [kernel.vmlinux]  [k] __x86_indirect_thunk_rax
     0.66%  pread    [nd_pmem]         [k] 0x00000000000001d0
     0.59%  pread    [kernel.vmlinux]  [k] dax_direct_access
     0.58%  pread    [nd_pmem]         [k] 0x00000000000001de
     0.58%  pread    [kernel.vmlinux]  [k] bdev_dax_pgoff
     0.49%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.49%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.41%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.33%  pread    [nd_pmem]         [k] 0x0000000000001083
     0.33%  pread    [kernel.vmlinux]  [k] dax_get_private
     0.33%  pread    [kernel.vmlinux]  [k] timestamp_truncate
     0.33%  pread    [kernel.vmlinux]  [k] percpu_counter_add_batch
     0.33%  pread    [kernel.vmlinux]  [k] copyout_mc
     0.33%  pread    [ext4]            [k] __check_block_validity.constprop.80
     0.33%  pread    [kernel.vmlinux]  [k] touch_atime
     0.25%  pread    [nd_pmem]         [k] 0x000000000000107f
     0.25%  pread    [kernel.vmlinux]  [k] rw_verify_area
     0.25%  pread    [ext4]            [k] ext4_iomap_end
     0.25%  pread    [kernel.vmlinux]  [k] _cond_resched
     0.25%  pread    [kernel.vmlinux]  [k] rcu_all_qs
     0.16%  pread    [kernel.vmlinux]  [k] __fdget
     0.16%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.16%  pread    [kernel.vmlinux]  [k] iov_iter_init
     0.16%  pread    [kernel.vmlinux]  [k] current_time
     0.16%  pread    [nd_pmem]         [k] 0x0000000000001075
     0.16%  pread    [ext4]            [k] ext4_inode_datasync_dirty
     0.16%  pread    [kernel.vmlinux]  [k] copy_mc_to_user
     0.08%  pread    pread             [.] pread@plt
     0.08%  pread    [kernel.vmlinux]  [k] __x86_indirect_thunk_r11
     0.08%  pread    [kernel.vmlinux]  [k] security_file_permission
     0.08%  pread    [kernel.vmlinux]  [k] dax_read_unlock
     0.08%  pread    [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
     0.08%  pread    [nd_pmem]         [k] 0x000000000000108f
     0.08%  pread    [nd_pmem]         [k] 0x0000000000001095
     0.08%  pread    [kernel.vmlinux]  [k] rcu_read_unlock_strict
     0.00%  pread    [kernel.vmlinux]  [k] native_write_msr


#
# (Tip: Show current config key-value pairs: perf config --list)
#

Perf result, using the read method we added for Ext4-dax:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13364755903
#
# Overhead  Command  Shared Object     Symbol                                 
# ........  .......  ................  .......................................
#
    28.65%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     7.99%  pread    [ext4]            [k] ext4_dax_read
     6.50%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     5.43%  pread    libc-2.31.so      [.] __libc_pread
     4.45%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     4.20%  pread    [kernel.vmlinux]  [k] down_read
     3.38%  pread    [kernel.vmlinux]  [k] _raw_read_lock
     3.13%  pread    [ext4]            [k] ext4_es_lookup_extent
     3.05%  pread    [kernel.vmlinux]  [k] __srcu_read_lock
     2.72%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.55%  pread    [kernel.vmlinux]  [k] __srcu_read_unlock
     2.47%  pread    [kernel.vmlinux]  [k] vfs_read
     2.31%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     1.89%  pread    [kernel.vmlinux]  [k] up_read
     1.73%  pread    [ext4]            [k] ext4_map_blocks
     1.65%  pread    pread             [.] main
     1.56%  pread    [kernel.vmlinux]  [k] __fget_light
     1.48%  pread    [ext4]            [k] ext4_inode_block_valid
     1.34%  pread    [kernel.vmlinux]  [k] ksys_pread64
     1.23%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     1.08%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     1.07%  pread    [nd_pmem]         [k] __pmem_direct_access
     0.99%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.91%  pread    [kernel.vmlinux]  [k] security_file_permission
     0.91%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.66%  pread    [kernel.vmlinux]  [k] timestamp_truncate
     0.58%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.49%  pread    pread             [.] pread@plt
     0.41%  pread    [kernel.vmlinux]  [k] current_time
     0.41%  pread    [kernel.vmlinux]  [k] dax_direct_access
     0.41%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.41%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.41%  pread    [kernel.vmlinux]  [k] percpu_counter_add_batch
     0.33%  pread    [kernel.vmlinux]  [k] touch_atime
     0.33%  pread    [ext4]            [k] __check_block_validity.constprop.80
     0.33%  pread    [kernel.vmlinux]  [k] copy_mc_to_user
     0.25%  pread    [kernel.vmlinux]  [k] dax_get_private
     0.25%  pread    [kernel.vmlinux]  [k] rcu_all_qs
     0.25%  pread    [nd_pmem]         [k] 0x0000000000001095
     0.16%  pread    [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
     0.16%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.16%  pread    [nd_pmem]         [k] 0x0000000000001083
     0.16%  pread    [kernel.vmlinux]  [k] rw_verify_area
     0.16%  pread    [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
     0.16%  pread    [kernel.vmlinux]  [k] __fdget
     0.16%  pread    [kernel.vmlinux]  [k] dax_read_lock
     0.16%  pread    [kernel.vmlinux]  [k] __x86_indirect_thunk_rax
     0.08%  pread    [kernel.vmlinux]  [k] rcu_read_unlock_strict
     0.08%  pread    [kernel.vmlinux]  [k] dax_read_unlock
     0.08%  pread    [kernel.vmlinux]  [k] update_irq_load_avg
     0.08%  pread    [nd_pmem]         [k] 0x000000000000109d
     0.08%  pread    [nd_pmem]         [k] 0x000000000000107a
     0.08%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     0.00%  pread    [kernel.vmlinux]  [k] native_write_msr


#
# (Tip: Sample related events with: perf record -e '{cycles,instructions}:S')
#

Note that the overall time of read method is 73.42% of the read-iter method.
If we sum up the percentage of read-iter specific functions (including
ext4_file_read_iter, iomap_apply, dax_iomap_actor, _copy_mc_to_iter,
ext4_iomap_begin, jbd2_transaction_committed, new_sync_read, dax_iomap_rw,
ext4_set_iomap, ext4_iomap_end and iov_iter_init), we will get 20.81%.
In the second trace, ext4_dax_read only consumes 7.99%, which can replace
all these functions.

The overhead mainly consists of two parts. The first is constructing
struct iov_iter and iterating it (i.e., new_sync, _copy_mc_to_iter and
iov_iter_init). The second is the dax io mechanism provided by VFS (i.e.,
dax_iomap_rw, iomap_apply and ext4_iomap_begin).

There could be two approaches to optimizing: 1) implementing the read method
without the complexity of iterators and dax_iomap_rw; 2) optimizing both
iterators and how dax_iomap_rw works. Since dax_iomap_rw requires
ext4_iomap_begin, which further involves the iomap structure and others
(e.g., journaling status locks in Ext4), we think implementing the read
method would be easier.

Thanks,
Zhongwei

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ