lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 14 Aug 2020 11:52:49 +0530
From:   Ritesh Harjani <ritesh.list@...il.com>
To:     brookxu <brookxu.cn@...il.com>,
        Ritesh Harjani <riteshh@...ux.ibm.com>
Cc:     adilger.kernel@...ger.ca, tytso@....edu, linux-ext4@...r.kernel.org
Subject: Re: [PATCH v4 2/2] ext4: limit the length of per-inode prealloc list



On 8/14/20 6:41 AM, brookxu wrote:
> Thanks for review this patch :)
> 
> Ritesh Harjani wrote on 2020/8/13 21:52:
>> Hi brookxu,
>>
>> Few queries
>>
>> On 8/7/20 2:14 PM, brookxu wrote:
>>> In the scenario of writing sparse files, the Per-inode prealloc list may
>>> be very long, resulting in high overhead for ext4_mb_use_preallocated().
>>> To circumvent this problem, we limit the maximum length of per-inode
>>> prealloc list to 512 and allow users to modify it.
>>>
>>> After patching, we observed that the sys ratio of cpu has dropped, and
>>> the system throughput has increased significantly. We created a process
>>> to write the sparse file, and the running time of the process on the
>>> fixed kernel was significantly reduced, as follows:
>>>
>>> Running time on unfixed kernel:
>>> [root@...CENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
>>> real    0m2.051s
>>> user    0m0.008s
>>> sys     0m2.026s
>>>
>>> Running time on fixed kernel:
>>> [root@...CENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
>>> real    0m0.471s
>>> user    0m0.004s
>>> sys     0m0.395s
>>
>>
>> Please use "--thread" option in your git-format-patch script. So that
>> your patch series comes in a threaded form. It is easier then to apply
>> the series.
>>
>> Somehow I am unable to get my head around how this patch is working.
>> Could you please help explain the theory of operation by answering below
>> queries.
>>
>>>
>>> Signed-off-by: Chunguang Xu <brookxu@...cent.com>
>>> ---
>>>    fs/ext4/ext4.h              |  3 ++-
>>>    fs/ext4/extents.c           | 10 ++++----
>>>    fs/ext4/file.c              |  2 +-
>>>    fs/ext4/indirect.c          |  2 +-
>>>    fs/ext4/inode.c             |  6 ++---
>>>    fs/ext4/ioctl.c             |  2 +-
>>>    fs/ext4/mballoc.c           | 57 +++++++++++++++++++++++++++++++++++++++++----
>>>    fs/ext4/mballoc.h           |  4 ++++
>>>    fs/ext4/move_extent.c       |  4 ++--
>>>    fs/ext4/super.c             |  2 +-
>>>    fs/ext4/sysfs.c             |  2 ++
>>>    include/trace/events/ext4.h | 14 ++++++-----
>>>    12 files changed, 83 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>> index 42f5060..68e0ebe 100644
>>> --- a/fs/ext4/ext4.h
>>> +++ b/fs/ext4/ext4.h
>>> @@ -1501,6 +1501,7 @@ struct ext4_sb_info {
>>>        unsigned int s_mb_stats;
>>>        unsigned int s_mb_order2_reqs;
>>>        unsigned int s_mb_group_prealloc;
>>> +    unsigned int s_mb_max_inode_prealloc;
>>>        unsigned int s_max_dir_size_kb;
>>>        /* where last allocation was done - for stream allocation */
>>>        unsigned long s_mb_last_group;
>>> @@ -2651,7 +2652,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
>>>    extern ext4_fsblk_t ext4_mb_new_blocks(handle_t *,
>>>                    struct ext4_allocation_request *, int *);
>>>    extern int ext4_mb_reserve_blocks(struct super_block *, int);
>>> -extern void ext4_discard_preallocations(struct inode *);
>>> +extern void ext4_discard_preallocations(struct inode *, unsigned int);
>>>    extern int __init ext4_init_mballoc(void);
>>>    extern void ext4_exit_mballoc(void);
>>>    extern void ext4_free_blocks(handle_t *handle, struct inode *inode,
>>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>>> index 221f240..a40f928 100644
>>> --- a/fs/ext4/extents.c
>>> +++ b/fs/ext4/extents.c
>>> @@ -100,7 +100,7 @@ static int ext4_ext_trunc_restart_fn(struct inode *inode, int *dropped)
>>>         * i_mutex. So we can safely drop the i_data_sem here.
>>>         */
>>>        BUG_ON(EXT4_JOURNAL(inode) == NULL);
>>> -    ext4_discard_preallocations(inode);
>>> +    ext4_discard_preallocations(inode, 0);
>>>        up_write(&EXT4_I(inode)->i_data_sem);
>>>        *dropped = 1;
>>>        return 0;
>>> @@ -4272,7 +4272,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
>>>                 * not a good idea to call discard here directly,
>>>                 * but otherwise we'd need to call it every free().
>>>                 */
>>> -            ext4_discard_preallocations(inode);
>>> +            ext4_discard_preallocations(inode, 0);
>>>                if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
>>>                    fb_flags = EXT4_FREE_BLOCKS_NO_QUOT_UPDATE;
>>>                ext4_free_blocks(handle, inode, NULL, newblock,
>>> @@ -5299,7 +5299,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
>>>        }
>>>
>>>        down_write(&EXT4_I(inode)->i_data_sem);
>>> -    ext4_discard_preallocations(inode);
>>> +    ext4_discard_preallocations(inode, 0);
>>>
>>>        ret = ext4_es_remove_extent(inode, punch_start,
>>>                        EXT_MAX_BLOCKS - punch_start);
>>> @@ -5313,7 +5313,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
>>>            up_write(&EXT4_I(inode)->i_data_sem);
>>>            goto out_stop;
>>>        }
>>> -    ext4_discard_preallocations(inode);
>>> +    ext4_discard_preallocations(inode, 0);
>>>
>>>        ret = ext4_ext_shift_extents(inode, handle, punch_stop,
>>>                         punch_stop - punch_start, SHIFT_LEFT);
>>> @@ -5445,7 +5445,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
>>>            goto out_stop;
>>>
>>>        down_write(&EXT4_I(inode)->i_data_sem);
>>> -    ext4_discard_preallocations(inode);
>>> +    ext4_discard_preallocations(inode, 0);
>>>
>>>        path = ext4_find_extent(inode, offset_lblk, NULL, 0);
>>>        if (IS_ERR(path)) {
>>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>>> index 2a01e31..e3ab8ea 100644
>>> --- a/fs/ext4/file.c
>>> +++ b/fs/ext4/file.c
>>> @@ -148,7 +148,7 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
>>>                    !EXT4_I(inode)->i_reserved_data_blocks)
>>>        {
>>>            down_write(&EXT4_I(inode)->i_data_sem);
>>> -        ext4_discard_preallocations(inode);
>>> +        ext4_discard_preallocations(inode, 0);
>>>            up_write(&EXT4_I(inode)->i_data_sem);
>>>        }
>>>        if (is_dx(inode) && filp->private_data)
>>> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
>>> index be2b66e..ec6b930 100644
>>> --- a/fs/ext4/indirect.c
>>> +++ b/fs/ext4/indirect.c
>>> @@ -696,7 +696,7 @@ static int ext4_ind_trunc_restart_fn(handle_t *handle, struct inode *inode,
>>>         * i_mutex. So we can safely drop the i_data_sem here.
>>>         */
>>>        BUG_ON(EXT4_JOURNAL(inode) == NULL);
>>> -    ext4_discard_preallocations(inode);
>>> +    ext4_discard_preallocations(inode, 0);
>>>        up_write(&EXT4_I(inode)->i_data_sem);
>>>        *dropped = 1;
>>>        return 0;
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index 10dd470..bb9e1cd 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -383,7 +383,7 @@ void ext4_da_update_reserve_space(struct inode *inode,
>>>         */
>>>        if ((ei->i_reserved_data_blocks == 0) &&
>>>            !inode_is_open_for_write(inode))
>>> -        ext4_discard_preallocations(inode);
>>> +        ext4_discard_preallocations(inode, 0);
>>>    }
>>>
>>>    static int __check_block_validity(struct inode *inode, const char *func,
>>> @@ -4056,7 +4056,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
>>>        if (stop_block > first_block) {
>>>
>>>            down_write(&EXT4_I(inode)->i_data_sem);
>>> -        ext4_discard_preallocations(inode);
>>> +        ext4_discard_preallocations(inode, 0);
>>>
>>>            ret = ext4_es_remove_extent(inode, first_block,
>>>                            stop_block - first_block);
>>> @@ -4211,7 +4211,7 @@ int ext4_truncate(struct inode *inode)
>>>
>>>        down_write(&EXT4_I(inode)->i_data_sem);
>>>
>>> -    ext4_discard_preallocations(inode);
>>> +    ext4_discard_preallocations(inode, 0);
>>>
>>>        if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>>>            err = ext4_ext_truncate(handle, inode);
>>> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
>>> index 999cf6a..a5fcc23 100644
>>> --- a/fs/ext4/ioctl.c
>>> +++ b/fs/ext4/ioctl.c
>>> @@ -202,7 +202,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
>>>        reset_inode_seed(inode);
>>>        reset_inode_seed(inode_bl);
>>>
>>> -    ext4_discard_preallocations(inode);
>>> +    ext4_discard_preallocations(inode, 0);
>>>
>>>        err = ext4_mark_inode_dirty(handle, inode);
>>>        if (err < 0) {
>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>> index 4f21f34..28a139f 100644
>>> --- a/fs/ext4/mballoc.c
>>> +++ b/fs/ext4/mballoc.c
>>> @@ -2736,6 +2736,7 @@ int ext4_mb_init(struct super_block *sb)
>>>        sbi->s_mb_stats = MB_DEFAULT_STATS;
>>>        sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
>>>        sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
>>> +    sbi->s_mb_max_inode_prealloc = MB_DEFAULT_MAX_INODE_PREALLOC;
>>>        /*
>>>         * The default group preallocation is 512, which for 4k block
>>>         * sizes translates to 2 megabytes.  However for bigalloc file
>>> @@ -4103,7 +4104,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>>>     *
>>>     * FIXME!! Make sure it is valid at all the call sites
>>>     */
>>> -void ext4_discard_preallocations(struct inode *inode)
>>> +void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
>>>    {
>>>        struct ext4_inode_info *ei = EXT4_I(inode);
>>>        struct super_block *sb = inode->i_sb;
>>> @@ -4121,15 +4122,18 @@ void ext4_discard_preallocations(struct inode *inode)
>>>
>>>        mb_debug(sb, "discard preallocation for inode %lu\n",
>>>             inode->i_ino);

Can we please also add "needed" argument in mb_debug() msg here.

With that feel free to add:
Reviewed-by: Ritesh Harjani <riteshh@...ux.ibm.com>


>>> -    trace_ext4_discard_preallocations(inode);
>>> +    trace_ext4_discard_preallocations(inode,  needed);
>>>
>>>        INIT_LIST_HEAD(&list);
>>>
>>> +    if (needed == 0)
>>> +        needed = UINT_MAX;
>>> +
>>>    repeat:
>>>        /* first, collect all pa's in the inode */
>>>        spin_lock(&ei->i_prealloc_lock);
>>> -    while (!list_empty(&ei->i_prealloc_list)) {
>>> -        pa = list_entry(ei->i_prealloc_list.next,
>>> +    while (!list_empty(&ei->i_prealloc_list) && needed) {
>>> +        pa = list_entry(ei->i_prealloc_list.prev,
>>>                    struct ext4_prealloc_space, pa_inode_list);
>>>            BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
>>>            spin_lock(&pa->pa_lock);
>>> @@ -4150,6 +4154,7 @@ void ext4_discard_preallocations(struct inode *inode)
>>>                spin_unlock(&pa->pa_lock);
>>>                list_del_rcu(&pa->pa_inode_list);
>>>                list_add(&pa->u.pa_tmp_list, &list);
>>> +            needed--;
>>>                continue;
>>>            }
>>>
>>> @@ -4549,10 +4554,42 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
>>>    }
>>>
>>>    /*
>>> + * if per-inode prealloc list is too long, trim some PA
>>> + */
>>> +static void
>>> +ext4_mb_trim_inode_pa(struct inode *inode)
>>> +{
>>> +    struct ext4_inode_info *ei = EXT4_I(inode);
>>> +    struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>>> +    struct ext4_prealloc_space *pa;
>>> +    int count = 0, delta;
>>> +
>>> +    rcu_read_lock();
>>> +    list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
>>> +        spin_lock(&pa->pa_lock);
>>> +        if (pa->pa_deleted) {
>>> +            spin_unlock(&pa->pa_lock);
>>> +            continue;
>>> +        }
>>> +        count++;
>>> +        spin_unlock(&pa->pa_lock);
>>> +    }
>>> +    rcu_read_unlock();
>>> +
>>> +    delta = (sbi->s_mb_max_inode_prealloc >> 2) + 1;
>>> +    if (count > sbi->s_mb_max_inode_prealloc + delta) {
>>> +        count -= sbi->s_mb_max_inode_prealloc;
>>> +        ext4_discard_preallocations(inode, count);
>>> +    }
>>> +}
>>> +
>>> +/*
>>>     * release all resource we used in allocation
>>>     */
>>>    static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>>>    {
>>> +    struct inode *inode = ac->ac_inode;
>>> +    struct ext4_inode_info *ei = EXT4_I(inode);
>>>        struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>>>        struct ext4_prealloc_space *pa = ac->ac_pa;
>>>        if (pa) {
>>> @@ -4578,6 +4615,17 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>>>                    ext4_mb_add_n_trim(ac);
>>>                }
>>>            }
>>> +
>>> +        if (pa->pa_type == MB_INODE_PA) {
>>> +            /*
>>> +             * treat per-inode prealloc list as a lru list, then try
>>> +             * to trim the least recently used PA.
>>> +             */
>>> +            spin_lock(pa->pa_obj_lock);
>>> +            list_move(&ei->i_prealloc_list, &pa->pa_inode_list);
>>
>> So what you are doing here is essentially moving back ei->i_prealloc_list back into pa->pa_inode_lis
> 
> The operation here uses LRU to maintain the prealloc list, so we can trim the least recently used PA
> from the tail of the list. According to the principle of locality, this may reduce the impact on the
> program.
> 
>> Then in ext4_mb_trim_inode_pa(), it will traverse over i_prealloc_list
>> which is empty anyways no? (considering single threaded write here).
>> I am definitely missing something, could you help explain how is it
>> getting trimmed then?
> 

Sorry, I had got confused here with something else. This looks fine to 
me now :)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ