Message-Id: <20200430053906.7171AA405C@b06wcsmtp001.portsmouth.uk.ibm.com>
Date: Thu, 30 Apr 2020 11:09:04 +0530
From: Ritesh Harjani <riteshh@...ux.ibm.com>
To: "Paul E. McKenney" <paulmck@...nel.org>, rcu@...r.kernel.org
Cc: Ext4 Developers List <linux-ext4@...r.kernel.org>,
Jan Kara <jack@...e.cz>, "Theodore Ts'o" <tytso@....edu>
Subject: Re: Discussion about rcu_barrier() in ext4
Hi Paul,
Sorry about this email. While rebasing in my test environment, I think a
fix on top of the patch mentioned below went missing, which could possibly
cause an indefinite loop. I am looking into it currently.
-ritesh
On 4/30/20 10:11 AM, Ritesh Harjani wrote:
> Hello Paul,
>
> I added an rcu_barrier() in one of the ext4 paths, to ensure
> that once rcu_barrier() completes, all previously queued call_rcu()
> callbacks have completed too. I think I am seeing the deadlock below
> while running one of the ext4 fstests.
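>
> (To state my assumption explicitly: in mballoc the PAs are freed via
> call_rcu(&pa->u.pa_rcu, ext4_mb_pa_callback), and the ordering I am
> relying on is roughly the sketch below. Everything around the two RCU
> calls is heavily simplified.)
>
>     /* discard path: defer freeing of the PA structure to RCU */
>     call_rcu(&pa->u.pa_rcu, ext4_mb_pa_callback);
>
>     /* retry path: by the time this returns, every callback queued by
>      * call_rcu() before this point (on any CPU) has finished running.
>      */
>     rcu_barrier();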
>
> Could you please help me understand what exactly is causing the deadlock
> here? Is it because khungtaskd is inside rcu_read_lock(), which would
> have disabled preemption, and that is why the rcu_barrier() path is
> stuck? But then why can't khungtaskd proceed?
>
> Also, when is it safe (and when is it not safe) to call rcu_barrier()?
>
> Appreciate your inputs on this one!!
>
> -ritesh
>
> <Call Stack>
> =============
>
> [ 1476.770791] Not tainted 5.6.0-rc4+ #47
> [ 1476.771723] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 1476.773814] kworker/u32:3 D12136 9612 2 0x80004000
> [ 1476.775299] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
> [ 1476.776904] Call Trace:
> [ 1476.777640] ? __schedule+0x35b/0x8b0
> [ 1476.778687] ? rwsem_down_write_slowpath+0x405/0x650
> [ 1476.780010] schedule+0x55/0xd0
> [ 1476.780763] rwsem_down_write_slowpath+0x40a/0x650
> [ 1476.781867] ? down_write+0x10c/0x110
> [ 1476.782687] down_write+0x10c/0x110
> [ 1476.783471] ext4_map_blocks+0xd4/0x640
> [ 1476.784367] ext4_convert_unwritten_extents+0x10f/0x210
> [ 1476.785493] ext4_convert_unwritten_io_end_vec+0x5f/0xd0
> [ 1476.786710] ext4_end_io_rsv_work+0xe6/0x190
> [ 1476.787659] process_one_work+0x24e/0x5a0
> [ 1476.788568] worker_thread+0x3c/0x390
> [ 1476.789396] ? process_one_work+0x5a0/0x5a0
> [ 1476.790312] kthread+0x120/0x140
> [ 1476.790958] ? kthread_park+0x80/0x80
> [ 1476.791750] ret_from_fork+0x3a/0x50
> [ 1476.792578]
> [ 1476.792578] Showing all locks held in the system:
> [ 1476.793885] 1 lock held by khungtaskd/92:
> [ 1476.794744] #0: ffffffff82a7af20 (rcu_read_lock){....}, at:
> debug_show_all_locks+0x15/0x17e
> [ 1476.796485] 6 locks held by kworker/u32:2/194:
> [ 1476.797424] #0: ffff888a0521d548 ((wq_completion)writeback){+.+.},
> at: process_one_work+0x1ce/0x5a0
> [ 1476.799293] #1: ffffc9000185be68
> ((work_completion)(&(&wb->dwork)->work)){+.+.}, at:
> process_one_work+0x1ce/0x5a0
> [ 1476.801386] #2: ffff8889c6f31af0 (&sbi->s_writepages_rwsem){++++},
> at: do_writepages+0x41/0xe0
> [ 1476.803178] #3: ffff8889c46cd838 (jbd2_handle){++++}, at:
> start_this_handle+0x1ac/0x680
> [ 1476.804853] #4: ffff8889f2101198 (&ei->i_data_sem){++++}, at:
> ext4_map_blocks+0xd4/0x640
> [ 1476.806535] #5: ffffffff82a7ea78 (rcu_state.barrier_mutex){+.+.},
> at: rcu_barrier+0x59/0x760
> [ 1476.808296] 1 lock held by in:imklog/650:
> [ 1476.809162] 4 locks held by kworker/u32:3/9612:
> [ 1476.810159] #0: ffff88899e7f6548
> ((wq_completion)ext4-rsv-conversion#2){+.+.}, at:
> process_one_work+0x1ce/0x5a0
> [ 1476.812227] #1: ffffc9000dd4be68
> ((work_completion)(&ei->i_rsv_conversion_work)){+.+.}, at:
> process_one_work+0x1ce/0x5a0
> [ 1476.814432] #2: ffff8889c46cd838 (jbd2_handle){++++}, at:
> start_this_handle+0x1ac/0x680
> [ 1476.816085] #3: ffff8889f2101198 (&ei->i_data_sem){++++}, at:
> ext4_map_blocks+0xd4/0x640
> [ 1476.817799] 3 locks held by dd/17439:
> [ 1476.818590] #0: ffff8889be355490 (sb_writers#3){.+.+}, at:
> vfs_write+0x177/0x1d0
> [ 1476.820132] #1: ffff8889f21013d0
> (&sb->s_type->i_mutex_key#10){+.+.}, at:
> ext4_buffered_write_iter+0x33/0x120
> [ 1476.822183] #2: ffff8889be3550e8 (&type->s_umount_key#35){++++}, at:
> try_to_writeback_inodes_sb+0x1b/0x50
> [ 1476.824142] 2 locks held by debugfs/17530:
> [ 1476.824962] #0: ffff8889e3f108a0 (&tty->ldisc_sem){++++}, at:
> tty_ldisc_ref_wait+0x24/0x50
> [ 1476.826676] #1: ffffc900004672f0 (&ldata->atomic_read_lock){+.+.},
> at: n_tty_read+0xda/0x9d0
> [ 1476.828455]
> [ 1476.828773] =============================================
> [ 1476.828773]
>
>
> Below is the patch I added, which started causing this.
> ===========================================================
>
> ---
> Subject: [PATCH 1/1] ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid ENOSPC err
>
> There is a possible race in ext4_mb_discard_group_preallocations()
> where the 1st thread may iterate through the group's bb_prealloc_list,
> remove all the PAs, and add them to the function's local list head.
> Now if the 2nd thread comes in to discard the group preallocations,
> it will see that group->bb_prealloc_list is empty and will return 0.
>
> Consider a case where we have only a few groups (e.g. just group 0):
> this may even make ext4_mb_new_blocks() (which calls
> ext4_mb_discard_group_preallocations()) return an -ENOSPC error.
> But that is wrong, since the 2nd thread should have waited for the 1st
> thread to release all the PAs and then retried the allocation,
> given that the 1st thread was going to discard those PAs anyway.
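>
> To spell out the interleaving I am worried about (simplified, and the
> two-column layout is just for illustration; the real function is
> ext4_mb_discard_group_preallocations()):
>
>     Thread-1 (discard)                     Thread-2 (discard)
>     ext4_lock_group(sb, group)
>     /* move all PAs from
>        grp->bb_prealloc_list onto a
>        local list */
>     ext4_unlock_group(sb, group)
>                                            ext4_lock_group(sb, group)
>                                            /* bb_prealloc_list is empty,
>                                               so freed = 0 */
>                                            ext4_unlock_group(sb, group)
>     /* release the PA blocks, then
>        free the PAs via call_rcu() */
>                                            ext4_mb_new_blocks() sees
>                                            freed == 0 -> returns -ENOSPC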
>
> The algorithm using this percpu seq counter is as below (a condensed
> sketch follows the list):
> 1. We sample the percpu discard_pa_seq counter before trying for block
>    allocation in ext4_mb_new_blocks().
> 2. We increment this percpu discard_pa_seq counter when we succeed in
>    allocating blocks and hence while adding the remaining blocks in the
>    group's bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
> 3. We also increment this percpu seq counter when we successfully identify
>    that the bb_prealloc_list is not empty and hence proceed with discarding
>    those PAs inside ext4_mb_discard_group_preallocations().
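>
> (Condensing steps 1-3 above plus the retry decision into pseudo-code;
> this just restates the patch below, with error handling and the rest of
> ext4_mb_new_blocks() omitted:)
>
>     /* 1. sample the local cpu's counter before trying the allocation */
>     seq = *this_cpu_ptr(&discard_pa_seq);
>
>     /* 2./3. the PA-add paths and the discard path both do
>      *       this_cpu_inc(discard_pa_seq) under the group lock
>      */
>
>     /* allocation failed: discard PAs and decide whether to retry */
>     freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
>     if (freed)
>         goto repeat;
>     rcu_barrier();
>     if (ext4_get_discard_pa_seq_sum() != seq)
>         goto repeat;   /* PAs were added or discarded meanwhile, retry */
>     *errp = -ENOSPC;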
>
> Now to make sure that the regular fast path of block allocation is not
> affected, as a small optimization we only sample the percpu seq counter
> on the local cpu. Only when the block allocation fails and no freed
> blocks were found do we sum the percpu seq counter across all cpus,
> using the below function ext4_get_discard_pa_seq_sum(). This happens
> after making sure that all the PAs on grp->bb_prealloc_list got freed.
>
> Note: Another option was to check grp->bb_free the next time around
> if we could not discard anything. One concern with that was that if
> grp->bb_free is non-zero for some reason (not sure), we may end up in
> that loop indefinitely while still not being able to satisfy any
> request. Second, to check grp->bb_free all threads would have to wait
> on the group's spin_lock, which might result in increased cpu usage.
> So this seems to be a better & faster approach compared to that.
>
> Signed-off-by: Ritesh Harjani <riteshh@...ux.ibm.com>
> ---
> fs/ext4/mballoc.c | 56 ++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 53 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index fe6528716f14..47ff3704d2fc 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -357,6 +357,34 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
> static void ext4_mb_generate_from_freelist(struct super_block *sb,
> void *bitmap,
> ext4_group_t group);
>
> +/*
> + * The algorithm using this percpu seq counter is as below:
> + * 1. We sample the percpu discard_pa_seq counter before trying for block
> + *    allocation in ext4_mb_new_blocks().
> + * 2. We increment this percpu discard_pa_seq counter when we succeed in
> + *    allocating blocks and hence while adding the remaining blocks in the
> + *    group's bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
> + * 3. We also increment this percpu seq counter when we successfully identify
> + *    that the bb_prealloc_list is not empty and hence proceed with discarding
> + *    those PAs inside ext4_mb_discard_group_preallocations().
> + *
> + * Now to make sure that the regular fast path of block allocation is not
> + * affected, as a small optimization we only sample the percpu seq counter
> + * on the local cpu. Only when the block allocation fails and no freed
> + * blocks were found do we sum the percpu seq counter across all cpus,
> + * using the below function ext4_get_discard_pa_seq_sum(). This happens
> + * after making sure that all the PAs on grp->bb_prealloc_list got freed.
> + */
> +DEFINE_PER_CPU(u64, discard_pa_seq);
> +static inline u64 ext4_get_discard_pa_seq_sum(void)
> +{
> + int __cpu;
> + u64 __seq = 0;
> + for_each_possible_cpu(__cpu)
> + __seq += per_cpu(discard_pa_seq, __cpu);
> + return __seq;
> +}
> +
> static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
> {
> #if BITS_PER_LONG == 64
> @@ -3721,6 +3749,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
> pa->pa_inode = ac->ac_inode;
>
> ext4_lock_group(sb, ac->ac_b_ex.fe_group);
> + this_cpu_inc(discard_pa_seq);
> list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
> ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
>
> @@ -3782,6 +3811,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
> pa->pa_inode = NULL;
>
> ext4_lock_group(sb, ac->ac_b_ex.fe_group);
> + this_cpu_inc(discard_pa_seq);
> list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
> ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
>
> @@ -3934,6 +3964,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> INIT_LIST_HEAD(&list);
> repeat:
> ext4_lock_group(sb, group);
> + this_cpu_inc(discard_pa_seq);
> list_for_each_entry_safe(pa, tmp,
> &grp->bb_prealloc_list, pa_group_list) {
> spin_lock(&pa->pa_lock);
> @@ -4481,13 +4512,30 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
> }
>
> static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
> - struct ext4_allocation_context *ac)
> + struct ext4_allocation_context *ac, u64 seq)
> {
> int freed;
> + u64 seq_retry;
> +
> freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
> if (freed)
> - return true;
> + goto out_retry;
> + /*
> + * Unless it is ensured that PAs are actually freed, we may hit
> + * an ENOSPC error since the next time seq may match while the PA
> + * blocks are still getting freed in ext4_mb_release_inode/group_pa().
> + * So, rcu_barrier() here is to make sure that any call_rcu() queued in
> + * ext4_mb_discard_group_preallocations() is completed before we
> + * proceed further to retry the block allocation.
> + */
> + rcu_barrier();
> + seq_retry = ext4_get_discard_pa_seq_sum();
> + if (seq_retry != seq)
> + goto out_retry;
> +
> return false;
> +out_retry:
> + return true;
> }
>
> /*
> @@ -4504,6 +4552,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> ext4_fsblk_t block = 0;
> unsigned int inquota = 0;
> unsigned int reserv_clstrs = 0;
> + u64 seq;
>
> might_sleep();
> sb = ar->inode->i_sb;
> @@ -4565,6 +4614,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> }
>
> ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
> + seq = *this_cpu_ptr(&discard_pa_seq);
> if (!ext4_mb_use_preallocated(ac)) {
> ac->ac_op = EXT4_MB_HISTORY_ALLOC;
> ext4_mb_normalize_request(ac, ar);
> @@ -4596,7 +4646,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> ar->len = ac->ac_b_ex.fe_len;
> }
> } else {
> - if (ext4_mb_discard_preallocations_should_retry(sb, ac))
> + if (ext4_mb_discard_preallocations_should_retry(sb, ac, seq))
> goto repeat;
> *errp = -ENOSPC;
> }