Date:   Tue, 21 Apr 2020 09:54:57 +0530
From:   Ritesh Harjani <riteshh@...ux.ibm.com>
To:     Jan Kara <jack@...e.cz>
Cc:     linux-ext4@...r.kernel.org, tytso@....edu,
        adilger.kernel@...ger.ca, linux-fsdevel@...r.kernel.org,
        aneesh.kumar@...ux.ibm.com, sandeen@...deen.net
Subject: Re: [RFCv2 1/1] ext4: Fix race in ext4 mb discard group
 preallocations

Hello Jan,

Thanks for your email.

On 4/20/20 8:08 PM, Jan Kara wrote:
> On Wed 15-04-20 22:53:01, Ritesh Harjani wrote:
>> There is a race in ext4_mb_discard_group_preallocations() where the
>> 1st thread may iterate through the group's bb_prealloc_list, remove
>> all the PAs and add them to the function's local list head.
>> If the 2nd thread then comes in to discard the group preallocations,
>> it will see that group->bb_prealloc_list is empty and will return 0.
>>
>> In a case with only a few groups (e.g. just group 0), this may even
>> cause ext4_mb_new_blocks() (which calls
>> ext4_mb_discard_group_preallocations()) to return -ENOSPC.
>> That is wrong: since the 1st thread was going to discard the PAs
>> anyway, the 2nd thread should have waited for it to release them and
>> then retried the allocation.
>>
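(As an aside, a minimal userspace sketch of the list-splice pattern that
opens this window; the plain linked list, the pthread mutex and the
function names below are stand-ins for illustration only, not the ext4
code:)

#include <pthread.h>
#include <stdlib.h>

/* Toy stand-ins for the group's PA list and its lock. */
struct pa {
	struct pa *next;
};

static struct pa *group_list;		/* like grp->bb_prealloc_list */
static pthread_mutex_t group_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Like the 1st thread: splice everything onto a local list under the
 * lock, drop the lock, and only then free the entries.
 */
static int discard(void)
{
	struct pa *local, *p;
	int freed = 0;

	pthread_mutex_lock(&group_lock);
	local = group_list;
	group_list = NULL;		/* group list now looks empty */
	pthread_mutex_unlock(&group_lock);

	/*
	 * Window: a 2nd thread running discard() here sees an empty
	 * list and reports 0 freed, even though frees are imminent.
	 */
	while (local) {
		p = local;
		local = local->next;
		free(p);
		freed++;
	}
	return freed;
}

int main(void)
{
	int i;

	for (i = 0; i < 4; i++) {	/* seed a few fake PAs */
		struct pa *p = malloc(sizeof(*p));
		p->next = group_list;
		group_list = p;
	}
	return discard() ? 0 : 1;
}
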
>> This patch fixes the race by introducing two paths (fastpath and
>> slowpath) in ext4_mb_discard_preallocations(). We first try the
>> fastpath: if a group's PA list is empty, then instead of waiting on
>> that group's lock we move on to discard the next group's PAs. This
>> helps keep the discarding of multiple groups' PA lists parallel.
>> If at the end a process was still not able to free any blocks, we
>> retry freeing every group's PA list while holding the group lock
>> (the slowpath). If the PA list is empty in that pass, we return
>> grp->bb_free, which tells us whether the group has any free blocks
>> left to make forward progress with.
>>
>> Suggested-by: Aneesh Kumar K.V <aneesh.kumar@...ux.ibm.com>
>> Signed-off-by: Ritesh Harjani <riteshh@...ux.ibm.com>
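(For context, a rough userspace sketch of the caller-side retry that
this return value feeds into; it is only loosely modelled on the loop
in ext4_mb_new_blocks(), and try_allocate()/discard_preallocations()
below are hypothetical stand-ins, not ext4 APIs:)

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct super_block;			/* opaque stand-in */

/* Hypothetical stubs for the real allocation and discard paths. */
static bool try_allocate(struct super_block *sb, int needed)
{
	(void)sb; (void)needed;
	return false;			/* pretend the allocation failed */
}

static int discard_preallocations(struct super_block *sb, int needed)
{
	(void)sb; (void)needed;
	return 0;			/* pretend nothing could be freed */
}

static int alloc_with_retry(struct super_block *sb, int needed)
{
	int freed;

repeat:
	if (try_allocate(sb, needed))
		return 0;

	/*
	 * If discarding reported progress (PAs released, or a non-zero
	 * grp->bb_free from the slowpath), retry the allocation instead
	 * of failing right away.
	 */
	freed = discard_preallocations(sb, needed);
	if (freed)
		goto repeat;

	return -ENOSPC;			/* nothing could be freed anywhere */
}

int main(void)
{
	/* prints -28 (-ENOSPC on Linux), since the stubs never free anything */
	printf("%d\n", alloc_with_retry(NULL, 8));
	return 0;
}
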
> 
> Ritesh, do you still want to push this patch as is or do you plan to change
> it after a discussion on Thursday?

I will need some more time on this. I will get back to you once I have
studied how your suggested approach works out.

-ritesh


> 
>> @@ -3967,9 +3986,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   		goto repeat;
>>   	}
>>   
>> -	/* found anything to free? */
>> +	/*
>> +	 * If this list is empty, then return grp->bb_free, as someone
>> +	 * else may have freed the PAs and updated grp->bb_free.
>> +	 */
>>   	if (list_empty(&list)) {
>>   		BUG_ON(free != 0);
>> +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
>> +			 group, grp->bb_free);
>> +		free = grp->bb_free;
>>   		goto out;
>>   	}
> 
> I'm still somewhat concerned about the forward progress guarantee here...
> If you're convinced the allocation from goal is the only possibility of
> lockup and that logic can be removed, then please remove it and then write a
> good comment why lockup is not possible due to this.
> 
>> @@ -4464,17 +4492,39 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>>   	return 0;
>>   }
>>   
>> +/*
>> + * ext4_mb_discard_preallocations: This function loops over each group's prealloc
>> + * list and tries to free it. More than one process may call this function in
>> + * parallel. That's why we initially take a fastpath approach in which we first
>> + * check if grp->bb_prealloc_list is empty; that could mean someone else has
>> + * already removed all of its PAs and added them to its own local list. So we
>> + * quickly return from there and try to discard the next group's PAs. This way
>> + * we try to parallelize discarding of multiple groups' PAs. But if a process
>> + * is unlucky enough to not free any group's PAs, then we retry with the slow
>> + * path, which guarantees that either some PAs will be freed or we will get
>> + * grp->bb_free blocks (a non-zero grp->bb_free guarantees forward progress
>> + * in ext4_mb_new_blocks())
>> + */
>>   static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
>>   {
>>   	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
>>   	int ret;
>>   	int freed = 0;
>> +	bool fastpath = true;
>> +	int tmp_needed;
>>   
>> -	trace_ext4_mb_discard_preallocations(sb, needed);
>> -	for (i = 0; i < ngroups && needed > 0; i++) {
>> -		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
>> +repeat:
>> +	tmp_needed = needed;
>> +	trace_ext4_mb_discard_preallocations(sb, tmp_needed);
>> +	for (i = 0; i < ngroups && tmp_needed > 0; i++) {
>> +		ret = ext4_mb_discard_group_preallocations(sb, i, tmp_needed,
>> +							   fastpath);
>>   		freed += ret;
>> -		needed -= ret;
>> +		tmp_needed -= ret;
>> +	}
> 
> Why do you need 'tmp_needed'? When freed is 0, tmp_needed == needed, right?
> 
>> +	if (!freed && fastpath) {
>> +		fastpath = false;
>> +		goto repeat;
>>   	}
> 
> 								Honza
> 
