[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20200408170647.31019A4040@d06av23.portsmouth.uk.ibm.com>
Date: Wed, 8 Apr 2020 22:36:44 +0530
From: Ritesh Harjani <riteshh@...ux.ibm.com>
To: linux-ext4@...r.kernel.org
Cc: tytso@....edu, jack@...e.cz, adilger.kernel@...ger.ca,
linux-fsdevel@...r.kernel.org,
"Aneesh Kumar K . V" <aneesh.kumar@...ux.ibm.com>,
sandeen@...deen.net
Subject: Re: [RFC 0/1] ext4: Fix mballoc race in freeing up group
preallocations
Sorry about the typo.
ext4_mb_group_discard_preallocations() should be
ext4_mb_discard_group_preallocations()
-ritesh
On 4/8/20 10:24 PM, Ritesh Harjani wrote:
> Hello All,
>
> There seems to be a below race with current
> ext4_mb_group_discard_preallocations() function. This could be fairly
> easily reproduced on a system with only 1 group. But internally this was even
> reported with more than 1 group. We could reproduce this on both 64K and 4K
> blocksize filesystem.
>
> This is a RFC patch sent out for reviews, feedback and/or any comments.
> Below mail provide all the necessary details.
>
> Test setup
> ==========
> 1. It's a multithreaded test case where each thread is trying to create a
> file using open() -> ftruncate().
> 2. Then we are doing mmap of this file for filesize bytes.
> 3. Then start writing sequentially byte by byte for full filesize.
>
> Details for creating it on loop device:-
> ======================================
> 1. truncate -s 240M filefs (easier on a smaller filesystem)
> 2. losetup /dev/loop0 filefs
> 3. mkfs.ext4 -F -b 65536 /dev/loop0
> 4. mkdir mnt
> 5. mount -t ext4 /dev/loop0 mnt/
> 6. cd mnt/
> 7. Start running the test case mentioned above in above "Test setup".
> (for our test we were keeping no. of threads to ~70 and filling the available
> filesystem space (df -h) to around ~80%/70%. Based on that each filesize is
> calculated).
> 8. cd .. (once the test finishes)
> 9. umount mnt
> 10. Go to step 3.
>
> Test (test-ext4-mballoc.c) file and script which does the
> unmount/mount and run the ./a.out is mentioned at [1], [2].
>
>
> Analysis:-
> ==========
>
> It seems below race could be occurring
> P1 P2
> ext4_mb_new_blocks() |
> | ext4_mb_new_blocks()
> |
> ext4_mb_group_discard_preallocations() |
> | ext4_mb_group_discard_preallocations()
> if (list_empty(&grp->bb_prealloc_list)
> return 0; |
> | |
> ext4_lock_group() |
> list_for_each_entry_safe() { |
> <..> |
> list_del(&pa->pa_group_list); |
> list_add(&pa->u.pa_tmp_list, &list) |
> } |
> |
> processing-local-list() if(list_empty(&grp->bb_prealloc_list)
> | return 0
> <...>
> ext4_unlock_group()
>
> What we see here is that, there are multiple threads which are trying to allocate.
> But since there is not enough space, they try to discard the group preallocations.
> (will be easy to understand if we only consider group 0, though it could
> be reproduced with multiple groups as well).
> Now while more than 1 thread tries to free up the group preallocations, there
> could be 1 thread (P2) which sees that the bb_prealloc_list is already
> empty and will assume that there is nothing to free from here. Hence return 0.
> Now consider this happens with thread P2 for all other groups as well (where some other
> thread came in early and freed up the group preallocations). At that point,
> P2 sees that the total freed blocks returned by ext4_mb_discard_preallocations()
> back to ext4_mb_new_blocks() is 0 and hence it does not retry the allocation,
> instead it returns -ENOSPC error.
>
> This causes SIGBUS to the application who was doing an mmap write. Once
> the application crashes we could still see that the filesystem available space
> is more than ~70-80% (worst case scenario). So ideally P2 should have waited
> for P1 to get over and should have checked how much P1 could free up.
>
>
> Solution (based on the understanding of the mballoc code)
> =========================================================
>
> We think that it is best to check if there is anything to be freed
> within ext4_group_lock(). i.e. to check if the bb_prealloc_list is empty.
> This patch attempts to fix this race by checking if nothing could be collected
> in the local list. This could mean that someone else might have freed
> all of this group PAs for us. So simply return group->bb_free which
> should also give us an upper bound on the total available space for
> allocation in this group.
>
> We need not worry about the fast path of whether the list is empty without
> taking ext4_group_lock(), since we are anyway in the slow path where the
> ext4_mb_regular_allocator() failed and hence we are now desperately trying
> to discard all the group PAs to free up some space for allocation.
>
>
> Please correct if any of below understanding is incorrect:-
> ==========================================================
> 1. grp->bb_free is the available number of free blocks in that group for
> allocation by anyone.
> 2. If grp->bb_free is non-zero and we call ext4_mb_regular_allocator(ac),
> then it will always return ac->ac_status == AC_STATUS_FOUND
> (and it could even allocate and return less than the requested no. of blocks).
> 3. There shouldn't be any infinte loop in ext4_mb_new_blocks() after we
> return grp->bb_free in this patch.
> (i.e. between ext4_mb_regular_allocator() and ext4_mb_discard_preallocations())
> It could only happen if ext4_mb_regular_allocator cannot make any forward
> progress even if grp->bb_free is non-zero.
> But IIUC, that won't happen. Please correct here.
>
> Tests run:-
> ==========
> For now I have only done unit testing with the shared test code [1] [2].
> Wanted to post this RFC for review comments/discussion.
>
> Resources:
> ==========
> [1] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test-ext4-mballoc.c
> [2] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test_mballoc.sh
>
> Ritesh Harjani (1):
> ext4: Fix race in ext4_mb_discard_group_preallocations()
>
> fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++-------
> 1 file changed, 24 insertions(+), 7 deletions(-)
>
Powered by blists - more mailing lists